Rapid RAG Prototyping with ellmer and duckdb in R

library(httr2)
library(ellmer)
library(duckdb)
library(tibble)
One of the standout features of R is its ability to rapidly prototype algorithms and applications. In this blog post, I’ll show how the recently released ellmer package and DuckDB can be combined to build a large language model (LLM) enhanced with domain-specific information.
A common challenge with LLMs is their inability to reference specific, up-to-date information beyond their training data. This is where retrieval-augmented generation (RAG) comes into play. RAG systems bridge this gap by integrating LLMs with external knowledge bases, enabling them to generate more accurate and contextually grounded responses.
The ellmer package in R simplifies working with LLMs by offering a seamless interface to multiple providers. It comes with built-in features like tool calling and structured data extraction, making it easy to extend LLM capabilities. On the other hand, DuckDB is a high-performance, embedded analytics database that runs directly within your application. Known for its ability to process large datasets efficiently, DuckDB can query diverse file formats and, with its vector search extension, becomes an excellent lightweight tool for implementing retrieval-based features.
By combining ellmer and DuckDB, you get a powerful stack for quickly prototyping LLM-powered applications. Whether you’re building a proof-of-concept or laying the groundwork for a more complex system, this setup offers the flexibility and performance to bring your ideas to life—fast.
Response without Augmentation
Throughout this post, I’m using OpenAI’s API. The integration is handled seamlessly through ellmer, which provides a clean interface for interacting with various LLM providers. As an example, we start by asking Chat what it knows about me:
chat <- chat_openai(model = "gpt-4o-mini")
chat$chat("Who is Christoph Scheuch?")
As of my last update in October 2021, Christoph Scheuch might not be widely
recognized in mainstream sources or might not have significant public presence,
as information about individuals can vary widely depending on their field of
expertise, geographic region, and the context in which they are mentioned. If
he's a notable figure in a specific niche or has gained prominence after my
last update, I wouldn't have information on that.
Please provide more context or specify which Christoph Scheuch you are
referring to, and I'll do my best to assist you!
Too bad, I’m not a widely recognized public figure, so Chat cannot say anything about me. This example perfectly illustrates a common limitation of LLMs - while they excel at general knowledge and patterns learned during training, they lack access to specific, current, or private information. This is where RAG becomes invaluable. By augmenting the model’s capabilities with our own knowledge base, we can help it provide accurate, contextual responses about topics it wasn’t originally trained on.
Store Embeddings
Embeddings are numerical representations of text that capture semantic meaning in a way that computers can process. They convert words and sentences into high-dimensional vectors where similar meanings result in similar vectors. This mathematical representation is crucial for implementing efficient information retrieval in RAG systems.
I’m using the text-embedding-3-small model that is optimized for latency and costs. This model returns a vector of length 1536, striking a balance between computational efficiency and semantic representation quality. While larger models might offer marginally better performance, they often come with increased latency and cost. Feel free to experiment with other models based on your specific needs.
get_embedding <- function(
  text,
  model = "text-embedding-3-small",
  api_key = Sys.getenv("OPENAI_API_KEY")
) {
  base_url <- "https://api.openai.com/v1"
  req <- request(base_url)
  req <- req_url_path_append(req, "/embeddings")
  req <- req_auth_bearer_token(req, api_key)
  req <- req_body_json(req, list(
    input = as.list(text),
    model = model
  ))
  resp <- req_perform(req)
  json <- resp_body_json(resp)

  embedding <- as.vector(unlist(json$data[[1]]$embedding))

  embedding
}
Here is an example that demonstrates how a simple question gets transformed into its vector representation. For brevity, I just print the first 5 numbers:
embedding <- get_embedding("Who is Christoph Scheuch?")
embedding[1:5]
[1] -0.015694630 -0.026622834 -0.008347419 0.025807848 -0.036254470
These numbers obviously don’t mean anything to us humans. But through vector similarity, they are tremendously useful to find relevant content. When we later search for information, we’ll compare these vector representations to find semantically similar content, rather than relying on simple keyword matching. This allows our RAG system to understand and retrieve contextually relevant information even when the exact words don’t match.
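To build some intuition for what “similar vectors” means, here is a small, purely illustrative sketch that compares two embeddings directly in R. The helper cosine_similarity() and the example sentences are my own additions for demonstration, not part of any package:

# Cosine similarity between two numeric vectors
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

embedding_question <- get_embedding("Who is Christoph Scheuch?")
embedding_related <- get_embedding("Christoph Scheuch is a data science consultant.")
embedding_unrelated <- get_embedding("The weather in Vienna is rainy today.")

# The related sentence should score noticeably higher than the unrelated one
cosine_similarity(embedding_question, embedding_related)
cosine_similarity(embedding_question, embedding_unrelated)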
While there are many specialized vector databases available, DuckDB offers a lightweight yet powerful solution for implementing vector similarity search through its extensions system. The vss extension enables efficient similarity searches directly within DuckDB, making it an excellent choice for prototyping and smaller-scale applications.
You can create a local database and install and load the extension as follows:
con <- dbConnect(duckdb(), "database.duckdb")
dbSendQuery(con, "INSTALL vss;")
dbSendQuery(con, "LOAD vss;")
Next, we’ll create the foundation of our knowledge base - a table called docs that will store both the original text and its vector representation. The table schema hence reflects the two components of our RAG system.
dbSendQuery(con, "DROP TABLE IF EXISTS docs;")
dbSendQuery(con, "CREATE TABLE docs (text TEXT, embedding FLOAT[1536]);")
The array size of 1536 matches the dimensionality of our chosen embedding model, ensuring compatibility between the embeddings we generate and our storage solution.
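As a quick sanity check, you can verify that the vectors returned by get_embedding() match the column definition (this assumes your OPENAI_API_KEY is set):

# Should return 1536, matching the FLOAT[1536] column in the docs table
length(get_embedding("test"))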
Next, we’ll create a set of documents with varying degrees of relevance to our original question. Let’s start with a comprehensive document about me:
doc_1 <- c(
  "Who is Christoph Scheuch? Christoph Scheuch is an independent BI & Data Science consultant specializing in financial topics. He provides services and trainings related to data infrastructure, data analysis, machine learning, and general data science topics. Previously, he was the Head of AI, Director of Product, and Head of BI & Data Science at the social trading platform wikifolio.com. He is also the co-creator and maintainer of the open-source project Tidy Finance. In his free time, he occasionally designs shirts and mugs, available under the Tidy Swag brand."
)
Now we need a way to store these documents and their embeddings in our DuckDB database. Due to current limitations in the duckdb R package (see the issue here), we need to construct our SQL query manually. While this approach isn’t ideal for production systems, it serves our prototyping purposes well:
store_embedding <- function(text) {
  embedding <- get_embedding(text)
  embedding_sql <- paste0("array_value(", paste(embedding, collapse = ", "), ")")
  query <- sprintf("INSERT INTO docs VALUES ('%s', %s);", text, embedding_sql)
  result <- dbExecute(con, query)
}

store_embedding(doc_1)
To better understand the retrieval mechanism, let’s add documents with varying levels of relevance. First, a shorter, more focused document that contains only key information:
doc_2 <- c(
  "Christoph Scheuch is an independent BI & Data Science consultant specializing in financial topics"
)

store_embedding(doc_2)
Finally, we’ll add a control document that should be semantically distant from our query despite some surface-level similarities. I asked ChatGPT to come up with complete nonsense about a person called “Christian Schuch”. Hopefully, this document is not relevant to the question about who I am.
doc_3 <- c(
  "Christian Schuch is a renowned intergalactic cartographer, best known for mapping the uncharted regions of the Andromeda galaxy using only a compass, a kazoo, and an uncanny sense of direction. In 2017, he won the prestigious “Golden Platypus Award” for his groundbreaking research on the migratory patterns of space jellyfish. When he’s not busy decoding ancient alien snack recipes, Christian enjoys competitive yodeling in zero gravity and has a side hustle crafting bespoke hats for sentient cacti. His latest project involves teaching quantum physics to squirrels, aiming to bridge the gap between rodent intelligence and parallel universes."
)

store_embedding(doc_3)
Now that we have added some documents to the database, it is time to create an index. The index helps us to quickly retrieve relevant documents using the vector search feature of DuckDB. Without an index, finding similar vectors would require comparing our query vector against every single document vector - a process that becomes prohibitively slow as your document collection grows.
If you want to reuse the database in an app or share it with others, you need to persist the index. You can do so by enabling the currently experimental persistence feature (learn more here). Also, I’m creating an index using cosine similarity distance because it’s particularly well-suited for comparing semantic similarity between text embeddings. Cosine similarity measures the angle between vectors while ignoring their magnitude, making it effective for comparing texts of different lengths.
dbSendQuery(con, "SET hnsw_enable_experimental_persistence = true;")
dbSendQuery(con, "DROP INDEX IF EXISTS hnsw_index;")
dbSendQuery(con, "CREATE INDEX hnsw_index ON docs USING HNSW (embedding) WITH (metric = 'cosine');")
The index type we’re using here is HNSW (Hierarchical Navigable Small World), which is a sophisticated algorithm for approximate nearest neighbor search. It creates a layered graph structure that allows for efficient navigation through the high-dimensional space of our embeddings. While it doesn’t guarantee finding the absolute nearest neighbors, it provides an excellent trade-off between search speed and accuracy, making it ideal for RAG applications.
Retrieve Relevant Text
Now comes the crucial part of our RAG system - retrieving the most semantically relevant information based on user input. Let’s start by getting an embedding for our question:
<- "Who is Christoph Scheuch?"
user_input <- get_embedding(user_input) embedding
To retrieve the relevant documents from the database, we use the similarity search feature. The query is designed to demonstrate how vector similarity works in practice. We set relatively permissive parameters - a minimum cosine-similarity of 0.1 and a limit of 3 documents - to see how our different test documents compare:
embedding_sql <- paste0("[", paste(embedding, collapse = ","), "]::FLOAT[1536]")
query <- sprintf(
  "SELECT array_cosine_similarity(embedding, %s) AS similarity, text FROM docs WHERE array_cosine_similarity(embedding, %s) >= 0.1 ORDER BY array_cosine_similarity(embedding, %s) DESC LIMIT 3;",
  embedding_sql, embedding_sql, embedding_sql
)

dbGetQuery(con, query) |>
  as_tibble()
# A tibble: 3 × 2
similarity text
<dbl> <chr>
1 0.716 Who is Christoph Scheuch? Christoph Scheuch is an independent BI &…
2 0.657 Christoph Scheuch is an independent BI & Data Science consultant s…
3 0.463 Christian Schuch is a renowned intergalactic cartographer, best kn…
The results reveal several interesting insights about our vector similarity search:
- The comprehensive biography (doc_1) shows the highest similarity, as expected.
- The shorter professional description (doc_2) also shows strong similarity.
- The fictional story about “Christian Schuch” (doc_3) shows much lower similarity despite containing a similar name.
This demonstrates that our embeddings are capturing semantic meaning rather than just matching keywords. For a production RAG system, we would want to be more selective. A typical approach is to both increase the similarity threshold (0.7 is typically a sensible value to start with) and limit the number of retrieved items (here I’m using only the most relevant document, so limit 1). These restrictions help ensure that only highly relevant information is used to augment the LLM’s response.
Let’s encapsulate the retrieval logic in a dedicated function. This abstraction will make it easier to experiment with different parameters and integrate the retrieval mechanism into larger applications.
get_relevant_text <- function(text, min_similarity = 0.7, max_n = 1) {
  embedding <- get_embedding(text)
  embedding_sql <- paste0("[", paste(embedding, collapse = ","), "]::FLOAT[1536]")
  query <- paste(
    "SELECT text",
    "FROM docs",
    "WHERE array_cosine_similarity(embedding, ", embedding_sql, ") >= ", min_similarity,
    "ORDER BY array_cosine_similarity(embedding, ", embedding_sql, ") DESC",
    "LIMIT ", max_n, ";"
  )
  result <- dbGetQuery(con, query)

  result$text
}
The function takes three parameters:
- text: The input text to find relevant documents for.
- min_similarity: The minimum cosine similarity threshold (defaulting to 0.7).
- max_n: The maximum number of documents to retrieve (defaulting to 1).
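For example, relaxing the defaults retrieves more loosely related documents, which is handy when experimenting with the threshold (the values below are purely illustrative):

# Default: only the single best match above 0.7 similarity
get_relevant_text("Who is Christoph Scheuch?")

# Looser settings: up to three documents above 0.4 similarity
get_relevant_text("Who is Christoph Scheuch?", min_similarity = 0.4, max_n = 3)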
Augment Response Generation
Putting things together, we now can augment the original input by including relevant content. The augmentation process involves combining our retrieved context with clear instructions for the LLM:
<- "Who is Christoph Scheuch?"
user_input <- get_relevant_text(user_input)
relevant_text <- paste(
user_input_augmented "Use the information below to answer the subsequent question.",
"If the answer cannot be found, write 'I don't know.'",
"Info: ", relevant_text,
"Question: ", user_input
)
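If you want to see exactly what the model receives, you can print the augmented prompt before sending it off:

# Inspect the combined instructions, retrieved context, and question
cat(user_input_augmented)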
Now let’s pass the augmented query to Chat and see whether it now gets the information about me right:
chat$chat(user_input_augmented)
Christoph Scheuch is an independent BI & Data Science consultant specializing
in financial topics. He provides services and trainings related to data
infrastructure, data analysis, machine learning, and general data science
topics. Previously, he held positions as the Head of AI, Director of Product,
and Head of BI & Data Science at the social trading platform wikifolio.com. He
is also the co-creator and maintainer of the open-source project Tidy Finance.
In his free time, he occasionally designs shirts and mugs, which are available
under the Tidy Swag brand.
Lo and behold, Chat now gets it right! We’ve successfully transformed our LLM from having no knowledge about a specific topic to providing accurate, contextualized responses based on our provided information.
This example demonstrates the fundamental building blocks of a RAG system in R, but there are many ways to enhance and extend this framework. What’s your use case? How would you extend this framework?
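As one possible starting point, the whole pipeline can be wrapped into a single helper that retrieves context, augments the prompt, and queries the model in one call. The sketch below builds on the objects defined above (chat and get_relevant_text()); the name answer_with_rag() and its fallback behavior are my own choices for illustration, not part of ellmer:

answer_with_rag <- function(question, min_similarity = 0.7, max_n = 1) {
  relevant_text <- get_relevant_text(question, min_similarity, max_n)

  # Fall back to the plain question if no document clears the threshold
  if (length(relevant_text) == 0) {
    return(chat$chat(question))
  }

  prompt <- paste(
    "Use the information below to answer the subsequent question.",
    "If the answer cannot be found, write 'I don't know.'",
    "Info: ", paste(relevant_text, collapse = "\n"),
    "Question: ", question
  )

  chat$chat(prompt)
}

answer_with_rag("Who is Christoph Scheuch?")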