Ollama is a tool that allows you to easily deploy and manage open source LLMs. You can even run a large language model locally and augment it using the Retrieval Augmented Generation (RAG) technique.
In this article we will show you how to use Ollama in LangStream to build a chatbot, combining the power of Ollama models with the RAG pattern.
In order to build a Generative AI application you need a Large Language Model (LLM), and you have two main options:
- use an LLM as a service, like OpenAI, Google Vertex, AWS Bedrock, etc.
- run an open source model, like Llama2 or other models from Hugging Face
Running an open source model is great for local development and testing. For production, you may want to use an LLM as a service, especially since LLM services now offer open source models.
Starting an Ollama model
You can download the Ollama runtime from the Ollama website and follow the instructions to run it locally.
Running a model locally requires a lot of resources, depending on the size of the model.
Let’s start with the Llama2 13b model, which is small enough to run on a laptop with 16GB of RAM.
Once you have Ollama installed you can start the LLM with this command:
ollama run llama2:13b
You can start chatting with the model using the CLI tools, but you won’t be able to use RAG to augment the knowledge of the model.
The Ollama runtime exposes an HTTP API that allows you to interact with the model programmatically. This is what we use to integrate it with LangStream.
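For example, here is a minimal sketch, in Python with the requests library, of calling the generate endpoint directly. It assumes Ollama is running on the default port 11434 and that the llama2:13b model has already been pulled:

# Minimal sketch of calling the Ollama HTTP API directly (the same API LangStream talks to).
# Assumes Ollama is running locally on the default port and llama2:13b is available.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2:13b",
        "prompt": "What is Retrieval Augmented Generation?",
        "stream": False,  # ask for a single JSON response instead of a stream of chunks
    },
    timeout=300,
)
print(response.json()["response"])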
Connecting to the model
In order to connect to the LLM you have to provide the configuration for an ollama-configuration resource in configuration.yaml:
resources:
  - type: "ollama-configuration"
    name: "ollama"
    id: "ollama"
    configuration:
      url: "http://host.docker.internal:11434"
In this example we use http://host.docker.internal:11434 as the URL because, when you develop with LangStream, your application usually runs in a Docker container, and the special hostname host.docker.internal lets you connect to the host machine. Port 11434 is the default port used by the Ollama runtime.
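Before wiring it into LangStream, you can quickly verify that the runtime is reachable on that port. This is just a sanity-check sketch, not part of LangStream: it lists the models your local Ollama instance has pulled, via the tags endpoint.

# Sanity check (illustrative only): list the models the local Ollama runtime has available,
# to confirm it is reachable on port 11434.
import requests

tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
print([model["name"] for model in tags.get("models", [])])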
In LangStream it is very easy to swap the LLM: it is usually only a matter of changing the declaration of the LLM resource in configuration.yaml. So if you are already trying LangStream with OpenAI, you can simply add the Ollama configuration and start using it.
Please note that the prompt may need to be changed from what you are using with OpenAI. The results depend on how the model has been trained, and some models (especially the Llama2 models) tend to be chattier than OpenAI's models in their responses and may require different prompt engineering.
Using the Chat Completion API
Now, in order to query the LLM, it is enough to use the ai-chat-completions agent like this:
- name: "ai-chat-completions"
type: "ai-chat-completions"
configuration:
ai-service: "ollama"
model: "${secrets.ollama.model}"
completion-field: "value.answer"
log-field: "value.prompt"
stream-to-topic: "answers-topic"
stream-response-completion-field: "value"
min-chunks-per-message: 20
messages:
- content: |
I have this list of documents that may help you in answering to my question below.
Documents:
{{# value.related_documents}}
{{ text}}
{{/ value.related_documents}}
Please try to leverage the content above in your answer as much as possible.
Take into consideration that I am always asking questions about the LangStream project.
Do not provide information that is not related to the LangStream project.
This is my question:
{{ value.question}}
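The {{# value.related_documents}} / {{/ value.related_documents}} block in the messages template is a Mustache-style section that iterates over the retrieved documents, and {{ value.question}} interpolates the user question. As a rough illustration of how such a template expands, here is a sketch that renders a shortened version of it with the standalone chevron Python library; this is only a way to visualize the result, not how LangStream processes prompts internally.

# Illustration only: render a Mustache-style template the way the prompt above expands.
# The chevron library is used here purely for the demonstration.
import chevron

template = """Documents:
{{# value.related_documents}}
{{ text}}
{{/ value.related_documents}}
This is my question:
{{ value.question}}
"""

record = {
    "value": {
        "related_documents": [
            {"text": "LangStream lets you build event-driven Gen AI applications."},
            {"text": "Agents are composed into pipelines with YAML."},
        ],
        "question": "What is LangStream?",
    }
}

print(chevron.render(template, record))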
Implementing RAG
RAG is all about adding relevant documents to the prompt that you send to the LLM. This additional context is used to improve the quality of the responses.
In order to feed more context to the LLM in the prompt, you populate value.related_documents with the results of a query against a vector database.
In the case of a chatbot you usually start from a question from the user, then you perform a similarity search on a vector database and load more context about the user and the current conversation. Finally you put all the relevant data into the prompt and send it to the LLM.
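To make that flow concrete before looking at the pipeline, here is a conceptual sketch of those steps in plain Python: embed the question, rank a few documents by cosine similarity, build the augmented prompt, and send it to the local Ollama model. The embedding model name and the tiny in-memory document list are illustrative assumptions; the LangStream pipeline below performs the same steps declaratively against a real vector database.

# Conceptual sketch of a RAG flow (illustrative assumptions: a local sentence-transformers
# embedding model and an in-memory list of documents instead of a vector database).
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example local embedding model

documents = [
    "LangStream is a framework for building event-driven Gen AI applications.",
    "Agents such as compute-ai-embeddings and ai-chat-completions are composed in YAML.",
    "Ollama runs open source LLMs locally and exposes an HTTP API on port 11434.",
]
doc_vectors = embedder.encode(documents)

def answer(question: str) -> str:
    # 1. Embed the question
    question_vector = embedder.encode([question])[0]
    # 2. Similarity search: rank the documents by cosine similarity
    scores = doc_vectors @ question_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(question_vector)
    )
    related = [documents[i] for i in np.argsort(scores)[::-1][:2]]
    # 3. Put the retrieved context and the question into the prompt
    prompt = "Documents:\n" + "\n".join(related) + "\n\nThis is my question:\n" + question
    # 4. Send the augmented prompt to the local LLM
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2:13b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return response.json()["response"]

print(answer("What does LangStream do?"))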
This is a typical RAG pipeline:
topics:
  - name: "questions-topic"
    creation-mode: create-if-not-exists
  - name: "answers-topic"
    creation-mode: create-if-not-exists
  - name: "log-topic"
    creation-mode: create-if-not-exists
errors:
  on-failure: "skip"
pipeline:
  - name: "convert-to-structure"
    type: "document-to-json"
    input: "questions-topic"
    configuration:
      text-field: "question"
  - name: "compute-embeddings"
    type: "compute-ai-embeddings"
    configuration:
      ai-service: "huggingface"
      model: "${secrets.hugging-face.embeddings-model}" # This is the id of the model
      model-url: "${secrets.hugging-face.embeddings-model-url}" # This is the URL of the repository containing the model
      embeddings-field: "value.question_embeddings"
      text: "{{ value.question }}"
      flush-interval: 0
  - name: "lookup-related-documents"
    type: "query-vector-db"
    configuration:
      datasource: "JdbcDatasource"
      query: "SELECT text,embeddings_vector FROM documents ORDER BY cosine_similarity(embeddings_vector, CAST(? as FLOAT ARRAY)) DESC LIMIT 20"
      fields:
        - "value.question_embeddings"
      output-field: "value.related_documents"
  - name: "re-rank documents with MMR"
    type: "re-rank"
    configuration:
      max: 10 # keep only the top 10 documents, because we have a hard limit on the prompt size
      field: "value.related_documents"
      query-text: "value.question"
      query-embeddings: "value.question_embeddings"
      output-field: "value.related_documents"
      text-field: "record.text"
      embeddings-field: "record.embeddings_vector"
      algorithm: "MMR"
      lambda: 0.5
      k1: 1.2
      b: 0.75
  - name: "log documents"
    type: "log-event"
  - name: "ai-chat-completions"
    type: "ai-chat-completions"
    configuration:
      ai-service: "ollama"
      model: "${secrets.ollama.model}"
      # on the log-topic we add a field with the answer
      completion-field: "value.answer"
      # we are also logging the prompt we sent to the LLM
      log-field: "value.prompt"
      # here we configure the streaming behavior
      # as soon as the LLM answers with a chunk we send it to the answers-topic
      stream-to-topic: "answers-topic"
      # on the streaming answer we send the answer as a whole message
      # the 'value' syntax is used to refer to the whole value of the message
      stream-response-completion-field: "value"
      # we want to stream the answer in batches of up to 20 chunks:
      # to reduce latency, the agent sends the first message with 1 chunk, then 2 chunks...
      # up to the min-chunks-per-message value; after that it sends bigger messages
      # to reduce the overhead of each message on the topic
      min-chunks-per-message: 20
      messages:
        - content: |
            I have this list of documents that may help you in answering to my question below.
            Documents:
            {{# value.related_documents}}
            {{ text}}
            {{/ value.related_documents}}
            Please try to leverage the content above in your answer as much as possible.
            Take into consideration that I am always asking questions about the LangStream project.
            Do not provide information that is not related to the LangStream project.
            This is my question:
            {{ value.question}}
With this pipeline definition you can experiment with the Llama2 13b model and augment its knowledge with your own data, right from your laptop. The pipeline also uses an embedding model downloaded from Hugging Face that runs locally. The entire pipeline can run on your laptop while flying, without having to pay outrageous fees for in-flight WiFi (or to the LLM service providers).
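If you want to exercise the pipeline end to end from your own code, one option is to produce a question to questions-topic and read the streamed chunks from answers-topic. The sketch below assumes the topics are backed by a Kafka broker reachable at localhost:9092; the actual broker address, and whether you interact with the topics directly or through LangStream's own tooling, depend on how you run the application, so treat this purely as an illustration.

# Illustration only: send a question and read the streamed answer chunks, assuming
# the application's topics live on a Kafka broker reachable at localhost:9092.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# questions-topic receives the raw question text; the document-to-json agent
# turns it into {"question": "..."} for the rest of the pipeline.
producer.send("questions-topic", b"How do I deploy a LangStream application?")
producer.flush()

consumer = KafkaConsumer(
    "answers-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="latest",
)
for message in consumer:
    print(message.value.decode("utf-8"), end="", flush=True)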
Please refer to the LangStream documentation in order to learn the details about building a RAG pipeline.
Conclusion
Using open source LLMs is a great way to build AI applications, but it requires some effort to manage the models and the infrastructure. With LangStream and Ollama you can seamlessly develop and test open source models on your laptop and build sophisticated AI applications.
Please send us feedback in Slack or Linen. If you find a bug, please open a GitHub issue.