Create Your Own Knowledge Base with LLM

Yashod Perera
8 min read · Jun 7, 2024


LLMs are commonly used for day-to-day tasks by different types of people. But there are issues with these LLMs. One of the major concerns is hallucination, where the model answers confidently but the answer is totally incorrect. The main reason behind this is that LLMs are trained on large sets of data from the internet and different sources.

What if you could provide your own information for the LLM to refer to when answering?

Photo by Jonathan Kemper on Unsplash

In this project we will learn how to provide our own data to an LLM for reference. For this mini project I am using several PDFs as the input data and referring to them as the context. In a later blog, let’s discuss how we can plug in a proper vector database to improve performance.

Following are some technical terms that you should know when it comes to LLMs.

  • Context — The LLM refers to a context when answering a question. You can provide the context, or it will use its own.
  • Prompt Engineering — A prompt contains instructions on how to provide the answer. You can engineer the prompt so that the LLM can answer the question effectively.
  • Vector database — Vector databases are commonly used to store data for the LLM to refer to. If you want to save information in a vector database, you first need to embed the data (which converts the text to numbers) and store the resulting vectors. When you ask the LLM a question and a vector database is connected, it searches for the data that shares the same context as the question and uses it to answer, as the sketch after this list shows.
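
To make that concrete, here is a minimal sketch, assuming the llama3 embeddings used later in this post; the two sentences and the hand-rolled cosine similarity are just for illustration.

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="llama3")

# Embedding converts each sentence into a list of floats (a vector).
v1 = embeddings.embed_query("Choreo is an internal developer platform.")
v2 = embeddings.embed_query("Choreo simplifies deploying cloud-native services.")

# Cosine similarity: related sentences yield vectors pointing in similar directions.
dot = sum(a * b for a, b in zip(v1, v2))
norm = (sum(a * a for a in v1) ** 0.5) * (sum(b * b for b in v2) ** 0.5)
print(dot / norm)  # values closer to 1 mean more similar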

So let’s dive into the project. In this project I am using an open-source LLM for demonstration, which runs on my local machine, but if you need, you can simply use the OpenAI APIs to call the LLM.

Prerequisites

You need to install Ollama, which helps you run models locally on your machine.

Then you can run llama3, which is an open-source LLM, using ollama run llama3. Here we are using the llama3 8B model, which is 4.7GB in size; if you need, you can use the 70B model (ollama run llama3:70b), which is around 40GB in size.

You can play with the running model as follows.
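
For example, a session might look like this (the model’s reply here is illustrative, not actual output):

ollama run llama3
>>> What is an LLM?
An LLM (Large Language Model) is a machine learning model trained on large
amounts of text to understand and generate natural language...
>>> /bye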

Please note that this project can be found in this GitHub repository (the link is also at the end of this post).

Architecture and Flows

The following shows the simple architecture and how the components are connected.

Following is the complete flow.

Step 1 — Load the Model and Create Embeddings

Please note that embeddings are specific to the model, so we need to create them with the same model that we are using.

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

MODEL = "llama3"
model = Ollama(model=MODEL)
# embedding is specific to the model
embeddings = OllamaEmbeddings(model=MODEL)
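
As a quick sanity check (a sketch; the output will vary), you can invoke the model directly and inspect the size of an embedding vector:

# Ask the model a question directly (no custom context yet).
print(model.invoke("What is Choreo?"))

# An embedding is a list of floats; its length depends on the model.
vector = embeddings.embed_query("hello")
print(len(vector))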

Step 2 — Load Data to In-memory Vector Store

I have created some PDFs using the Choreo home page, since I am going to ask questions related to Choreo in this mini project.

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import DocArrayInMemorySearch

# add the files you have
pdf_files = ["choreo.pdf", "another_document.pdf", "yet_another_document.pdf"]

pages = []
for pdf_file in pdf_files:
    loader = PyPDFLoader(pdf_file)
    # Load the PDF and split it into pages.
    data = loader.load_and_split()
    pages.extend(data)

# Create the vector store where you store your files as vectors.
vectorstore = DocArrayInMemorySearch.from_documents(pages, embedding=embeddings)
# The retriever will fetch the top 4 matching documents for a given query.
retriever = vectorstore.as_retriever()

  • Using DocArrayInMemorySearch.from_documents(pages, embedding=embeddings) will create an in-memory vector store from the provided pages with the provided embeddings. Think of embeddings as the way text is converted to numbers.
  • retriever will retrieve the top 4 similar documents when the user provides a question; see the sketch after this list for changing that number.
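
By default the retriever returns 4 documents. If you want a different number, as_retriever accepts search keyword arguments; a minimal sketch (retriever_top2 is just an illustrative name):

# Fetch the top 2 matching documents instead of the default 4.
retriever_top2 = vectorstore.as_retriever(search_kwargs={"k": 2})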

Example

retriever.invoke("choreo")

Results

[Document(page_content="Deploy Your First Service\nChoreo, an Internal Developer Platform (IDevP), simplifies the deployment,\nmonitoring, and management of your cloud-native services, allowing you to\nfocus on innovation and implementation.\nChoreo allows you to easily deploy services you've created in your preferred\nprogramming language in just a few steps.\nIn this guide, you will:\nUse a pre-implemented service that has resources to maintain a book list.\nBuild and deploy the service in Choreo using the Nodejs buildpack. It runs\non port 8080.\nTest the service.\nPrerequisites\n1. You must have a GitHub account with a repository that contains your\nservice implementation. To proceed with the steps in this guide, you can\nfork the Choreo sample book list service repository, which contains the\nsample for this guide.\n2. If you are signing in to the Choreo Console for the first time, create an\norganization as follows:\na. Go to https://console.choreo.dev/, and sign in using your Google,\nGitHub, or Microso\x00 account.\nb. Enter a unique organization name. For example, Stark Industries.\nc. Read and accept the privacy policy and terms of use.\nd. Click Create.\nThis creates the organization and opens the organization home page.\nType to start searchingSearch\n06/06/2024, 20:09 Deploy Your First Service -\nhttps://wso2.com/choreo/docs/quick-start-guides/deploy-your-first-service/ 1/6", metadata={'source': 'choreo.pdf', 'page': 0}),
Document(page_content="Field Value\nProject Display Name Book List Project\nName book-list-project\nProject Description My sample project\n4. Click Create. This creates the project and takes you to the project home\npage.\nStep 2: Create a service component\nLet's create a service component by following these steps:\n1. On the project home page, click Service under Create a Single\nComponent.\n2. Enter a unique name and a description for the service. For this guide, let's\nenter the following values:\nField Value\nName Book List\nDescription Gets the book list\n3. Go to the GitHub tab.\n4. To allow Choreo to connect to your GitHub account, click Authorize with\nGitHub. If you have not already connected your GitHub repository to\nChoreo, enter your GitHub credentials and select the repository you\ncreated in the prerequisites section to install the Choreo GitHub App.\nAlternatively, you can paste the Choreo sample Book List Service\nrepository URL in the Provide Repository URL field to connect to it\nwithout requiring authorization from the Choreo Apps GitHub application.\nHowever, authorizing the repository with the Choreo GitHub App is\nnecessary if you want to enable Auto Deploy for the component.\n06/06/2024, 20:09 Deploy Your First Service -\nhttps://wso2.com/choreo/docs/quick-start-guides/deploy-your-first-service/ 3/6", metadata={'source': 'choreo.pdf', 'page': 2}),
Document(page_content='1. In the Choreo Console le\x00 navigation menu, click Test and then click\nConsole.\n2. In the OpenAPI Console that opens, select Development from the\nenvironment drop-down list.\n3. In the Endpoint list, select Books REST Endpoint.\n4. Expand the GET /books method and click Try it out.\n5. Click Execute.\n6. Check the Server Response section.\nSimilarly, you can expand and try out the other methods.\nA\x00er you have successfully tested your service, you can now try out various\nother Choreo features such as managing, observing, DevOps, etc., similar to\nany other component type within Choreo.\n06/06/2024, 20:09 Deploy Your First Service -\nhttps://wso2.com/choreo/docs/quick-start-guides/deploy-your-first-service/ 6/6', metadata={'source': 'choreo.pdf', 'page': 5}),
Document(page_content="The Choreo GitHub App requires the following permissions:\nRead and write access to code and pull requests.\nRead access to issues and metadata.\nYou can revoke access if you do not want Choreo to have access to your GitHub account.\nHowever, write access is exclusively utilized for sending pull requests to a user\nrepository. Choreo will not directly push any changes to a repository.\n5. Enter the following information:\nField Description\nOrganization Your GitHub account\nRepository choreo-sample-book-list-service\nBranch main\n6. Select the NodeJS buildpack.\n7. Enter the following information.\nField Description\nNodeJS Project Directory /\nLanguage Version 20.x.x\n8. Click Create.\nYou have successfully created a Service component with the NodeJS\nbuildpack. Now let's build and deploy the service.\nStep 3: Build and deploy\nNow that the source repository is connected and Choreo has set up the\nendpoints based on the repository's configuration, it's time to proceed withNote\n06/06/2024, 20:09 Deploy Your First Service -\nhttps://wso2.com/choreo/docs/quick-start-guides/deploy-your-first-service/ 4/6", metadata={'source': 'choreo.pdf', 'page': 3})]

Step 3 — Construct the Prompt

Then we need to instruct the LLM on how it should answer when we ask a question. Here we simply say: answer based on the given context, and if the answer is not there, reply with "I don't know".

from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't
find the answer, just say "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
# Here we specify that we are passing context and question to the prompt
prompt.format(context="here is some context", question="Here is a question")
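
Printing the formatted prompt shows how the placeholders are filled in:

print(prompt.format(context="here is some context", question="Here is a question"))
# Answer the question based on the context below. If you can't
# find the answer, just say "I don't know".
#
# Context: here is some context
#
# Question: Here is a question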

Step 4 — Ask Question

Finally, the user can ask a question. When the user asks a question, the following happens:

  1. Get the top 4 matching documents for the question from the vector store
  2. Pass those documents as the context to the prompt
  3. Pass the filled prompt to the model to generate the answer

from operator import itemgetter

# itemgetter gets the question field's value and passes it to the retriever to fetch the context
agent = (
    {"question": itemgetter("question"), "context": itemgetter("question") | retriever}
    | prompt
    | model
)

agent.invoke({"question": "Please provide me step to create a service in choreo"})

It answers as follows.

"Based on the provided context, here are the steps to create a service in Choreo:\n\nStep 1: Create a project\n\n* Go to https://console.choreo.dev/ and sign in.\n* On the organization home page, click + Create Project.\n* Enter a display name, unique name, and description for the project.\n\nStep 2: Create a service component\n\n* On the project home page, click Service under Create a Single Component.\n* Enter a unique name and a description for the service. For this guide, let's enter:\n\t+ Name: Book List\n\t+ Description: Gets the book list\n* Go to the GitHub tab.\n* To allow Choreo to connect to your GitHub account, click Authorize with GitHub. If you have not already connected your GitHub repository to Choreo, enter your GitHub credentials and select the repository you created in the prerequisites section to install the Choreo GitHub App.\n\nNote: These steps are based on the provided context, so please refer to the actual guide for more detailed information if needed."

Hope you got a basic understanding of how to create your own knowledge base with an LLM. This approach is called RAG (Retrieval-Augmented Generation). There will be two upcoming blogs:

  1. Different types of prompt engineering, where we will learn about RAG, CoT, ReAct, etc.
  2. Creating a proper RAG with a vector database and providing a chat interface

If you have found this helpful, please hit that 👏 and share it on social media :).

Source Code — https://github.com/yashodgayashan/inmemory-vector-rag
