How to build your own LLM applications with RAG, prompt templates, and vector databases


LLMs are large language models which have the ability to take natural language inputs and provide a response. This includes generating code, writing an essay, answering questions, and much more. Recently, several advanced models have been launched, such as GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, GPT-4, and Gemini Pro.

If you are building an LLM application, below are the key categories your application might fit into.

Examples of Large Language Model applications

Large language models have wide applicability due to the general-purpose abilities they possess. There is a varying level of complexity in “successfully” building each type of application. The simplest applications (e.g. generate an email subject line) may require just an OpenAI API call, while a complex one such as “analyze the DNA sequence and generate an analysis” may require fine-tuning, data retrieval, embeddings, agents, chaining, and a lot of testing.

Below are some of the popular categories of LLM applications:

  • Content generation LLM applications: These are applications which require an input from the user, which may include a blob of text or a specific instruction, and the LLM generates content based on that input. A few examples of content applications are: draft a sales email, generate a subject line, write a blog post, fix the grammar in my text. These applications don’t necessarily require much of your internal data, or even recent information such as world events, and can be built by creating a simple prompt and calling LLM APIs (see the API call sketch after this list).

 

  • Code generation AI Assistants: There are LLM models which do a very good job at generating boilerplate code, or even generating code using knowledge from Stack Overflow, publicly available APIs, or developer documentation from different products. That said, if you are building an internal code generation application for your company, you’ll have to refine your LLM setup to specifically understand your internal documentation and code samples, and generate code which is specific to your internal services. Tools such as GitHub Copilot already do a good job with code generation; however, there is less control over what you can make them do. There are also fine-tuned LLMs which are specifically trained for code generation. Here are some examples of code generation with AI.

Example of AI Code Generation built with Bind AI.

  • LLM powered Information Extraction, E-commerce Search or Matching: LLMs do an excellent job of extracting entities from unstructured text (e.g. Data Extraction with AI, which extracts company or person names from a news article), matching text (e.g. job recommendation, identity resolution), and searching relevant entries from your catalog/index (e.g. searching e-commerce listings based on a natural language input). Not long ago, each of these tasks required significant ML engineering; now all of this can be done easily with LLMs.

 

  • Conversational AI Assistants: These are chatbots for support, sales, in-product pages, Slack bots, or Gmail bots. Successfully building these requires much more prompt engineering, access to information, examples, instructions, and most importantly, testing and optimization. Conversational experiences are very open-ended, which means there isn’t really a limitation on what the end user can enter, and your LLM application will have to deal with that ambiguity and have the knowledge to avoid users bouncing off or asking for a “real human” to chat with. You also wouldn’t want your code assistant to answer questions such as “What is Taylor Swift’s latest album?”, so you’ll need to put boundaries on what your bot should and should not do.
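For the simplest category above, the whole application can be a prompt plus an API call. Below is a minimal sketch of that pattern using the OpenAI Python SDK (v1-style client); the model name and prompt wording are illustrative assumptions, not recommendations.

```python
# Minimal "simple prompt + API call" pattern for a content generation application.
# Assumes the OpenAI Python SDK (v1 client) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative; swap in the model you've chosen
    messages=[
        {"role": "system", "content": "You write concise, compelling email subject lines."},
        {"role": "user", "content": "Write a subject line for a sales email about our new analytics dashboard."},
    ],
)
print(response.choices[0].message.content)
```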

Limitations of Large Language Models

  • Most LLM models are trained on data up to a certain point in time, after which the models don’t really have information on what’s happening in the world; they also do not have information about your specific internal data. As an example, most of the GPT models are trained on data until Sep 2021, which means if you ask a question about a news event which happened in 2023, the model by itself won’t know the factual answer, and will likely predict a sequence of words which most closely fits the user input (aka hallucination). The models also don’t have information about your private customer data, internal knowledge bases, or proprietary code, and can’t answer specific questions about those out of the box. If your use case requires this data, you’ll need other mechanisms to provide it. See cutoff dates for models here.
  • LLMs by themselves cannot take actions: It’s important to understand that LLMs are large language models which take inputs or instructions and return text to satisfy the requested criteria. An LLM is no Skynet. In order to create a Skynet, the LLM would need a way to reason or think, execute a plan, connect to internal/external systems or services, and execute sequential actions. Imagine you want to create an evil Skynet which can launch a rocket by itself. There are several things it will need to do: first, it will need to think and plan which rockets to launch, how to get access, and which programs to execute, then trigger those programs; if it has to hack the passwords, it will have to treat that as a specific action and execute it. It will be complex and beyond simple text-in/text-out. The good news is that it is indeed possible to build your own Skynet powered by LLMs; keep reading if you want to do that.

How to use Large Language Models to create your AI applications

There are a few different ways in which you can create LLM powered AI applications which can get the necessary information (e.g. most recent news, real-time stock quotes, internal data) and can execute a series of actions or programs. Let’s learn the key concepts which are essential for developing LLM applications.

Selecting the Large Language Model (LLM) best suited for your application

As of January 2024, there are several dozen LLM models publicly available either via APIs or open source. The top contenders are OpenAI GPT, Google Gemini, Llama-2, and Mistral. There are a few different options for how you want to approach your model strategy.

  1. LLM Models available via APIs: These are models such as OpenAI GPT-3.5 or GPT-4, which are available via easy-to-use APIs. You can include a prompt and your data and instantly get a response from the model. There are several powerful models available; however, one of the potential risks is exposing your proprietary data to a system outside of your controlled environment or private cloud. (List of models supported by Bind AI)
  2. Open source LLM Models: There are several powerful LLM models such as Meta Llama-2, Mistral, Bloom, and MPT which are openly available. Some models such as Llama-2 are allowed to be used for commercial purposes, with no limitations for most companies. Since these are openly available rather than hosted, there is no API readily accessible; you’ll need to download the model and install it on your own servers (a minimal local-inference sketch follows this list). Doing so requires significant GPU memory, and there is a cost involved in maintaining the system. The benefit is that you entirely control the model and data in your environment and servers, which matters especially if you are a corporation working in industries such as healthcare, defense, or financial services.
  3. Create your own LLM model and train it with your desired data: This approach can be very expensive, requires significant investment and skill, and will still end up with the same stale-data problem. Companies such as OpenAI and Google have spent several years iterating on their models and millions of dollars in compute costs to train them. Replicating those efforts is definitely an undertaking. Moreover, once you’ve trained the model, new information won’t be available, and you’ll need to continuously evolve your model’s information and capabilities.
  4. Fine-tune an existing LLM model: This is a cheaper option than creating your own model; however, it still has its limitations, and it’s typically used to train on specific examples or sets of instructions, or to reduce the amount of fixed tokens you are including in each prompt. It also has the same data staleness limitation as the option above. You could in theory continuously fine-tune your model with new information, but that might be expensive and unnecessary, as there are better ways to do it.
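For option 2, one common way to run an open-source model on your own servers is the Hugging Face transformers library; a minimal sketch is below. The model name is an illustrative assumption, and larger models need a GPU with sufficient memory.

```python
# Sketch of running an open-source model locally with Hugging Face transformers.
# Assumes transformers (and accelerate for device_map) are installed; the model
# name is illustrative and can be swapped for any open model you have access to.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    device_map="auto",  # place the model on available GPU(s), if any
)

output = generator(
    "Write a one-sentence summary of retrieval augmented generation.",
    max_new_tokens=80,
)
print(output[0]["generated_text"])
```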

Creating a Prompt Template (aka Prompt Engineering)

To use an LLM, you’ll first need to set up a prompt template for your application, which is a fixed set of instructions that you always include in each prompt you send to the LLM model. This is a bit different from the prompts you enter in ChatGPT. What you enter in ChatGPT is really the user input, which gets combined with other information (e.g. ChatGPT plugins) and then gets sent to the LLM model. For your use cases, for every user input, you will need a combination of the following: {Prompt Template + User Input + Additional context/information}.

Here’s an example of a “prompt template” which you could use in an application that extracts entities from text and creates a JSON:

You are a Data Extractor bot. Your task is to extract entities from the provided human input and respond with a JSON containing details about the following entities: person, company, email, phone number, topic, funding, executive names, financial details, location, and key events mentioned in the text. Wherever necessary, you will create an array object and keep related information together. If you don’t find any relevant data to extract, skip that field in the JSON. Make sure to enclose your entire response in ```
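Below is a minimal sketch of how this template gets combined with a user input and sent to a model, again assuming the OpenAI v1 client; the article text and model name are illustrative.

```python
# Sketch of combining the data-extractor prompt template above with a user input.
# Assumes the OpenAI v1 client; the example article text is made up for illustration.
from openai import OpenAI

PROMPT_TEMPLATE = (
    "You are a Data Extractor bot. Your task is to extract entities from the provided "
    "human input and respond with a JSON containing the following entities: person, "
    "company, email, phone number, topic, funding, executive names, financial details, "
    "location, and key events mentioned in the text. If you don't find any relevant "
    "data to extract, skip that field in the JSON."
)

user_input = "Acme Corp raised $20M led by Jane Doe (jane@example.com) to expand in Austin."

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": PROMPT_TEMPLATE},  # the fixed prompt template
        {"role": "user", "content": user_input},         # the per-request user input
    ],
)
print(response.choices[0].message.content)  # JSON string with the extracted entities
```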

Using Retrieval Augmented Generation (RAG) and Vector Embeddings

As mentioned above, LLMs don’t have access to real-time or proprietary information, as they are trained on a periodic basis. For most applications, you will need to provide your proprietary information to the model so that it can give a factual and relevant response for your application. This concept is known as RAG, or Retrieval Augmented Generation, whereby the LLM model is augmented with relevant information so that it can give a reasonable response.

Below are the most popular mechanisms to implement Retrieval Augmented Generation (RAG):

  1. Create embeddings using a Vector Store: Simply described, this process allows you to create your own index of data and lets the LLM fetch the most relevant chunks of information from your index to augment your prompt. As an example, imagine you are building a new way to search Shopify or e-commerce product listings based on a natural language query such as “find me blue shoes which have a brown sole and yellow colored laces”. You could achieve that by creating an index of all your product listings and descriptions in a vector DB (e.g. Pinecone) or index (e.g. FAISS), retrieving the most relevant chunks based on similarity, providing those in a prompt to the LLM model, and then letting the LLM give you the most useful result (see the retrieval sketch after this list). In this type of setup, you are using a prompt template (which defines the purpose, instructions, and limitations) along with retrieved embeddings. This approach is the most cost effective and fastest to implement.
  2. Dynamically inject data for hardcoded variables in your prompt template: In this option, you can include content in your prompt template (e.g. if it’s a restaurant ordering bot, you can include the menu in the prompt template itself). You could also dynamically include information for a fixed set of variables. As an example, you could include a variable for “user_name”, dynamically fetch the name for each user who is interacting as a pre-processing step, update your prompt template, and then hit the LLM API (see the template-variable sketch after this list). You can use multiple such hard-coded variables and fetch their values via your internal services. This is useful for responding to very specific types of inputs which you already anticipate (e.g. “What is the status of my order?”). It is different from using embeddings, where the LLM gets relevant chunks of information automatically based on the user input. Embeddings can solve for a wider variety of user inputs, whereas prompt-encoded data variables solve for specific, known purposes.
  3. Creating LLM Tools or Plugins to dynamically fetch information based on the user input or question to be answered: Think of this as a wrapper on top of your existing services, APIs, and databases. As an example, imagine you have an Elasticsearch index where you are storing information about your customers. For an LLM chat application, you need to greet the customer based on their timezone. In this case, you can create a tool which calls an API to your data store (e.g. the Elasticsearch index), gets the specific timezone for a given customer, and appends the information to the final prompt. Doing this requires an intermediate planner or an LLM agent which first plans the steps it needs to take, generates the action plan, picks the tool which can get the necessary information, triggers the tool, and retrieves the information which goes into your prompt. We will have a separate post on LLM agents and how to best leverage them; for now, you can read this paper on the ReAct approach for planning and executing actions/tasks.
  4. Conversational Memory: This is not exactly used for real-time or proprietary information retrieval; however, it is an important component for building chat assistants. Conversational memory is a way to persist and retrieve the previous chat history with the user, which can then be included in the prompt before calling the LLM model. If you don’t include history, it will be a terrible experience for your end user, where the chatbot won’t know their name even if the user mentioned it a few seconds ago. We’ll go into detail in subsequent posts.
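Here is a minimal sketch of option 1, the embeddings-based retrieval step. It assumes the sentence-transformers and faiss-cpu packages; the embedding model name and the product listings are illustrative, and any embedding model or vector DB could be swapped in.

```python
# RAG retrieval sketch: embed product listings, index them with FAISS, and pull
# the most similar chunks into the prompt before calling the LLM.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

listings = [
    "Blue running shoes with a brown rubber sole and yellow laces",
    "Black leather boots with a white sole",
    "Red canvas sneakers with white laces",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative embedding model
vectors = embedder.encode(listings).astype("float32")

index = faiss.IndexFlatL2(vectors.shape[1])                # exact L2 search over all listings
index.add(vectors)

query = "find me blue shoes which have a brown sole and yellow colored laces"
query_vec = embedder.encode([query]).astype("float32")
_, ids = index.search(query_vec, 2)                        # indices of the 2 most similar listings

retrieved = "\n".join(listings[i] for i in ids[0])
prompt = (
    "You are a product search assistant. Using only the listings below, "
    f"answer the shopper's query.\n\nListings:\n{retrieved}\n\nQuery: {query}"
)
# `prompt` is then sent to the LLM of your choice (see the API call sketch earlier).
```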
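And a small sketch of option 2, filling hard-coded variables in the prompt template before calling the model. The lookup functions here are hypothetical stand-ins for your internal services (user database, orders API, and so on).

```python
# Sketch of dynamically injecting data into hard-coded prompt template variables.
PROMPT_TEMPLATE = (
    "You are a support assistant for an online store.\n"
    "Customer name: {user_name}\n"
    "Latest order status: {order_status}\n"
    "Answer the customer's question politely and concisely."
)

def get_user_name(user_id: str) -> str:
    return "Alex"  # hypothetical: would call your user service

def get_order_status(user_id: str) -> str:
    return "shipped, arriving Friday"  # hypothetical: would call your orders API

def build_prompt(user_id: str) -> str:
    # Pre-processing step: fetch the values, then fill the template.
    return PROMPT_TEMPLATE.format(
        user_name=get_user_name(user_id),
        order_status=get_order_status(user_id),
    )

print(build_prompt("user-123"))  # the filled template is then sent to the LLM
```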

Once you have all the pieces together, you can combine the {Prompt Template + User Input + Additional context/information + Optional Conversational Memory}, call the LLM model (e.g. GPT, Llama-2), and get the LLM response. A minimal sketch of this final assembly is below.
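The sketch below assumes the OpenAI v1 client; in practice the chat history would come from your conversational memory store and the context from your retriever, and all the strings here are illustrative.

```python
# Final assembly sketch: prompt template + retrieved context + conversational
# memory + user input, sent to the model in a single call.
from openai import OpenAI

client = OpenAI()

system_prompt = "You are a helpful shopping assistant. Use the provided context when relevant."
retrieved_context = "Blue running shoes with a brown rubber sole and yellow laces - $79"  # from your retriever
history = [  # from your conversational memory store
    {"role": "user", "content": "Hi, my name is Sam."},
    {"role": "assistant", "content": "Hi Sam! How can I help you today?"},
]
user_input = "Do you have blue shoes with yellow laces?"

messages = (
    [{"role": "system", "content": f"{system_prompt}\n\nContext:\n{retrieved_context}"}]
    + history
    + [{"role": "user", "content": user_input}]
)

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```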

If you want to go deeper into large language model architecture and the best models available, you can read our blog post comparing the LLM models by Google and OpenAI, or the Claude family of advanced models.

Hopefully this was informative; please leave a comment with your thoughts. If you are looking to build your own LLM applications, please try Bind AI.