How to chat with your documents

Large language models (LLMs) have been all the rage since OpenAI released ChatGPT, a conversational language model tuned for dialog-style Q&A.

Reading technical documentation in any field is a demanding task. For example, sustainability legislation in Europe has changed a lot recently with the introduction of the CSRD (Corporate Sustainability Reporting Directive). It is a set of rules that requires businesses to disclose their sustainability practices and impacts. It is meant to give stakeholders, including investors, customers, and the general public, greater transparency about a company's environmental, social, and governance (ESG) performance.

After some interesting conversations with sustainability consultants from Move to Impact, the problem of accessing up-to-date information in a digestible way became clearer. So I decided to dip my toes in and get hands-on with extending these language models so they can read up-to-date legislation and explain it in a simple way.

💡

TLDR: Inspired by OpenAI's ChatGPT, this article explores the architecture of a system that allows users to explore their own documents (like Europe's CSRD legislation) using natural language. The app parses documents, generates embeddings, and offers a Q&A interface for you to chat with your content. Built using NextJS and Tailwind, it allows PDF uploads and stores embeddings for quick reference.

mti-ai-companion

ecovirtual • Updated Aug 31, 2023

System overview

Let’s zoom out and look at the base modules involved in building an AI chatbot companion for sustainability consultants.

UI & Frontend

We’ll present the interface to the user as a web app to ensure the most universal access possible. For the actual implementation, a popular choice would be something like NextJS to handle the base scaffolding of React components, authentication, and routing. Since NextJS can also handle some server-side logic, we’ll have the document parsing running as a backend service. If you’re an edge fanatic you could even set up some lambda functions to run this kind of process.

Drag-and-drop functionality for ease of use can be handled with a library like react-dropzone.


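import { useDropzone } from 'react-dropzone';

// Call this inside a React component; setSelectedFiles is assumed to be a
// state setter defined in the same component.
const { getRootProps, getInputProps, open } = useDropzone({
  noClick: true, // clicking the drop area does nothing; use `open` to show the file dialog
  noKeyboard: true,
  onDrop: (acceptedFiles: File[]) => {
    setSelectedFiles(acceptedFiles);
  },
  multiple: true,
});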

For the actual styles, I recently fell in love with Tailwind, a utility-first CSS framework that can bring your layout into very good shape just by declaring a few class names in your components. And the best thing is that if you bootstrap your project with create-next-app, it can come pre-configured out of the box for your convenience.

💡

If you want to try the latest and greatest in development environments, give Vite a try; you won’t regret it. Build times are blazing fast, and the community has provided plug-ins that cover most features provided by frameworks like NextJS.

Document Parser

Since we want to connect a wide range of document sources, we’ll need a robust feature that lets our users drag and drop pretty much anything they want to connect to the chatbot.

This module is a collection of utilities that read from different sources and extract raw text.

Ideally we should be able to parse multiple types of documents, including PDF, MD, TXT, DOC, DOCX, etc.
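
As a rough illustration of how such a parser module might decide what to do with a dropped file, here is a minimal sketch that picks a parser from the file extension. The names ParserKind, PARSER_BY_EXTENSION, and pickParser are hypothetical and not taken from the featured app; the actual text extraction is left out.

```typescript
// Hypothetical names for illustration; not the featured app's code.
type ParserKind = 'pdf' | 'markdown' | 'text' | 'doc' | 'docx';

// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const PARSER_BY_EXTENSION = new Map<string, ParserKind>([
  ['pdf', 'pdf'],
  ['md', 'markdown'],
  ['txt', 'text'],
  ['doc', 'doc'],
  ['docx', 'docx'],
]);

// Returns the parser to use for a file name, or null when the type is unsupported.
function pickParser(fileName: string): ParserKind | null {
  const dot = fileName.lastIndexOf('.');
  // No dot, a leading dot (dotfile), or a trailing dot means there is no usable extension.
  if (dot <= 0 || dot === fileName.length - 1) return null;
  const extension = fileName.slice(dot + 1).toLowerCase();
  return PARSER_BY_EXTENSION.get(extension) ?? null;
}
```

Returning null for unknown types lets the UI reject a file right at drop time instead of failing later during parsing.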

Embedding Knowledge

If you haven’t had the time to dive deeper into the Transformer architecture (GPT = Generative Pre-trained Transformer), you may be wondering what the heck an “embedding” is and why you even need one. The seminal paper “Attention Is All You Need” is very accessible if you’re already familiar with the main concepts of neural networks.

Working with embeddings is a good alternative to fine-tuning (re-training) your model. There are two main ways to add more “knowledge” to the memory of your chatbot:

Fine-tuning: Re-train the original model to add more information

This usually requires more work and computing power, since it involves almost a full model-training flow, including general ETL (Extract, Transform, Load) of your data, hyper-parameter tuning, and enough iterations to be happy with the result.

Aggregate Context: Add more information as “context” to your prompt or question

This approach gives us less flexibility to re-shape our model, but it’s cheaper and faster. It involves parsing our original text data into a vector representation (aka embeddings). These vectors let us find the pieces of text most relevant to a prompt or question, and those pieces are sent along as extra context for the model to take into consideration when it responds.

The main limitation of this approach is the maximum size of the “context” that you’re allowed to inject. State-of-the-art models today only accept a few pages of content at a time. This will likely be less of a limitation as models evolve.

Embedding flow

Chunk raw text: embeddings are generated from pieces of text of limited size, so we need to chunk our whole body of text first.

Calculate tokens (cost of embedding): most services charge you per “token”, a discrete unit of text (roughly a word fragment) that serves as a proxy for the computation needed to run a task. We want to pre-calculate the cost in order to avoid nasty surprises.

Generate embeddings: finally, batch-process all your chunks of text and generate the vector representations of your data.

Store your context: use a vector database to save your documents. By storing the vector representation of your data, you only need to spend the parsing computation once, and you can re-use it every time you ask your chatbot a question.


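// Import paths follow the LangChain JS package layout from 2023 (assumption);
// newer releases may have moved these modules.
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { DocxLoader } from 'langchain/document_loaders/fs/docx';
import { TextLoader } from 'langchain/document_loaders/fs/text';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';

// Load PDF, DOCX, and TXT files from the specified directory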
const directoryLoader = new DirectoryLoader(filePath, {
  '.pdf': (path) => new PDFLoader(path),
  '.docx': (path) => new DocxLoader(path),
  '.txt': (path) => new TextLoader(path),
});

const rawDocs = await directoryLoader.load();

// Split the loaded documents into smaller chunks
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: Number(chunkSize),
  chunkOverlap: Number(overlapSize),
});

const docs = await textSplitter.splitDocuments(rawDocs);

// OpenAI embeddings for the document chunks
const embeddings = new OpenAIEmbeddings({
  openAIApiKey: openAIapiKey as string,
});

// Get the Pinecone index with the given name
const index = pinecone.Index(targetIndex);

// Store the document chunks in Pinecone with their embeddings
await PineconeStore.fromDocuments(docs, embeddings, {
  pineconeIndex: index,
  namespace: namespaceName as string,
  textKey: 'text',
});
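
The snippet above does not show the token-counting step from the embedding flow. A minimal sketch of chunking plus a cost estimate could look like the following. It is a simplification under stated assumptions: chunkText uses plain fixed-size character windows rather than the separator-aware splitting of RecursiveCharacterTextSplitter, the figure of about 4 characters per token is only a rough rule of thumb for English text, and the price per 1,000 tokens is a parameter you would look up yourself. All function names are hypothetical.

```typescript
// Split text into fixed-size character windows that overlap by `overlap` characters.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('need chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// Rough heuristic (assumption): about 4 characters per token for English text.
// Use a real tokenizer when you need exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Estimated cost of embedding all chunks, given a price per 1,000 tokens.
function estimateCostUsd(chunks: string[], usdPer1kTokens: number): number {
  const tokens = chunks.reduce((sum, chunk) => sum + estimateTokens(chunk), 0);
  return (tokens / 1000) * usdPer1kTokens;
}
```

Running the estimate before calling the embeddings endpoint gives you a chance to warn the user, or refuse the upload, when a document is unexpectedly large.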

Query Enhancing

Now we have all we need to enrich our initial prompt or question so that our chatbot has enough context to give us an answer that includes our own data. The trick is to combine our prompt with the context retrieved through embeddings before we send it to the inference process of our LLM.
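
To make the idea concrete, here is a minimal sketch of how a question could be combined with retrieved chunks. It assumes the chunks arrive sorted with the most relevant first and uses a simple character budget as a stand-in for the model's real token limit. RetrievedChunk, buildPrompt, and the instruction wording are all hypothetical, not the featured app's prompt.

```typescript
interface RetrievedChunk {
  text: string;
  source: string; // for example, the file name of the original document
}

// Builds a single prompt string from the question and as many chunks as fit the budget.
function buildPrompt(
  question: string,
  chunks: RetrievedChunk[],
  maxContextChars: number,
): string {
  const picked: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const entry = `[${chunk.source}] ${chunk.text}`;
    if (used + entry.length > maxContextChars) break; // stop once the budget is exhausted
    picked.push(entry);
    used += entry.length;
  }
  return [
    'Answer the question using only the context below.',
    'If the context does not contain the answer, say so.',
    '',
    'Context:',
    ...picked,
    '',
    `Question: ${question}`,
  ].join('\n');
}
```

Prefixing each chunk with its source is one way to let the model cite where an answer came from.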

Data Parsing Flow

Get data from file

Parse PDF to Text

Chunk raw text

Calculate amount of Tokens (for cost)

Generate Embeddings

Store embeddings

Generating Answers

Search related embeddings based on user query or prompt

Attach embedding context (and chat history) to query

Send everything to the inference process

Parse response and serve it to the user

Options for data ingestion

Devs in the wild have already started playing with embeddings and figuring out how to extend LLMs with contextual embedded data. I took inspiration from these projects and decided to mix my own brew.

For data ingestion there are a few options depending on your needs.

Having an independent ingestion pipeline that is triggered whenever new data is dropped in a directory, produces the corresponding embeddings, and stores them in a database or file.

Having a document parsing feature in-app for each user to drop their own files or URLs

Generate embeddings every time a document is loaded and store them locally
Generate embeddings only once and store them in a vector database

The last two options can differ drastically in cost, depending on your use case. Generating embeddings costs money if you’re using OpenAI’s endpoint. OpenAI’s embedding models are quite affordable for casual usage, but costs can spike fast if your documents are very lengthy.
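
One way to get the “only once” behaviour is to fingerprint each file's text and skip anything already seen. The sketch below is an illustration under assumptions, not necessarily how the featured app does it: the fingerprint is a small FNV-1a-style hash over UTF-16 code units, chosen only to keep the example dependency-free (a cryptographic hash such as SHA-256 would be a safer choice in practice), and UploadedFile, fingerprint, and planIngestion are hypothetical names.

```typescript
interface UploadedFile {
  name: string;
  text: string; // raw text already extracted from the file
}

// FNV-1a-style 32-bit hash over UTF-16 code units; a stand-in for a real content hash.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Splits files into those that still need embeddings and those that can be skipped.
function planIngestion(files: UploadedFile[], known: Set<string>) {
  const toEmbed: UploadedFile[] = [];
  const skipped: string[] = [];
  const fingerprints = new Set(known); // copy, so the caller's set is not mutated
  for (const file of files) {
    const id = fingerprint(file.text);
    if (fingerprints.has(id)) {
      skipped.push(file.name);
    } else {
      fingerprints.add(id);
      toEmbed.push(file);
    }
  }
  return { toEmbed, skipped, fingerprints };
}
```

Persisting the returned fingerprints next to the stored embeddings means a re-upload of the same content costs nothing, even under a different file name.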

If you store the embeddings locally, you will also need to perform the search against these embeddings in your client, probably using JavaScript. Performing math operations like the dot product to obtain the cosine similarity of vectors is not something that JS is optimised for, so we could move this logic to the storage layer. Using the vector extension for PostgreSQL (pgvector) together with query functions can be a powerful way to squeeze out that extra bit of performance if you really need it.

I can’t write an article about LLMs without mentioning LangChain, a humble yet powerful tool packed with helper methods to assist you in fetching and parsing data and connecting it with different models. Many people online have already built interesting data ingestion flows using LangChain, and it truly seems like a great development experience.
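
For reference, this is roughly what the client-side search looks like when done by hand. It is a minimal sketch with hypothetical names (cosineSimilarity, StoredChunk, topK), and it scans every stored vector, which is fine for a handful of documents but is exactly the work a vector database takes off your hands.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat a zero vector as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface StoredChunk {
  text: string;
  embedding: number[];
}

// Returns the k chunks most similar to the query embedding, best match first.
function topK(query: number[], store: StoredChunk[], k: number): StoredChunk[] {
  return store
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

The query vector must come from the same embedding model as the stored vectors, otherwise the similarity scores are meaningless.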

The App

mti-ai-companion

ecovirtual • Updated Aug 31, 2023

Some of the base features that the app offers:

Drop PDF section

Store embeddings in Pinecone vector database

Only generate new embeddings for new files

Check for same file name & embeddings similarity

Q&A Section

Perform cosine distance calculations in the database to retrieve only relevant bits of content that can fit under the current limitations of context size

Answers include source reference pointing to the original document, so that the user can always refer back to the original material

Aggregate chat history to answer further questions in the same session
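
Because chat history competes with document context for the same limited space, it usually has to be trimmed before it is attached to a new question. The sketch below keeps only the most recent turns that fit a character budget. ChatTurn and recentHistory are hypothetical names, and the character budget is a simplification of a real token limit.

```typescript
interface ChatTurn {
  role: 'user' | 'assistant';
  content: string;
}

// Keeps the newest turns whose combined length fits within maxChars, in original order.
function recentHistory(history: ChatTurn[], maxChars: number): ChatTurn[] {
  const kept: ChatTurn[] = [];
  let used = 0;
  // Walk backwards so the most recent turns are considered first.
  for (let i = history.length - 1; i >= 0; i--) {
    const size = history[i].content.length;
    if (used + size > maxChars) break;
    kept.unshift(history[i]);
    used += size;
  }
  return kept;
}
```

Dropping the oldest turns first is the simplest policy; summarising older turns is a common alternative when long sessions matter.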

I hope this review helps you go and try your own experiments with LLMs. This technology has never been as accessible as it is today, and the possibilities are incredible. 🤖 👾

Acknowledgements

Shout-out to Erik Závodský for building doc-chatbot, a project that implements a lot of what I just described. I ended up using a lot of his source code for the featured project in this article.

Thanks again to the consultants at Move To Impact for innovating at the cutting edge of sustainability and inspiring this project.