Retrieval Augmented Generation

My learning documentation for RAG


What is RAG?

Imagine we have a robot that knows a little about everything. Now we want it to answer questions about my best friend. If we simply ask the robot, it can only give generic answers, much as a stranger would, because it has no information about my friend. So we “plug in” a USB drive that stores information about my best friend, and the robot can consult it when answering.

<aside> 🤖

The combination of “the robot” + “USB” = RAG

</aside>
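
To make the analogy concrete, here is a minimal sketch of the retrieve-then-augment idea in plain Python. Everything in it is a toy assumption: the `KNOWLEDGE` list plays the role of the “USB”, retrieval is simple word overlap rather than embeddings, and the final call to an LLM is left out, so only the augmented prompt is built.

```python
import re

# Hypothetical "USB" contents: facts the robot would not know on its own.
KNOWLEDGE = [
    "Alex is my best friend and works as a marine biologist.",
    "Alex's favourite food is ramen.",
    "The Eiffel Tower is in Paris.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k docs that share the most words with the question."""
    q = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: list[str]) -> str:
    """Prepend the retrieved context to the question (the 'augmented' part)."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "What is the favourite food of Alex?"
prompt = build_prompt(question, KNOWLEDGE)
print(prompt)  # this string is what would be sent to the LLM
```

A real pipeline would swap the word-overlap scoring for vector similarity over embedded chunks, but the shape stays the same: retrieve, build the prompt, then generate.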

See my Internship Project

Google Colab

Parsing

One approach I came across: Unstructured.io

<aside> 🧑‍🤝‍🧑

Unstructured.io </aside>

Another approach, and the one I chose: LlamaParse

Used LlamaParse
Requires an API key
Code Example:

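from llama_parse import LlamaParse

# `instruction` is a plain-text prompt describing how the PDF should be parsed.
# The API key is read from the LLAMA_CLOUD_API_KEY environment variable unless passed as api_key.
parser = LlamaParse(
    result_type="markdown",
    parsing_instruction=instruction,
    max_timeout=5000,
)

documents = parser.load_data("path_to_pdf")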

References:
1. https://docs.cloud.llamaindex.ai/llamaparse/getting_started/python
2. https://colab.research.google.com/drive/1dO2cwDCXjj9pS9yQDZ2vjg-0b5sRXQYo (why the LlamaParse parsing instruction matters)
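
The loading step reads a Markdown file, so the parsed output has to be written to disk first. Below is a minimal sketch of that hand-off. It assumes each parsed document exposes its content as a `.text` string; `ParsedDoc` and `save_markdown` are hypothetical stand-ins so the example runs without an API key, and the output file name is arbitrary.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ParsedDoc:
    """Hypothetical stand-in for one parsed document; only `.text` is used."""
    text: str


def save_markdown(docs: list[ParsedDoc], out_path: Path) -> Path:
    """Join the text of all parsed documents and write one Markdown file."""
    out_path.write_text("\n\n".join(d.text for d in docs), encoding="utf-8")
    return out_path


# Stand-in for the list returned by parser.load_data("path_to_pdf").
docs = [
    ParsedDoc("# Title\n\nFirst page."),
    ParsedDoc("## Section\n\nSecond page."),
]
md_path = save_markdown(docs, Path(tempfile.gettempdir()) / "parsed_output.md")
content = md_path.read_text(encoding="utf-8")
print(content)
```

The resulting path is what would take the place of "path_to_md" in the loading step.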

Loading

Used the LangChain UnstructuredMarkdownLoader
Why? Because the next step uses the LangChain text splitter, so the content should already be a LangChain document

Code Example:

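from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader(
    "path_to_md",
    mode="single",  # return the whole file as one Document
    strategy="fast",
)
# load() returns a list; in "single" mode it holds exactly one Document
loaded_doc = loader.load()[0]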
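
To show why a single document holding one long string is convenient for the splitting step, here is a rough sketch of character-based chunking with overlap. This is not LangChain's splitter: `split_text` is a hypothetical helper, the chunk sizes are arbitrary, and `page_content` stands in for the string available as `loaded_doc.page_content`.

```python
def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears intact in one of the two neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is never just a repeated tail.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Stand-in for loaded_doc.page_content.
page_content = "RAG notes. " * 10
chunks = split_text(page_content, chunk_size=40, overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

A real splitter also tries to cut at natural boundaries such as headings or paragraphs; the fixed-width version here only illustrates the chunk size and overlap idea.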