My learning documentation for RAG

What is RAG?

Imagine we have a robot that knows a bit of everything. Now we want it to answer questions about my best friend. If we just ask the robot, it can only give generic answers, the kind most people would give, because it has never met my friend. So we “plug in” a USB drive that stores information about my best friend, and the robot can look things up there before it answers.

<aside> 🤖

“The robot” (a general-purpose language model) + “the USB drive” (external information it can look up) = RAG (Retrieval-Augmented Generation)

</aside>
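
The robot + USB idea can be sketched in a few lines. This is a toy illustration, not how a production RAG system works: it assumes simple word overlap as the "retrieval" step (real systems usually use embeddings and a vector store), and the notes, the friend's name, and the function names are all made up for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, notes, top_k=2):
    """Rank notes by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        notes,
        key=lambda note: len(question_words & tokenize(note)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, notes, top_k=2):
    """Put the retrieved notes in front of the question (the 'plug in' step)."""
    context = "\n".join(f"- {note}" for note in retrieve(question, notes, top_k))
    return f"Use only this context:\n{context}\n\nQuestion: {question}"


# The "USB drive": hypothetical facts about my friend
friend_notes = [
    "Alex's favourite food is ramen.",
    "Alex plays the violin.",
    "Alex was born in Lisbon.",
]

prompt = build_prompt("What food does Alex like?", friend_notes, top_k=1)
print(prompt)
```

The resulting `prompt` is what would be sent to the robot (the language model), so it answers from my notes instead of guessing.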

See my Internship Project

Google Colab

Parsing

One approach that I came across: Unstructured.io

<aside> 🧑‍🤝‍🧑

Another approach, the one I chose: Llama Parse

  1. used Llama Parse
  2. requires API key
  3. Code Example:
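    from llama_parse import LlamaParse

    # Needs an API key: set LLAMA_CLOUD_API_KEY or pass api_key="..."
    parser = LlamaParse(
        result_type="markdown",           # return the parsed PDF as Markdown
        parsing_instruction=instruction,  # instruction string, defined elsewhere
        max_timeout=5000,                 # seconds to wait for the parsing job
    )

    documents = parser.load_data("path_to_pdf")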
  4. References:
    1. https://docs.cloud.llamaindex.ai/llamaparse/getting_started/python
    2. https://colab.research.google.com/drive/1dO2cwDCXjj9pS9yQDZ2vjg-0b5sRXQYo (how the Llama Parse instruction matters)

</aside>
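
To connect parsing to the loading step, the parsed result has to end up in a Markdown file. A minimal sketch, assuming each object returned by `parser.load_data` exposes its Markdown as a `.text` attribute (this holds for LlamaIndex `Document` objects as far as I know, but it is worth checking against the installed version). `save_as_markdown`, the file name, and the stand-in documents are hypothetical and only there so the sketch runs without an API key.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_as_markdown(documents, out_path):
    """Join the .text of every parsed document and write it to out_path."""
    body = "\n\n".join(doc.text for doc in documents)
    path = Path(out_path)
    path.write_text(body, encoding="utf-8")
    return path


# Stand-ins for the parser output (assumption: real results also have .text)
fake_docs = [
    SimpleNamespace(text="# Page 1\nHello"),
    SimpleNamespace(text="# Page 2\nWorld"),
]

out_dir = Path(tempfile.mkdtemp())
saved = save_as_markdown(fake_docs, out_dir / "parsed.md")
print(saved.read_text(encoding="utf-8"))
```

With real parser output, the saved path is what would be passed to the Markdown loader in place of `"path_to_md"`.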

Loading

  1. Used the LangChain `UnstructuredMarkdownLoader`

  2. Why? Because the next step uses the LangChain text splitter, which can work directly on the `Document` objects this loader returns

  3. Code Example

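    from langchain_community.document_loaders import UnstructuredMarkdownLoader

    loader = UnstructuredMarkdownLoader(
        "path_to_md",
        mode="single",     # return the whole file as one Document
        strategy="fast",   # passed through to the unstructured partitioner
    )
    loaded_doc = loader.load()[0]  # "single" mode yields exactly one Document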
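
To get a feel for what the text splitter in the next step does, here is a plain-Python stand-in. It is not the LangChain implementation: it assumes the simplest possible strategy, fixed-size character chunks with a small overlap, whereas LangChain's splitters also try to break on separators such as paragraphs. The function name and parameter values are hypothetical.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=20):
    """Cut text into chunks of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny example so the overlap is easy to see
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbouring chunks, which helps retrieval find it later.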