Retrieval-Augmented Generation with GitHub
This notebook demonstrates how to perform Retrieval-Augmented Generation (RAG) with magentic using the GitHub API. RAG retrieves relevant information and adds it to the prompt as context, which the LLM can use when generating its response. This approach allows us to supply new or private information that was not present in the model's training data.
# Install dependencies (skip this cell if already installed)
! pip install magentic
! pip install ghapi
# Configure magentic to use the `gpt-3.5-turbo` model for this notebook
%env MAGENTIC_OPENAI_MODEL=gpt-3.5-turbo
env: MAGENTIC_OPENAI_MODEL=gpt-3.5-turbo
Let's start by creating a prompt-function to generate some text recommending GitHub repos for a topic.
# Create a prompt-function to describe the latest GitHub repos
from IPython.display import Markdown, display
from magentic import prompt
@prompt(
"""What are the latest github repos I should use related to {topic}?
Recommend three in particular that I should check out and why.
Provide a link to each, and a note on whether they are actively maintained.
"""
)
def recommmend_github_repos(topic: str) -> str: ...
output = recommmend_github_repos("LLMs")
display(Markdown(output))
- Hugging Face Transformers: This repository contains a library for Natural Language Processing (NLP) tasks using the latest Transformer models, including LLMs. It is actively maintained by Hugging Face, a popular NLP research group, and has a large community contributing to it.
Link: https://github.com/huggingface/transformers
- OpenAI GPT-3: This repository contains the code for OpenAI's GPT-3 model, one of the most advanced LLMs available. While the repository may not be frequently updated due to proprietary restrictions, it provides valuable insights into how state-of-the-art LLMs are implemented.
Link: https://github.com/openai/gpt-3
- AllenNLP: AllenNLP is a deep learning library for NLP research that provides easy-to-use tools for building and experimenting with LLMs. The repository is actively maintained by the Allen Institute for AI and offers a wide range of pre-trained models, including BERT and GPT-2.
Link: https://github.com/allenai/allennlp
Please note that the availability and maintenance status of these repositories may change over time, so it's a good idea to check for the latest updates before diving in.
The LLM has no knowledge of GitHub repos created after its knowledge cutoff date. It can also hallucinate parts of its answer. To solve both issues, we need to provide up-to-date information in the prompt, which the model can use to generate an informed answer.
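Before wiring in GitHub, the overall flow can be sketched in plain Python: retrieve documents, insert them into the prompt, then generate. This is a minimal illustration, not magentic's implementation. The names `build_augmented_prompt`, `answer_with_retrieval`, `fake_retrieve`, and `fake_generate` are hypothetical, and the two stand-in functions replace the real search and LLM calls so the sketch runs without network access or an API key.

```python
from typing import Callable


def build_augmented_prompt(question: str, documents: list[str]) -> str:
    """Append retrieved documents to the question so the model can ground its answer."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"{question}\n\nHere is some up-to-date context:\n{context}\n"


def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve fresh documents, insert them into the prompt, then generate."""
    documents = retrieve(question)
    return generate(build_augmented_prompt(question, documents))


# Hypothetical stand-ins for the search step and the LLM call.
def fake_retrieve(query: str) -> list[str]:
    return [f"example/repo-one: matches '{query}'", "example/repo-two: pushed recently"]


def fake_generate(prompt_text: str) -> str:
    return f"(model response to a prompt of {len(prompt_text)} characters)"


print(answer_with_retrieval("Which repos cover LLMs?", fake_retrieve, fake_generate))
```

In the rest of the notebook, the retrieval step is a GitHub search and the generation step is a magentic prompt-function.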
First we'll create a function for searching for GitHub repos.
# Create a function to search for GitHub repos
from ghapi.all import GhApi
from pydantic import BaseModel
github = GhApi(authenticate=False)
class GithubRepo(BaseModel):
    full_name: str
    # GitHub returns null for repos that have no description
    description: str | None
    html_url: str
    stargazers_count: int
    pushed_at: str

def search_github_repos(query: str, num_results: int = 10) -> list[GithubRepo]:
    # Run the search and validate each result into a GithubRepo
    results = github.search.repos(query, per_page=num_results)
    return [GithubRepo.model_validate(item) for item in results["items"]]
# Test that github search works
for item in search_github_repos("openai", num_results=3):
    print(item.model_dump_json(indent=2))
{
  "full_name": "openai/openai-cookbook",
  "description": "Examples and guides for using the OpenAI API",
  "html_url": "https://github.com/openai/openai-cookbook",
  "stargazers_count": 55805,
  "pushed_at": "2024-04-19T19:05:02Z"
}
{
  "full_name": "betalgo/openai",
  "description": "OpenAI .NET sdk - Azure OpenAI, ChatGPT, Whisper, and DALL-E ",
  "html_url": "https://github.com/betalgo/openai",
  "stargazers_count": 2721,
  "pushed_at": "2024-04-20T22:50:28Z"
}
{
  "full_name": "openai/openai-python",
  "description": "The official Python library for the OpenAI API",
  "html_url": "https://github.com/openai/openai-python",
  "stargazers_count": 19786,
  "pushed_at": "2024-04-21T01:04:42Z"
}
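The search above passes the bare topic as the query. Because the goal is to surface recent repos, one option is to narrow the query with GitHub's search qualifiers, such as `pushed:>YYYY-MM-DD` and `stars:>=N`. The sketch below only builds the query string. The helper `build_recent_repo_query` and its default thresholds are assumptions made for illustration and are not part of ghapi or of this notebook.

```python
from datetime import date, timedelta


def build_recent_repo_query(
    topic: str, today: date, max_age_days: int = 90, min_stars: int = 100
) -> str:
    """Build a GitHub search query limited to recently pushed, reasonably popular repos."""
    pushed_after = today - timedelta(days=max_age_days)
    return f"{topic} pushed:>{pushed_after.isoformat()} stars:>={min_stars}"


query = build_recent_repo_query("LLMs", today=date(2024, 4, 21))
print(query)  # LLMs pushed:>2024-01-22 stars:>=100
```

The resulting string could be passed as the `query` argument of `search_github_repos` in place of the plain topic.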
Now we can provide the search results as context to the LLM to create an improved recommmend_github_repos function.
# Combine the search with a prompt-function to describe the latest GitHub repos
from magentic import prompt
@prompt(
"""What are the latest github repos I should use related to {topic}?
Recommend three in particular that I should check out and why.
Provide a link to each, and a note on whether they are actively maintained.
Here are the latest search results for this topic on GitHub:
{search_results}
""",
)
def recommmend_github_repos_using_search_results(
    topic: str, search_results: list[GithubRepo]
) -> str: ...

def recommmend_github_repos(topic: str) -> str:
    search_results = search_github_repos(topic, num_results=10)
    return recommmend_github_repos_using_search_results(topic, search_results)
output = recommmend_github_repos("LLMs")
display(Markdown(output))
Based on the latest search results, here are three GitHub repos related to Large Language Models (LLMs) that you should check out:
-
- Description: gpt4all: run open-source LLMs anywhere
- Stargazers Count: 63,790
- Last Pushed: 2024-04-19
- Active Maintenance: Yes
-
- Description: Unify Efficient Fine-Tuning of 100+ LLMs
- Stargazers Count: 17,047
- Last Pushed: 2024-04-21
- Active Maintenance: Yes
-
- Description: A curated list of practical guide resources of LLMs (LLMs Tree, Examples, Papers)
- Stargazers Count: 8,484
- Last Pushed: 2024-01-10
- Active Maintenance: It seems less actively maintained compared to the other two repos, but still worth checking out.
These repos cover a range of topics related to LLMs and can provide valuable resources and tools for your projects.
Now the answer contains up-to-date and correct information!
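In the output above, the model judges "actively maintained" from the raw `pushed_at` timestamps. A possible refinement is to compute the age of the last push in code and pass that to the prompt, so the judgement does not rely on the model's date arithmetic. The sketch below assumes a 90-day threshold. The helpers `days_since_push` and `is_actively_maintained` are hypothetical, and the second timestamp is illustrative rather than taken from the search results.

```python
from datetime import datetime, timezone


def days_since_push(pushed_at: str, now: datetime) -> int:
    """Return whole days between a GitHub `pushed_at` timestamp and `now`."""
    # GitHub timestamps end in "Z"; swap it for an explicit UTC offset so
    # datetime.fromisoformat accepts the value on older Python versions too.
    pushed = datetime.fromisoformat(pushed_at.replace("Z", "+00:00"))
    return (now - pushed).days


def is_actively_maintained(pushed_at: str, now: datetime, max_age_days: int = 90) -> bool:
    """Treat a repo as active if its last push is within `max_age_days` (assumed threshold)."""
    return days_since_push(pushed_at, now) <= max_age_days


now = datetime(2024, 4, 21, 12, 0, tzinfo=timezone.utc)
for pushed_at in ["2024-04-19T19:05:02Z", "2024-01-10T00:00:00Z"]:
    print(pushed_at, days_since_push(pushed_at, now), is_actively_maintained(pushed_at, now))
```

Any such threshold is a judgement call. A repo can be stable and useful without recent pushes, so the computed value is best presented to the model as one signal among others.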