

Building a Chatbot with Llama 2 Model



By Eddy Ejembi



Introduction

Generative AI has become widely adopted and has led to the development of chatbots that can converse with humans on a wide range of topics. An example of such a chatbot is OpenAI's popular ChatGPT, built on the GPT-3.5 model.

GPT (Generative Pre-trained Transformer) is a state-of-the-art language model developed by OpenAI. It uses deep learning techniques to produce natural language text that closely resembles human-authored content, such as stories, articles, or even conversations. GPT was first introduced in 2018 as part of OpenAI's transformer-based language model series. Its architecture is based on the transformer, a neural network model that uses self-attention to process input sequences. ChatGPT is based on the GPT-3.5 model. More recently, OpenAI released GPT-4, its most advanced system, which produces safer and more useful responses and can solve difficult problems with greater accuracy.


Then Comes Llama 2 🔥

Llama 2 is a large language model (LLM) developed by Meta AI. It is a successor to the Llama 1 model, and it is one of the largest and most powerful LLMs currently available. Llama 2 is trained on a massive dataset of text and code, and it can be used for a variety of tasks, including generating text, translating languages, writing different kinds of creative content, and answering your questions in an informative way.

Llama 2 holds its own against ChatGPT on many benchmarks, and it comes in three flavors with 7 billion, 13 billion, and a whopping 70 billion parameters.

It stands out in other ways too: it produces safer outputs and scores well on helpfulness evaluations that exclude coding and math reasoning prompts, and because it is trained on more recent data than ChatGPT, it tends to generate more up-to-date information. Llama 2 also supports context lengths of up to 4,096 tokens, twice that of the original Llama.

Llama 2 compared to other LLMs. Source: Meta AI

This project aims to dive deep into the world of generative AI and build a chatbot using the Llama 2 model.


Tools and IDE

The tools and IDE used in this project are:

  • Google Colab
  • Hugging Face Model Hub

Google Colab, or Colaboratory, is a web-based hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs and TPUs. Colab is especially well suited to machine learning, data science, and education.

To use Colab, you simply need a Google account; then visit the Colab website. Once you are logged in, you can create a new notebook or open an existing one. We use Colab mainly for the free GPU it offers, which is essential for running the chatbot.
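
Before loading the model, it is worth confirming that a GPU runtime is attached (Runtime > Change runtime type > GPU). A minimal check, assuming PyTorch is already available in the Colab environment, is:

import torch

# Prints True and the GPU name (e.g. a Tesla T4) when a GPU runtime is attached
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))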

Hugging Face is a company that develops open-source tools and resources for machine learning. Their flagship product is the Transformers library, which is a popular library for natural language processing (NLP) tasks such as text generation, translation, and question answering.

Hugging Face also provides a variety of other resources for the NLP community, including: Datasets, Model Hub, Tools and Utilities.

All flavors of the Llama 2 model are available on the Hugging Face Model Hub. For this project we use the Llama-2-7b-chat-hf model from the hub.

Source: ai.meta.com/llama/

Note that you will need to request access from Meta AI before you can use the model from the Hugging Face Model Hub.


Building🛠⚙

After gaining access to the model, the next step is to build the chatbot.

On our Colab Notebook, we installed the required packages using pip.

!pip install transformers torch accelerate gradio

This will install all the libraries and packages needed to build the chatbot.

When the installation is complete, we log in to Hugging Face using the Hugging Face CLI. This is necessary because Llama 2 is a gated model and requires a user access token to verify that you have access to it.

What are User Access Tokens?

User Access Tokens are the preferred way to authenticate an application or notebook to Hugging Face services. You can create and manage your access tokens in your settings.

After generating your token, log in to the Hugging Face Hub using the CLI.

!huggingface-cli login

Run the cell and enter your token in the space provided for you. If login was successful, you will get a message:

Login successful
Your token has been saved to /root/.huggingface/token

You can also verify you are logged in using the script:

!huggingface-cli whoami
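
Alternatively, if you prefer not to paste the token interactively, the huggingface_hub package (installed alongside transformers) provides a login() helper you can call from a cell. A minimal sketch, with a placeholder token you would replace with your own:

from huggingface_hub import login

# Replace the placeholder with your own access token;
# never commit it to a public notebook or repository
login(token="hf_xxxxxxxxxxxxxxxx")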

Next, we import the libraries used in building the chatbot:

import transformers
from transformers import AutoTokenizer, pipeline
import torch
import gradio as gr

transformers is a Hugging Face library built around the Transformer architecture. It provides the AutoTokenizer class and the pipeline function, both of which are used in this project.

torch is a deep learning framework used in building machine learning models.

gradio is a framework for demoing machine learning models with a friendly web interface so that anyone can use them, anywhere.

Next we load the model and tokenizer.

model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model, use_auth_token=True)

The model is the Llama-2-7b-chat model from Hugging Face. The AutoTokenizer.from_pretrained() function loads a pre-trained tokenizer from the Hugging Face Model Hub. The tokenizer converts text into tokens, the basic units of information used by the model. The use_auth_token=True argument tells the function to use your Hugging Face authentication token when downloading from the gated repository.
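
As a quick illustration of what the tokenizer does (the sample sentence is arbitrary), you can encode a string into token IDs and decode it back:

sample = "Hello, Llama 2!"
token_ids = tokenizer(sample)["input_ids"]
print(token_ids)                    # a list of integer token IDs, starting with the <s> token
print(tokenizer.decode(token_ids))  # reconstructs the text, including special tokens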

Next, we establish a text generation pipeline that makes it easy to provide prompts to the model and receive generated text as a result. This cell takes about 3-6 minutes to complete.

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

The following are the arguments to the pipeline() function:

  • task: The task that the pipeline should perform. In this case, the task is text generation.
  • model: The pre-trained language model to use for text generation.
  • torch_dtype: The data type to use for the pipeline. In this case, the data type is torch.float16, which is a half-precision floating point data type. This can help to reduce the memory and computational requirements of the pipeline.
  • device_map: Specifies which devices to use for the pipeline. Here it is set to "auto", which means the pipeline automatically places the model on the best available hardware.
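
Before wiring the pipeline into the chat logic, a quick smoke test (the prompt below is just an example) confirms that it loads and generates text:

test = pipeline(
    "Explain what a large language model is in one sentence.",
    do_sample=True,
    max_new_tokens=64,
)
print(test[0]["generated_text"])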

Open-access models give us full control over the system prompt. The syntax for prompting the Llama 2 model is:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

Building the Prompt:

SYSTEM_PROMPT = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

"""

# Formatting function for message and history
def message_format(message: str, history: list, memory_limit: int = 5) -> str:

    # always keep len(history) <= memory_limit
    if len(history) > memory_limit:
        history = history[-memory_limit:]

    if len(history) == 0:
        return SYSTEM_PROMPT + f"{message} [/INST]"

    formatted_message = SYSTEM_PROMPT + f"{history[0][0]} [/INST] {history[0][1]} </s>"

    # Handle conversation history
    for user_msg, model_answer in history[1:]:
        formatted_message += f"<s>[INST] {user_msg} [/INST] {model_answer} </s>"

    # Handle the current message
    formatted_message += f"<s>[INST] {message} [/INST]"

    return formatted_message

From the code snippet above, we set up the information that guides how the model responds to user input. The SYSTEM_PROMPT variable contains the initial guidance for the model, explaining its behavior.

Breaking down the parameters: message is the current message we want to send to the model, and history is a record of the conversation, stored as a list of pairs. Each pair includes what the user said and how the model responded, for example [(user_msg1, bot_msg1), (user_msg2, bot_msg2), (user_msg3, bot_msg3), ...].

We use this information to format the conversation in a way that makes sense to the model, with the user's messages and the model's responses clearly separated. This helps the model understand the context and generate more relevant replies.
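
To make the formatting concrete, here is a hypothetical two-turn history passed through message_format (the messages are made up for illustration):

history = [
    ("Hi, who are you?", "I am a helpful assistant."),
    ("What is Python?", "Python is a popular programming language."),
]
print(message_format("Can you recommend a Python book?", history))
# The output begins with the system prompt (which already opens the first [INST] block),
# wraps each subsequent turn in <s>[INST] ... [/INST] ... </s>,
# and ends with the new message followed by [/INST]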

After building the prompt, we generate a response for the Llama 2 model.

# Generate a response from the Llama model
def llama_response(message: str, history: list) -> str:

    query = message_format(message, history)
    response = ""

    sequences = pipeline(
        query,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=1050,
    )

    generated_text = sequences[0]['generated_text']
    response = generated_text[len(query):]  # Remove the prompt from the output

    print("Chatbot:", response.strip())
    return response.strip()

This function, llama_response, talks to the Llama 2 model. It takes a user's message and the conversation history, prepares them for the model, and gets a response. The model generates text, and we remove the initial part that contains the prompt. The max_length argument caps the total sequence (prompt plus generated text) at 1,050 tokens to keep the output manageable. Finally, the response is printed and returned.
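
Before adding a user interface, you can call the function directly in a notebook cell; the question below is only an example:

history = []
reply = llama_response("What is the capital of France?", history)
history.append(("What is the capital of France?", reply))  # keep the turn for follow-up questions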

Now that we’ve successfully prompted the model and gotten a response from it, the next step is to build a simple chat interface using Gradio’s ChatInterface.

ChatInterface in Gradio is a user-friendly way to build interactive chatbot user interfaces with just a few lines of code.

import gradio as gr

gr.ChatInterface(llama_response).launch(share=True)

This code creates a simple chatbot interface for interacting with the Llama-2-7b-chat model. By setting share=True, you get a shareable link that stays active for 72 hours.

Chatbot interface built with Gradio


Conclusion

We discussed how to use Google Colab to build a chatbot with the Llama 2 model, as well as the benefits of using Google Colab and Hugging Face for NLP projects.

If you are interested in training or using any large language model, I encourage you to try Google Colab. It is a free and easy-to-use platform that provides access to powerful computing resources.

If you are working on an NLP project, I also encourage you to use Hugging Face tools and resources. Hugging Face provides a wide variety of pre-trained models, easy-to-use tools and utilities, and a large and active community of users.

The Hugging Face library makes it easy to create a pipeline to chat with Llama 2 or any other open-source LLM.

Link to the code on GitHub: EddyEjembi/Llama2_ChatBot












