Running Llama3:8b on Windows using WSL2

Rohan Patel
7 min read · Jun 6, 2024


[Using Hugging Face, official Llama3:8b weights, Transformers]

Welcome to my blog post on Medium! I am excited to share my insights and knowledge with you. Today, we’ll look at how to run Llama3 on Windows with WSL2.

Table of Contents

1. Introduction

2. Prerequisites

  • Software Requirements
  • Hardware Requirements
  • Setting Up WSL2 (Choosing and Installing a Linux Distribution)

3. Setting up the Virtual Environment

  • Installing required packages and libraries
  • Generate the HuggingFace-Hub Token

4. Installing Llama3

  • Download Llama3 official weights from Hugging-Face

5. Running Llama3

6. Additional Resources

1. Introduction

This article will walk you through setting up your environment, leveraging the capabilities of an Nvidia GeForce GTX 1650, and getting started with Llama3, making it easier for developers and AI enthusiasts to use the model on a Windows PC. Let’s dive in and begin this wonderful journey together!

2. Prerequisites

Software Requirements

1. Windows Version
  • Make sure the system has the following two Windows Features enabled:
    - Virtual Machine Platform
    - Windows Subsystem for Linux
  • Clicking OK will restart your system for the changes to take effect.
Windows Features
2. Python 3.10 or higher

Hardware Requirements

  • A system with at least 16 GB of RAM and 100 GB of free disk space; optionally, a GPU (Nvidia GeForce GTX 1650 or higher)

Note: A dedicated graphics processor is recommended to run Llama3 locally on your system. The computational demands of the model make it impractical to run the original weights efficiently on a CPU alone.
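
A quick back-of-the-envelope calculation illustrates this (assuming the weights are loaded in 16-bit precision, as we do later in this guide):

# Rough memory estimate for the Llama3 8B weights alone (no activations or overhead)
params = 8e9           # ~8 billion parameters
bytes_per_param = 2    # bfloat16/float16 store 2 bytes per parameter
print(f"~{params * bytes_per_param / 1e9:.0f} GB")  # ~16 GB

This is why at least 16 GB of system RAM is listed above; with device_map="auto" (used later in this guide), the load can be spread across GPU and system memory.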

Setting Up WSL2 (Choosing and Installing a Linux Distribution)

Open the Microsoft Store and search for Ubuntu. I’m using Ubuntu 24.04 LTS.

  • Step 1: Download Ubuntu from the Microsoft Store.
Open Microsoft Store
  • Step 2: Open the Ubuntu app you just downloaded. If you are installing it for the first time, you might see the following screen.
Message when launching Ubuntu for the first time
  • Step 3: Install the latest Linux kernel update package from https://aka.ms/wsl2kernel and, once downloaded, run the setup file.
  • Step 4: Launch Ubuntu again and it will ask you to create a username and a password.
Create your username and password
  • Once your user ID is created, you are all set to use Linux via WSL2.

3. Setting up the Virtual Environment

  • Make sure you have VS Code and Python installed in WSL2.
    - By default, Ubuntu installs Python 3.x.
    - Running which python3 returns the Python installation path:

which python3
  • Run the following commands in Ubuntu to install the required software.

Install pip using the command:

sudo apt install python3-pip

Create a virtual environment so that all the required libraries are installed in an isolated environment.

# To create a virtual environment
python3 -m venv .yourenv-name

# Activate your virtual environment
source .yourenv-name/bin/activate

Once your venv is activated, install all the packages required to run Llama3 locally.

Installing the required packages/libraries.

Create a requirements.txt file with the following packages:

  • torch
  • accelerate
  • jupyter
  • transformers

Then install them with:

pip install -r requirements.txt

This will install all the packages inside the virtual environment.

Ensure that your Nvidia driver and the CUDA toolkit are set up correctly. In Jupyter (or any Python shell), running the script below should display your GPU device name.
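
A minimal sketch of such a check with PyTorch:

import torch

# Confirm that PyTorch can see the GPU from inside WSL2
print(torch.cuda.is_available())          # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce GTX 1650"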

Generate the HuggingFace-Hub Token

To generate the token, you need a Hugging Face account.

  • Create a Hugging Face account.
  • Go to Settings > Access Tokens.
  • Click “New Token”, name it (e.g., local_llm), and set the type to "Read".
  • Copy your token and replace Your_Token_ID in the code.
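
Alternatively, instead of pasting the token into the pipeline call later, you can authenticate once with the huggingface_hub library (installed alongside transformers); a minimal sketch:

from huggingface_hub import login

# Caches the token locally so later model downloads are authenticated automatically
login(token="Your_Token_ID")  # replace with the token you just generated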

4. Installing Llama3

Download Llama3 official weights from Hugging-Face.

import torch
import transformers

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
    token="Your_Token_ID",  # the Hugging Face token you generated earlier
    batch_size=6,
)

model_id: This is the identifier of the model you want to use for text generation. In this example it is “Meta-Llama-3-8B-Instruct” from the “meta-llama” organization on Hugging Face; the identifier tells the pipeline which architecture and weights to load.

Hugging-Face repository Link: meta-llama/Meta-Llama-3-8B-Instruct · Hugging Face

pipeline: This function from the transformers package creates a pipeline for text generation, letting us run text-generation tasks with pre-trained models with very little code.

“text-generation”: This option describes the task we want the pipeline to do. In this situation, it’s text generation, which means the pipeline will produce text based on the input prompt.

model: This option specifies the pre-trained model to use for text generation. It is set to the model_id variable, so the pipeline loads the model identified by model_id.

model_kwargs: This option passes additional keyword arguments to the model during initialization. Here it sets torch_dtype to torch.bfloat16, which loads the weights in bfloat16, a reduced-precision floating-point format that roughly halves memory use compared with 32-bit floats.

device_map: This option specifies which device(s) to use for model inference. It is set to “auto”, so the pipeline places the model on the available device(s) based on your system setup, spilling over to CPU memory when the GPU does not have enough room. (You can also place the model on a specific device explicitly.)

token: This argument is where you pass your Token ID generated through Hugging Face.

batch_size: This option determines the batch size for model inference. It is set to 6, so the pipeline processes input data in batches of six. Larger batch sizes increase throughput when generating for many prompts at once, at the cost of additional GPU memory; for a single prompt the value has little effect.

5. Running Llama3

This code sets up a conversation prompt for text generation using the Hugging Face Transformers library. Let us break down what each part does.

messages = [
    {"role": "system", "content": "You are a French chatbot who always responds in French!"},
    {"role": "user", "content": "Who are you?"},
]

This creates a list of dictionaries, with each dictionary representing one message in the conversation. The “role” field indicates whether the message comes from the system or the user, while the “content” field contains the message’s actual text.
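
If you later want to continue the conversation, you simply append more dictionaries in the same format (the assistant reply below is a hypothetical example):

messages.append({"role": "assistant", "content": "Je suis un chatbot français."})
messages.append({"role": "user", "content": "Peux-tu te présenter en une phrase ?"})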

Applying Chat Template

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

This applies the model’s chat template to the messages, turning them into a single prompt for the text-generation model. Because tokenize=False, the result is returned as a formatted string rather than token IDs; the template inserts the special tokens that mark the start and end of each message, and add_generation_prompt=True appends the header that tells the model to begin the assistant’s reply.
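
For reference, the resulting prompt string should look roughly like this (the exact special tokens come from the model’s chat template, so treat it as an illustration):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a French chatbot who always responds in French!<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Note that it ends with the assistant header, so the model continues generating from there.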

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

pipeline.tokenizer.eos_token_id:

This retrieves the eos_token_id attribute from the pipeline’s tokenizer. It is the vocabulary ID of the end-of-sequence token, which marks the end of a sequence (for example, the end of a message or a sentence). When the model produces this token during generation, it stops generating further tokens.

pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"):

This looks up the vocabulary ID of the special <|eot_id|> token. In the Llama3 chat format, <|eot_id|> marks the end of a turn (the end of a single message), so it acts as an additional terminator: when the model emits it, the assistant’s reply for the current turn is complete.

Note: By including both, generation stops at whichever terminator the model produces first, either the regular end-of-sequence token or the end-of-turn token. This keeps the output to a single, coherent reply rather than letting the model run on into further turns.
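
If you want to sanity-check these IDs, you can simply print them:

# Optional: inspect the two terminator token IDs
print(pipeline.tokenizer.eos_token_id)                         # end-of-sequence ID
print(pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"))  # end-of-turn ID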

%%time

outputs = pipeline(
    prompt,
    max_new_tokens=32,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

print(outputs[0]["generated_text"][len(prompt):])

The code above invokes the pipeline with the conversation prompt and several generation parameters:

  • prompt: The conversation prompt prepared earlier.
  • max_new_tokens: The maximum number of tokens to generate (32 here, to get a quick response).
  • eos_token_id: The terminators that tell the pipeline when to stop generating; here it is the terminators list defined earlier.
  • do_sample: Whether to use sampling during generation. If True, the model samples from the distribution of predicted tokens; if False, it uses greedy decoding.
  • temperature: Controls the randomness of sampling. Higher values lead to more diverse outputs; lower values make the output more predictable.
  • top_p: The nucleus sampling threshold. Only the smallest set of tokens whose cumulative probability reaches 0.9 is kept for sampling; the remaining tokens are filtered out.

Generated Output:
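
If you would rather get a reproducible reply, you can disable sampling; a minimal variation on the call above (greedy decoding ignores temperature and top_p, so they are omitted):

outputs = pipeline(
    prompt,
    max_new_tokens=32,
    eos_token_id=terminators,
    do_sample=False,  # greedy decoding: always pick the highest-probability token
)

print(outputs[0]["generated_text"][len(prompt):])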

6. Additional Resources

[1] Meta AI announcement: Introducing Meta Llama 3: The most capable openly available LLM to date

[2] Hugging Face Meta-Llama3 Repository: meta-llama/Meta-Llama-3-8B-Instruct · Hugging Face

[3] WSL2 Setup: WSL 2: Getting started (youtube.com)


Rohan Patel

Data Scientist with expertise in Python, R, and SQL. Skilled in ML algorithms and AI. Passionate about LLMs, leveraging data to solve problems, and continuous learning.