Technical Details About Llama 3

Asad Iqbal
5 min read · Apr 27, 2024

Meta AI has introduced a new tokenizer, enhanced pretraining methods, and various other improvements in their latest model.

Since its first release, Llama has become a key component in the world of open source generative AI. I prefer to call these releases “open models” rather than open source, because they’re not entirely open source, but that’s just my personal choice. Recently, the popularity of open models has surged even more with the launch of Llama 3.

Must-Read Topics:

Meta Llama 3: Set to be the most advanced publicly accessible LLM, rivaling models like Gemini

Llama 3 — Exploring 10 Key Features of the Advanced LLM

Attention Mechanisms in Large Language Models

The release of Llama 3 builds on incredible momentum within the open model ecosystem and brings its own innovations. The 8B and 70B versions of Llama 3 are available, with a 400B version currently being trained.

The Llama 3 architecture is based on a decoder-only transformer and includes a new, highly optimized 128K tokenizer. This is quite notable, given that, with few exceptions, most large language models simply reuse the same tokenizers. The new tokenizer leads to major performance gains. Another area of improvement is grouped query attention, which was already used in Llama 2 but has been enhanced for the larger models. Grouped query attention improves inference performance by shrinking the key-value cache that must be held in memory during decoding. Additionally, the context window has increased to 8,192 tokens.

Llama 3 shows significant advancements in its training process compared to earlier versions. The model was trained on around 15 trillion tokens, a remarkable volume of data even for the 8B parameter version, and a sign of how far Meta pushed optimization in this release. It’s worth mentioning that only a small portion, about 5%, of the training data contained non-English tokens. The training setup utilized a massive infrastructure of 16,000 GPUs, achieving an impressive throughput of over 400 TFLOPS per GPU.

Architecture

Meta AI’s Llama 3 features a standard, decoder-only transformer architecture. Llama 3 introduces a tokenizer with a 128K-token vocabulary, which encodes language more efficiently and significantly boosts model performance. To improve inference efficiency, Llama 3 integrates grouped query attention (GQA) in both the 8B and 70B models. These models are trained on sequences up to 8,192 tokens long, using a masking technique to prevent self-attention from crossing document boundaries.
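
Meta hasn’t published its training code, but the document-boundary mask is easy to picture. Here is a minimal sketch, assuming a hypothetical `doc_ids` tensor that records which packed document each token belongs to: the final mask is simply the intersection of the usual causal mask and a “same document” mask.

```python
import torch

def build_attention_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask for one packed training sequence.

    doc_ids: (seq_len,) tensor mapping each token to the document it
    came from, e.g. [0, 0, 0, 1, 1]. Entry [i, j] is True when query
    position i may attend to key position j.
    """
    seq_len = doc_ids.shape[0]
    # Causal constraint: each token sees only itself and the past.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Document constraint: no attention across document boundaries.
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Three tokens from document 0 packed with two from document 1.
print(build_attention_mask(torch.tensor([0, 0, 0, 1, 1])))
```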

1 — Tokenizer: The newest version of Llama 3 introduces an advanced tokenizer with a vocabulary of 128,000 tokens, improved over previous versions to encode text more efficiently. Notably, the Llama 3–8B model was trained on an impressive 15 trillion tokens, far more data than is typical for a model of its size.
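
You can inspect the new vocabulary yourself through the Hugging Face `transformers` library, assuming you have accepted the license for the gated `meta-llama/Meta-Llama-3-8B` repository:

```python
from transformers import AutoTokenizer

# Access to the repo is gated behind Meta's license agreement.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(tokenizer.vocab_size)  # on the order of 128K entries
# A larger vocabulary packs the same text into fewer tokens, which
# directly reduces training and inference cost per sentence.
print(tokenizer.encode("Grouped query attention improves inference."))
```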

2 — GQA: Grouped-query attention (GQA) cleverly combines elements of multi-head attention (MHA) and multi-query attention (MQA) to create a more efficient attention mechanism: groups of query heads share a single key/value head. Because far fewer keys and values need to be kept around, GQA shrinks the memory footprint of the KV cache as batch sizes or context windows grow, making decoding in Transformer models smoother.
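
A minimal sketch of the core idea in PyTorch, operating on already-projected tensors (causal masking and caching omitted for brevity; this is an illustration, not Meta’s implementation):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    """Minimal GQA over already-projected tensors (no mask, no cache).

    q: (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), n_kv_heads < n_q_heads.
    Each group of n_q_heads // n_kv_heads query heads shares one
    key/value head, shrinking the KV cache by the same factor.
    """
    group_size = q.shape[1] // n_kv_heads
    # Broadcast each shared KV head to its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Toy shapes: 8 query heads sharing 2 KV heads (group size 4).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (1, 8, 16, 64)
```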

3 — RoPE: Llama 3 employs Rotary Positional Encoding (RoPE), an encoding mechanism that strikes a balance between absolute and relative positional encodings. Instead of adding a fixed positional embedding to each token, RoPE rotates the query and key vectors by an angle that depends on their position, so the attention score between two tokens ends up depending on their relative distance.
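
A simplified single-head sketch of the rotation (real implementations apply this inside the attention layer to both queries and keys, and conventions for pairing channels vary between codebases):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional encoding to x: (seq_len, dim), dim even.

    Pairs of channels are rotated by a position-dependent angle, so the
    dot product between two rotated vectors depends only on the distance
    between their positions -- the property that makes RoPE attractive.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per channel pair, decaying geometrically.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)  # 16 positions, one 64-dim attention head
print(rope(q).shape)     # torch.Size([16, 64])
```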

4 — KV caching: Key-value (KV) caching is a technique deployed to speed up inference in autoregressive models like GPT and Llama. By storing the keys and values already computed for earlier tokens, the model avoids recomputing them at every decoding step, cutting redundant matrix multiplications and improving overall efficiency.
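
A toy decoding loop for a single attention head shows the mechanic: each step computes keys and values only for the newest token and reads everything earlier back from the cache.

```python
import torch

# Toy KV cache for one attention head during autoregressive decoding.
head_dim = 64
k_cache = torch.empty(0, head_dim)
v_cache = torch.empty(0, head_dim)

for step in range(5):
    # In a real model these come from projecting the new token's hidden state.
    k_new = torch.randn(1, head_dim)
    v_new = torch.randn(1, head_dim)
    # Append instead of recomputing keys/values for the whole prefix.
    k_cache = torch.cat([k_cache, k_new], dim=0)
    v_cache = torch.cat([v_cache, v_new], dim=0)

    q_new = torch.randn(1, head_dim)  # query for the newest token only
    scores = (q_new @ k_cache.T) / head_dim ** 0.5
    out = scores.softmax(dim=-1) @ v_cache
    print(f"step {step}: attending over {k_cache.shape[0]} cached positions")
```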

Training

Meta AI has pre-trained Llama 3 on over 15 trillion tokens gathered from public sources. The training set is seven times larger than that used for Llama 2 and includes a significantly higher volume of code. With more than 5% of the training data consisting of high-quality, non-English content covering over 30 languages, Llama 3 is prepared for multilingual applications, although performance in these languages may not equal that in English.

In pursuit of the highest data quality, Meta AI developed sophisticated filtering systems, including heuristic and NSFW filters, semantic deduplication, and text classifiers. These systems were refined using insights from previous model generations, particularly Llama 2, which was instrumental in generating training data for Llama 3’s quality-assurance classifiers.
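
Meta hasn’t published the details of these filters, but semantic deduplication is commonly done by embedding documents and dropping near-duplicates whose similarity exceeds a threshold. A rough sketch of that idea (the random embeddings and the 0.95 threshold here are placeholders, not Meta’s values):

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal on L2-normalized document embeddings.

    embeddings: (n_docs, dim), one row per document.
    Returns indices of documents to keep: a document is dropped when its
    cosine similarity to any already-kept document exceeds `threshold`.
    """
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(emb @ embeddings[j] < threshold for j in kept):
            kept.append(i)
    return kept

docs = np.random.randn(100, 128)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
print(len(semantic_dedup(docs)))
```

At trillion-token scale this greedy O(n²) pass would be replaced by clustering or approximate nearest-neighbor search, but the acceptance rule stays the same.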

For its largest models, Llama 3 utilizes a trio of parallelization strategies: data, model, and pipeline parallelization. Its most effective setup reaches over 400 TFLOPS per GPU, facilitated by training on 16,000 GPUs simultaneously within two custom-built 24,000 GPU clusters. Meta AI has also innovated a new training stack that automates error detection, handling, and maintenance to optimize GPU utilization.

Llama 3 Instruct

In refining its pretrained models for chat applications, Meta AI has employed a hybrid of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). The selection and quality assurance of prompts and preference rankings significantly influence model performance. Moreover, to ensure model safety, these instruction-fine-tuned models undergo rigorous testing, including red-teaming by experts using adversarial prompts to identify and mitigate potential misuse risks.
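
Meta hasn’t released its fine-tuning recipe, but the DPO objective itself is published (Rafailov et al., 2023). A minimal sketch with hypothetical inputs, where each argument is the summed log-probability of a response under the trainable policy or the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    The loss pushes the policy to prefer the chosen response over the
    rejected one, measured relative to the reference model; beta scales
    the implicit KL-style penalty that keeps the policy close to it.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss)
```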

The Results

Llama 3 achieves top-tier performance across leading industry benchmarks like MMLU and CommonSenseQA.

Additionally, Meta AI has curated a new, high-quality human evaluation set comprising 1,800 prompts spanning 12 critical use cases. Access to this set is restricted even within Meta AI to prevent potential overfitting by the modeling teams.

An Impressive Model

Llama 3 is a very welcome addition to the open model generative AI stack. The initial benchmark results are quite impressive, and the 400B version could rival GPT-4. Distribution is one area where Meta excelled in this release, making Llama 3 available on all major machine learning platforms. It’s been just a few hours since the release, and we are already seeing open source innovations built on Llama 3.
