What are LLMs? An intro into AI, models, tokens, parameters, weights, quantization and more

To keep up with everything happening in the world of artificial intelligence, it helps to grasp the key terms and concepts behind the technology.

In this introduction, we are going to dive into what generative AI is, looking at the technology and the models it is built on. We'll discuss how these models are built, trained, and deployed into the world.

We'll also dig into questions like "How 'large' is a large language model?" and take a look at the relationship between model size and performance. Along the way, we will cover terms you might have heard, like parameters, weights, and tokens.

Lastly, we'll explore when you might want to reduce a model's size and go over quantization and sparsity, two techniques that effectively do just that.

What are AI, machine learning, and models?

First things first, large language models are a subset of AI (artificial intelligence) and ML (machine learning). UC Berkeley defines AI and machine learning as follows:

  • AI refers to any of the software and processes that are designed to mimic the way humans think and process information.
  • On the other hand, machine learning specifically refers to teaching devices to learn from the data in a given dataset without manual human interference.

Then there are models. In artificial intelligence, a model is a representation of a system or process that is used to make predictions or decisions based on data. In other words, it is a mathematical algorithm that is trained on a dataset to learn patterns and relationships in the data. Once trained, the model can be used to make predictions or decisions on new data.

Training and inference are different stages of a model's lifecycle

Training and inference are two distinct phases in the lifecycle of a machine learning model.

During training, the model learns from the input data and adjusts its parameters to minimize the difference between its predictions and the actual target values. This process involves backpropagation, optimization algorithms, and iterative updates to the model's parameters.

Inference is the phase where the trained model is used to make predictions on new, unseen data. During inference, the model takes input data and generates output predictions based on the learned patterns and relationships in the training data. Inference is typically faster and less computationally intensive than training, as the model's parameters are fixed and do not need to be updated.
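
To make the two phases concrete, here is a minimal sketch. PyTorch is an assumption on our part; the same lifecycle applies in any machine learning framework. Training adjusts the parameters via backpropagation, while inference runs the frozen model on new input.

```python
# Minimal training-vs-inference sketch. PyTorch is an assumption here;
# the same lifecycle applies to any machine learning framework.
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # a tiny model: one weight and one bias
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.tensor([[1.0], [2.0], [3.0]])
y = 2 * x  # the pattern to learn: y = 2x

# Training: iteratively adjust parameters to minimize prediction error
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # backpropagation computes gradients
    optimizer.step()  # the optimizer updates the parameters

# Inference: parameters are now fixed; we only run the forward pass
with torch.no_grad():
    print(model(torch.tensor([[4.0]])))  # close to 8.0
```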

What are Large Language Models (LLMs)?

While there are many different kinds of models, today we are going to focus on large language models.

LLMs are a type of computer program that can recognize, understand, and generate human language. Built on machine learning, they are trained on huge datasets, which is where the "large" in the name comes from. They are used to generate human-like text, answer questions, and perform other natural language processing tasks.
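
As a quick illustration, here is one way to run a small LLM for text generation. Our choice of the Hugging Face transformers library and of GPT-2 as the model are assumptions for the example; any hosted or local LLM would do.

```python
# A minimal text-generation sketch. The transformers library and the
# GPT-2 model are assumptions; any LLM exposes a similar interface.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large language models are", max_new_tokens=20)
print(result[0]["generated_text"])
```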

Parameters versus Weights versus Tokens

When talking about large language models (LLMs), it's helpful to understand the distinction between parameters, weights, and tokens. These terms are often used interchangeably, but they refer to different aspects of the model's architecture, training, and input/output.

  • Parameters: Parameters are variables that the model learns during the training process. These parameters are adjusted through backpropagation to minimize the difference between the model's predictions and the actual target values.
  • Weights: Weights are a subset of the parameters in a model that represent the strength of connections between variables. During training, the model adjusts these weights to optimize its performance. Weights determine how input tokens are transformed as they pass through the layers of the model.
  • Tokens: Tokens are the basic units of input and output in a language model. In natural language processing tasks, tokens typically represent words, subwords, or characters. During training and inference, the LLM processes input text as a sequence of tokens, each representing a specific word or symbol in the input text. The model generates output by predicting the most likely token to follow a given sequence of input tokens (see the tokenization sketch after this list).
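
Here is the tokenization sketch mentioned above, using the GPT-2 tokenizer from the transformers library (again an assumption for illustration; every LLM ships with its own tokenizer). Note how a single word can split into several subword tokens:

```python
# Tokenization sketch: text in, token IDs out. The GPT-2 tokenizer
# is an assumption; each model family uses its own vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Quantization shrinks large language models."
token_ids = tokenizer.encode(text)

print(tokenizer.convert_ids_to_tokens(token_ids))  # subword pieces
print(token_ids)  # the integers the model actually processes
```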

What makes a large language model "large"?

The size of a language model can be measured in several ways, depending on the context and the specific characteristics of the model. Some common metrics used to describe the size of a language model include:

  1. Parameter Count: The number of parameters in an LLM typically represents the size or complexity of the model, with larger models having more parameters.

  2. Memory Footprint: The size of a model in terms of memory footprint can also indicate its scale. Large models often require significant amounts of memory to store their parameters during training and inference (a back-of-the-envelope estimate follows this list).

  3. Compute Requirements: The computational complexity of training and running a model can also indicate its size. Larger models typically require more computational resources (such as CPU or GPU cores) and longer training times.
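
Parameter count and memory footprint are directly related: a rough estimate of a model's memory footprint is its parameter count multiplied by the bytes used to store each parameter. The helper below, estimate_memory_gb, is a hypothetical name we introduce just to make the arithmetic explicit:

```python
# Back-of-the-envelope memory estimate: parameters × bytes per parameter.
# This ignores activations, optimizer state, and other runtime overhead.
def estimate_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1024**3

for precision, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"7B params @ {precision}: {estimate_memory_gb(7e9, nbytes):.1f} GB")
# 7B params @ float32: 26.1 GB
# 7B params @ float16: 13.0 GB
# 7B params @ int8: 6.5 GB
```

Notice how storing each parameter with fewer bits shrinks the footprint proportionally, which is exactly the lever quantization pulls, as we'll see below.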

Smaller versus Larger Model Sizes

The size of a model can vary widely depending on factors such as the complexity of the task it's designed for. For example, models used for tasks like natural language processing (NLP) or computer vision tend to be larger due to the complexity of the underlying data and tasks.

There has been a trend towards increasingly larger models in the field of deep learning, driven by advances in hardware, algorithms, and access to large-scale datasets. For example, OpenAI's GPT-3 has 175 billion parameters, dwarfing earlier models like Google's BERT (up to 340 million) and pushing the boundaries of what was previously considered "large."

In general, we can categorize language models into three broad categories based on their size: small models with up to a few hundred million parameters, medium models with a few billion, and large models with tens to hundreds of billions.

The Relationship Between Model Size and Performance

The size of a language model can have a significant impact on its performance and accuracy. In general, larger models tend to perform better on complex tasks and datasets, as they have more capacity to learn complex patterns and relationships in the data.

That being said, large models require more computational resources to train and run, making them more expensive and time-consuming to develop and deploy. Additionally, large models may be more prone to overfitting, where the model learns to memorize the training data rather than generalize to new, unseen data.

Reducing Model Size with Quantization and Sparsity

Large language models can be computationally expensive to train and deploy, making it challenging to scale them to real-world applications. To address this challenge, techniques have been developed to reduce the size of language models while maintaining their performance and accuracy.

  1. Sparsity introduces zeros into the parameters (weights) of the model to reduce its overall size and computational complexity. Sparse models have a significant portion of their parameters set to zero, resulting in fewer non-zero weights and connections. This reduces memory footprint and computational requirements during both training and inference.
  2. Quantization involves reducing the precision of numerical values in the model, typically by representing them with fewer bits. For example, instead of using 32-bit floating-point numbers (float32), quantization may use 8-bit integers (int8) or even fewer bits to represent weights, activations, and other parameters of the model. A short sketch of both techniques follows this list.
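
Here is the sketch of both techniques mentioned above, applied to a toy weight matrix with NumPy. Production systems use far more sophisticated schemes (calibration data, per-channel scales, structured sparsity); this only shows the core idea:

```python
# Toy sketch of sparsity (magnitude pruning) and symmetric int8
# quantization on a random weight matrix. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)

# Sparsity: zero out weights whose magnitude falls below a threshold
threshold = 0.5
sparse = np.where(np.abs(weights) < threshold, 0.0, weights)
print(f"zeroed out: {np.mean(sparse == 0):.0%} of weights")

# Quantization: map float32 weights to int8 with one shared scale
scale = np.abs(weights).max() / 127.0
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale  # approximate recovery

print("max quantization error:", np.abs(weights - dequantized).max())
```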

Check out Patrick from Ollama's lightning demo on how quantization works, from last month's AI Developer Meetup.

Deploying LLMs and AI Workloads into the World

In this article, we've scratched the surface of AI, machine learning, and large language models. We've discussed what makes a model "large," the relationship between model size and performance, and techniques to reduce model size such as quantization and sparsity.

If you are looking for the fastest way to deploy your inference models worldwide, give our platform a test drive. Get ready to deploy on high-end infrastructure, with a single click and no-stress DevOps.
