LLM (Large Language Model)

Published by botond on Sat, 2025/11/15 - 05:28


Overview

The technological explosion in artificial intelligence in recent years has centered around a single key concept: LLMs, that is, Large Language Models. But what exactly does this term mean beyond the marketing buzzword? Technically, an LLM is an artificial intelligence model based on deep learning, designed to understand, generate, and manipulate human language.

To place it on the technology map: LLMs are a subset of Artificial Intelligence (AI); within that, they belong to the Machine Learning branch, more specifically to Deep Learning, and they represent the current pinnacle of Natural Language Processing (NLP). These models have no consciousness or worldview; they are in fact highly advanced probabilistic engines. By learning from vast amounts of text data (essentially a large portion of the internet), they are able to recognize complex patterns and relationships in language, and then statistically predict what the next word or phrase is most likely to be in a given context.

This encyclopedia entry explores the world of LLMs from a systems administrator and developer perspective. We don't just scratch the surface; we dig under the hood: the transformer architecture, the mathematics of tokenization, the training phases, and the role of open-source models in modern IT infrastructures.

Attention!
Given the industry's brutally fast, day-by-day development, please note that this description reflects the state of affairs as of November 2025, so changes may occur even a month later. Read the information here with that in mind!

 

 

History and development

Language modeling is not a new science; its roots go back to statistical linguistics. Early attempts, such as N-gram models, worked on a purely statistical basis: they simply counted how often certain words followed each other. However, these systems were unable to handle long-range relationships; they did not "remember" what had happened at the beginning of the sentence.

The real breakthrough came with Google Brain researchers' now-legendary 2017 paper, "Attention Is All You Need". This publication introduced the Transformer architecture, which broke with the earlier sequential processing approaches (RNN, LSTM). The transformers' main innovation is the self-attention mechanism, which allows the model to process all words of the input text in parallel and to understand distant relationships between words (e.g., the relationship between a pronoun and its corresponding noun, even paragraphs away). This made it possible to drastically increase both model size and the amount of training data.

After that, the real competition began. OpenAI built on the transformer architecture and created the GPT (Generative Pre-trained Transformer) series, while Google's BERT model headed in the direction of text understanding. Development became exponential: the number of model parameters jumped from the order of millions to the range of hundreds of billions. The technology exploded into public consciousness at the end of 2022 with the appearance of ChatGPT, democratizing access to advanced AI. In parallel, Meta's (Facebook's) release of the Llama models started the open-source (or rather open-weight) model revolution, allowing anyone to run high-performance LLMs, even on home hardware.

 

Its operation and technological background

To understand the capabilities and limitations of LLMs—especially when integrating them via API—it is essential to understand how they "see" the world and generate responses. Behind the apparent intelligence lies rigorous mathematics and statistics.

Tokenization: the mathematics of text

The first and most important thing to know is that LLMs see numbers, not words. The input text is broken down by a tokenizer component into small units called tokens. A token can be a whole word, a word fragment, or even a single character. On average, 1000 tokens correspond to about 750 English words (in Hungarian, due to heavy inflection, this ratio is somewhat worse: the same text requires more tokens).

Each unique token is represented by a number (ID). When we send the text "Hello world!", the model actually receives a sequence of numbers (e.g. [1532, 834, 0]). This process is critical for developers, as the cost of most AI services (like the OpenAI API) and the "memory" of models (context windows) are measured in tokens, not characters or words.

Tip: Token counting

During development, we often need to check how many tokens a text occupies. An excellent tool for this is OpenAI's official Tokenizer page, where we can visually see how the model breaks up our texts.
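To make the text-to-ID mapping tangible, here is a minimal, purely illustrative Python sketch. The vocabulary and the token IDs are invented to reproduce the example above; real tokenizers (such as OpenAI's BPE-based tiktoken library) learn large subword vocabularies from data instead of using a hand-written dictionary:

```python
# Toy tokenizer sketch: a hypothetical vocabulary mapping text pieces to IDs.
# The IDs match the "Hello world!" example above; they are made up.
VOCAB = {"Hello": 1532, " world": 834, "!": 0}

def encode(text, vocab=VOCAB):
    """Greedily match the longest known vocabulary entry at each position."""
    ids, i = [], 0
    while i < len(text):
        match = None
        # Try longer pieces first, as subword tokenizers do.
        for piece, piece_id in sorted(vocab.items(), key=lambda kv: -len(kv[0])):
            if text.startswith(piece, i):
                match = (piece, piece_id)
                break
        if match is None:
            raise ValueError(f"Unknown text at position {i}: {text[i]!r}")
        ids.append(match[1])
        i += len(match[0])
    return ids

print(encode("Hello world!"))  # → [1532, 834, 0]
```

Note how " world" carries its leading space: real tokenizers also treat whitespace as part of a token, which is one reason token counts differ from word counts.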

Transformer architecture and the Attention mechanism

The heart of the processing is the already mentioned Transformer architecture. Its key is the attention mechanism. Imagine reading a long sentence: "The cat, after eating the food, fell asleep on the sofa because it was very tired." To understand what "it" refers to, we must look back to "the cat".

Transformers solve this mathematically: each token is assigned a weight that indicates how important that token is relative to the other tokens. The model "listens" to all the other words in the sentence while processing the current word. This parallel, all-to-all relationship allows LLMs to understand extremely complex, long-term relationships and nuances in text that previous technologies could not.
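A toy sketch can make this math concrete. The scaled dot-product attention below works on hand-made 2-dimensional "embeddings"; a real transformer additionally applies learned query/key/value projections and uses many attention heads in parallel, which this illustration omits:

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a toy sequence of token vectors.

    Each output is a weighted mix of ALL value vectors; the weights measure
    how strongly the current token's query matches every other token's key."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # one weight per token, summing to 1
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three invented 2-d token embeddings standing in for a three-token sentence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens, tokens, tokens)
print([[round(x, 3) for x in row] for row in out])
```

Because every token attends to every other token, the cost grows quadratically with sequence length, which is exactly why context window sizes (discussed later) are a hard engineering constraint.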

Next-Token Prediction

As magical as the answers may seem, the basic operation of LLMs is surprisingly simple: they try to predict the next token. When the model receives an input, it calculates the probability of each possible next token.

For example, if the input is "The Hungarian capital is...", the model assigns a high probability to the "Budapest" token, and low probabilities to the "Paris" or "potato" tokens. It selects the winner, appends it to the input, and the process starts over. This recursive generation continues until the model emits a special stop token or reaches the maximum length. The "creativity" (often controlled by the Temperature parameter) comes from the fact that the model sometimes chooses not the most likely token but a slightly less expected one, making the text more human and diverse.
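The sampling step described above can be sketched in a few lines of Python. The logits (raw model scores) and the three-token vocabulary are invented for illustration:

```python
import math, random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Pick the next token ID from raw model scores (logits).

    Low temperature sharpens the distribution (almost always the top token);
    high temperature flattens it: more 'creative', less predictable."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample one index according to the probabilities.
    r = rng.random()
    acc = 0.0
    for token_id, p in enumerate(probs):
        acc += p
        if r <= acc:
            return token_id
    return len(probs) - 1

# Hypothetical logits for the tokens ["Budapest", "Paris", "potato"]:
logits = [8.0, 2.0, 0.5]
print(sample_next_token(logits, temperature=0.2))  # near-greedy: almost surely 0
```

At temperature 0.2 the model behaves nearly deterministically; raising it toward 1.0 or above makes "Paris" and even "potato" occasionally appear, which is exactly the diversity/accuracy trade-off the Temperature parameter exposes in real APIs.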

Multimodality: Beyond mere text

One of the most important technological leaps of 2025 is multimodality. While earlier LLMs could only process text (text-only), the latest generation (e.g. Gemini 3, Llama 4, GPT-5) is already natively multimodal. This means that the model can not only read but also "see" and "hear": it can interpret images, videos, and audio files.

Technically, this is achieved by a special encoder transforming visual or audio inputs into the same mathematical vector space (embedding space) as text tokens. For the model, an image is simply a "sequence of tokens", similar to a sentence. This capability is revolutionary in DevOps and operations as well: system administrators can now simply submit a screenshot of an error message or a graph, and the model can interpret the problem visually, which might be harder to infer from plain-text log files.

 

 

How is an LLM made? The training phases

A modern language model is not born "ready-made"; it is the result of a long, multi-step and extremely resource-intensive process. Developers distinguish three main phases during which the model turns from a raw data set into a helpful assistant.

Pre-training

This is the most expensive and longest part of the process. In this phase, the model is "fed" an enormous amount of text data (books, websites, Wikipedia, source code) amounting to several terabytes. Learning is unsupervised, which means that no human provides the correct answers; the model has to figure out the language patterns itself by trying to guess hidden words in the text.

At the end of pre-training, the model (called a Base Model at this stage) has learned grammar and facts about the world, and is able to generate coherent text. However, in this state it is not yet suitable as a chat assistant: if we ask it "What is the capital of France?", it may not answer but instead continue the question in quiz form ("And what is Germany's?"), since it often saw such patterns in the training data (on the internet).

Fine-tuning

To make the model a useful tool, it must be taught to follow instructions (Instruction Tuning). In this phase, it is trained on a small but very high-quality, human-built question-answer data set. For example:
Instruction: "Translate this into English!"
Input: "Helló világ"
Output: "Hello world".

This process transforms raw lexical knowledge into a problem-solving ability. In the names of most open-source models, an "Instruct" or "Chat" suffix indicates that the model has passed through this phase (e.g. Llama-4-70b-Instruct).
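As an illustration, one record of such a data set can be rendered into a single training string. The "### Instruction / Input / Response" layout below follows the popular Alpaca-style convention, but exact chat templates vary per model family, and the Hungarian input "Helló világ" is assumed here purely for the sake of the translation example:

```python
# One hypothetical instruction-tuning record, mirroring the example above.
record = {
    "instruction": "Translate this into English!",
    "input": "Helló világ",
    "output": "Hello world",
}

def to_training_text(rec):
    """Render an instruction/input/output triplet into one training string."""
    return (
        f"### Instruction:\n{rec['instruction']}\n"
        f"### Input:\n{rec['input']}\n"
        f"### Response:\n{rec['output']}"
    )

print(to_training_text(record))
```

During fine-tuning, thousands of such strings teach the model that text after "### Response:" should answer the instruction rather than continue the question.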

RLHF: Human Feedback Based Learning

The most advanced models (such as Gemini, ChatGPT or Claude) also go through a third stage: Reinforcement Learning from Human Feedback (RLHF). Here, the model generates multiple possible answers to a question, which human testers evaluate and rank (which one is more useful, safer, more truthful). From this feedback, the model learns "correct behavior" and human preferences. This step is also crucial for model safety, for example, refusing dangerous or illegal requests.

 

Key Concepts: Parameters and Context

When choosing an LLM – whether it's a cloud-based API or a model running on your own server – there are two important technical parameters that determine the model's capabilities and resource requirements.

Number of parameters

We often see numbers in model names, such as 7B, 13B or 70B. The "B" stands for billion and refers to the number of model parameters. Parameters are the internal variables (weights) that the model learned during training. In simple terms, we can think of them as the model's "brain capacity" or the number of connections between its neurons.

  • Smaller models (7B - 13B): They store less knowledge and nuance, but in return they are fast and run on an average home computer or a small server (with as little as 8-16 GB of VRAM). They are ideal for simpler tasks, summarization, or classification.
  • Large models (70B+): They have vast lexical knowledge and logical abilities, and can understand complex relationships and write code. However, they require serious server infrastructure (several high-performance GPUs) to run.

Context Window

The other critical characteristic is the context window size, given in tokens (e.g. 4k, 32k, 128k). This is the model's "short-term memory": it determines how much text the model can see and process at once. This includes the system prompt, the user's question, the entire conversation history, and any attached documents.

If the conversation length exceeds this window, the model "forgets" the oldest information. A major direction of modern development is increasing this window (there are now 1-million-token windows, for example in Google's Gemini models), so that models can take in entire books or huge code bases at once. This limitation also gave rise to solutions such as RAG (Retrieval-Augmented Generation) and advanced memory-management protocols.
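A back-of-the-envelope budget check makes this limit concrete. The sketch below uses the common "~4 characters per token" rule of thumb for English, which is only a rough estimate (actual counts depend on the tokenizer and, as noted earlier, are worse for Hungarian); the parameter values are illustrative:

```python
def fits_in_context(messages, max_context=128_000, reserve_for_answer=1_000):
    """Rough check of whether a conversation still fits the context window.

    Estimates tokens as len(text) // 4 and keeps a reserve so the model
    has room left to generate its answer. Ignores per-message overhead."""
    estimated_tokens = sum(len(m) // 4 for m in messages)
    return estimated_tokens + reserve_for_answer <= max_context

history = [
    "You are a helpful assistant.",
    "Summarize this log file for me...",
    "Sure, here is the summary...",
]
print(fits_in_context(history, max_context=4_000))  # → True
```

Production systems do this with the model's real tokenizer, and when the check fails they either truncate the oldest messages or fall back to summarization/RAG instead of silently losing context.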

 

 

Ecosystem of Open and Closed Models

The AI world is currently divided into two major camps. The choice between them is not only a matter of cost, but also of data security and control.

Proprietary models

These include the most well-known commercial services: OpenAI's GPT models, Google's Gemini models, and Anthropic's Claude system. We call them "closed" because the weights (the model's "brain"), the training data, and the exact architecture are trade secrets. Users and developers can only access them through an API (Application Programming Interface) or through the vendor's own web chat interface (which itself uses an API).

Their advantages: these are currently the smartest, most advanced models on the market. They require no hardware of our own and are ready to use out of the box.
Their disadvantages: we have to transfer all our data to an external company's server (a data protection risk), usage costs money, and we are at the mercy of the provider's changes.

Open Weights models

This category is closest to the philosophy of Linux and open source. Although not all of them meet the strict definition of "open source" (where the training data is also public), the model weights can be freely downloaded. The most important players here are Meta's Llama series, Microsoft's Phi-3 / Phi-3.5, the French Mistral AI's models, and Alibaba's Qwen series.

Their advantages: full control. We can download and run the model on our own Linux server or Windows machine (e.g. using Ollama or vLLM), even without an internet connection. Our data never leaves our own infrastructure. The models are constantly fine-tuned and optimized by the community.
Their disadvantages: they require powerful hardware (especially GPU VRAM) to run, and their "raw" performance still lags slightly behind the largest closed models, although the gap is closing rapidly.

 

Most common models

The AI market is changing extremely rapidly; as of late 2025, the following models set the industry standards.

  • OpenAI: They are the market leaders in generative AI. Their models are closed source.
    • GPT-5.1: The current top model (updated November 2025), with friendlier, more understandable responses and deeper logic capabilities.
    • GPT-4.5 (Orion): OpenAI's largest model (February 2025), with excellent writing skills, better world knowledge, and fewer hallucinations - for premium tasks.
    • GPT-4o: Fast, multimodal workhorse (image/audio handling), balanced performance for everyday use.
    • GPT-4o mini: Most cost-effective small model with lightning-fast response time – ideal for simpler API calls and chat.
  • Anthropic: A company that prioritizes security and vast context.
    • Claude 4 (Opus, Sonnet, Haiku): The generation released in May 2025, offering more natural communication and outstanding long-text processing.
  • Google: The search giant's solutions are integrated into its own ecosystem, and with their 1M context window size, they are among the most powerful models on the market in this regard.
    • Gemini 3 Pro: The top model debuted in November 2025, with market-leading Context Caching capabilities.
    • Gemma 2: Google's updated open-weight model family.
  • Meta (Facebook): They are the biggest supporters of open source AI.
    • Llama 4: The model family released in spring 2025; with its 400B+ parameter count it is the new benchmark for open models.
  • xAI (Elon Musk): A development company focusing on real-time data.
    • Grok 4: The latest, "uncensored" and multimodal model that integrates tightly with the X (Twitter) platform.
  • Mistral AI: European-developed models optimized for efficiency.
    • Mixtral (new, 2025 generation): Updated versions of resource-efficient models using the "MoE" (Mixture of Experts) architecture.
  • DeepSeek: One of the new favorites of the open source community.
    • DeepSeek V3.1: A powerful open model specifically optimized for coding and mathematical tasks.

Comparison table (2025 Q4)

Model family       | Latest version     | Parameters (est.) | Context window | Type
GPT (OpenAI)       | GPT-5.1            | ~2 trillion       | Large          | Closed
Gemini (Google)    | Gemini 3 Pro       | Not public        | 1M+ tokens     | Closed
Claude (Anthropic) | Claude 4 Opus      | Not public        | Large          | Closed
Llama (Meta)       | Llama 4            | 400B+             | 128K+ tokens   | Open
Mistral            | Mixtral (new gen.) | 8x7B+             | Medium         | Open
Grok (xAI)         | Grok 4             | Not public        | Large          | Closed

 

Areas of application

LLMs are more than just chat partners; they are versatile tools that are revolutionizing workflows in many industries, but especially in IT. Let's look at some concrete examples of how this technology can be utilized in everyday life.

  • Coding and software development: One of the most common uses. Models can write complete functions, optimize existing code (refactoring), generate unit tests, or even explain the operation of a complex piece of code. Many developers use them as "pair programmers", i.e. digital companions, to increase efficiency.
  • System Administration and DevOps: For a Linux system administrator, an LLM can be an invaluable aid. It can write complex Bash scripts, analyze convoluted error messages or log files, and suggest settings for server configuration files (e.g. Apache, Nginx, Docker Compose).
  • Data processing and analysis: The models excel at processing, categorizing, and summarizing unstructured text data (e.g., customer feedback, emails, reports). They can interpret tabular data or write SQL queries based on natural language instructions.
  • RAG (Retrieval-Augmented Generation): This is perhaps the most exciting corporate use case. The "talking to our own data" concept allows us to connect an LLM to our own knowledge base (e.g. company documentation, manuals, a legal repository). In this case, the model does not rely only on what it memorized during training, but answers, accurately and verifiably, based on documents handed to it by a retrieval component (typically from our own vector database).
  • Context Caching: A new technology, led by Google (Gemini) and Anthropic, that serves as an alternative or complement to RAG. The essence is that a huge amount of static information (e.g. an entire company policy, code base, or manual library) is sent to the model once and cached. In subsequent conversations this bulk of data does not have to be re-sent (and paid for) with every question; referencing the data set's identifier is enough, which allows extremely complex, deep-knowledge company assistants to operate cost-effectively. However, storing this data is usually billed on a time basis, so we need to weigh how often the data set will be accessed in a given period: frequent access makes caching worthwhile, rare access less so. It therefore depends on the amount of data stored and the frequency of access.
  • Autonomous agents and tools: The latest evolutionary step, where the LLM is not only a passive responder but an active actor. It can use external software tools, run commands, or initiate API calls in order to solve tasks. You can read about this in detail in the next chapter.
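The RAG flow described above can be sketched in a few lines. The documents are made up, and a deliberately crude word-overlap score stands in for the learned embeddings and vector database a real system would use; only the overall flow (retrieve, then build a grounded prompt) is the point here:

```python
# Hypothetical internal knowledge base.
DOCS = [
    "Backups run every night at 02:00 and are kept for 30 days.",
    "The VPN requires two-factor authentication for all employees.",
    "Holiday requests must be submitted two weeks in advance.",
]

def score(question, doc):
    """Crude relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question, docs=DOCS):
    """Return the single most relevant document for the question."""
    return max(docs, key=lambda d: score(question, d))

def build_prompt(question, docs=DOCS):
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = retrieve(question, docs)
    return (f"Answer using only the context below.\n"
            f"Context: {context}\n"
            f"Question: {question}")

print(build_prompt("How long are backups kept?"))
```

In production, the retriever returns the top-k chunks by embedding similarity rather than a single document, but the prompt-assembly step looks essentially the same.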

 

 

The New Generation: Agents and Tool Calling

One of the most important technological innovations of 2025 is the transformation of LLMs into "agents". While traditional models only generate text (e.g. write an SQL query), an agent can also execute it.

The technical basis for this is Tool Calling. The model is able to recognize that its own lexical knowledge is not sufficient to fulfill the user's request, so it calls a predefined external tool (e.g. a calculator, a web search, or a Linux terminal command). This communication and the wiring-up of tools is standardized by MCP (Model Context Protocol), a new open standard originally developed by Anthropic and since then widely adopted, which provides a unified interface between AI models and the outside world. The model formulates the command in a structured format (usually JSON), the system executes it, and the result is sent back to the model, which responds to the user based on it.
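The round trip can be sketched as follows. The tool name, the JSON shape, and the stubbed result are all invented for illustration; real tool-calling APIs (and MCP) define their own schemas, and a real host would actually shell out instead of returning a canned string:

```python
import json

def get_load_average():
    """Stubbed tool: a real host would run `uptime` or read /proc/loadavg."""
    return "load average: 0.42, 0.35, 0.30"

# Registry of tools the host system exposes to the model.
TOOLS = {"get_load_average": get_load_average}

def handle_model_output(model_output):
    """Parse the model's structured tool request and execute the named tool."""
    request = json.loads(model_output)
    tool = TOOLS.get(request["tool"])
    if tool is None:
        return {"error": f"unknown tool: {request['tool']}"}
    return {"tool": request["tool"], "result": tool()}

# Asked "How loaded is the server?", the model might emit this JSON:
model_output = '{"tool": "get_load_average", "arguments": {}}'
print(handle_model_output(model_output))
```

The returned dictionary is what gets fed back into the model as the tool result, so it can phrase a human-readable answer; note also that the host, not the model, decides which tools exist, which is the main safety boundary of the pattern.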

DevOps and infrastructure automation

In IT operations, this capability can drastically reduce manual errors and time spent on routine tasks by up to 30-50%. Here are some practical examples:

  • CI/CD pipeline analysis: "Investigate the latest Jenkins build error!"
    → The Agent retrieves the logs, analyzes the error message, and suggests (or even executes) a Bash script to fix it.
  • Server monitoring: "Which of my containers is using the most CPU?"
    → The model invokes the docker stats command, analyzes the output, and makes optimization suggestions.
  • Incident response: "Nginx 502 error, what should I do?"
    → Checks the configuration files, restarts the service, and checks the status.
  • Infrastructure as Code: "Deploy 3 new VMs with Terraform!"
    → Generates the Terraform file and, after our approval, runs the apply command.

ISPConfig and web server administration

The technology can also be integrated into specific environments such as the ISPConfig hosting panel. A well-configured agent can translate natural-language commands into API calls:

  • Create a website: "Create a new WordPress site for client123, domain: ujweboldal.hu, with PHP 8.3!"
    → The Agent creates the client via ISPConfig's REST API, configures the Apache virtual host and the PHP version, and runs Let's Encrypt certbot to set up SSL.
  • Error diagnostics: "Why isn't domain.hu loading?"
    → The model runs a tail -f /var/log/apache2/error.log command in the background, queries the site status from the API, and if necessary suggests a fix to the .htaccess file.
  • Bulk maintenance: "Update the backup rules for all WordPress sites!"
    → Retrieves the list of all websites and, in a loop, sets up the appropriate cron job for each in ISPConfig.

 

Local execution and future trends

With the democratization of technology and the development of hardware, two defining trends have emerged by 2025: resource-efficient local (home and server-side) execution, and the deepening of models' "thinking" ability.

Efficient local execution: llama.cpp and quantization

While industrial-scale GPUs were previously required to run an LLM, the open source community has created revolutionary tools for Linux users.

  • llama.cpp and the GGUF format: This project made it possible to run large models efficiently not only on GPUs but also on ordinary processors (CPUs), even on an average laptop or server. The quantization technique reduces the precision of the model parameters (e.g. from 16 bits to 4 bits), which cuts the memory requirement drastically, to as little as a quarter, with minimal quality degradation. This makes it possible to run a 70-billion-parameter model on a beefier home computer.
  • Kubernetes and AI clusters: In large corporate environments, the use of vLLM and TGI (Text Generation Inference) containers has become standard. By scaling them under Kubernetes, organizations can build their own private AI cloud that dynamically manages GPU resources.
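The memory arithmetic behind quantization is simple enough to sketch. The function below counts only the weights themselves; the KV cache, activations, and runtime overhead come on top, so real requirements are somewhat higher:

```python
def model_memory_gb(n_params_billion, bits_per_weight):
    """Approximate memory (decimal GB) needed just to hold the weights."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# The 70B-parameter model mentioned above:
print(model_memory_gb(70, 16))  # 16-bit weights: 140.0 GB
print(model_memory_gb(70, 4))   # 4-bit quantized: 35.0 GB, a quarter of that
```

This is why a 70B model is out of reach for consumer GPUs at full precision, yet becomes feasible on a high-end workstation (or CPU with enough RAM) once quantized to 4 bits.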

Reasoning models

Another big innovation of 2025 is the emergence of "System 2" thinking in AI (e.g. the OpenAI o1 series, Google Gemini Deep Thinking). Traditional LLMs blurt out the answer immediately. In contrast, reasoning models "take their time" before answering: they work through an internal chain of thought, plan, check their own logic, and only communicate the final result to the user. This drastically reduces hallucinations on mathematical and programming tasks, but in return it is slower and more expensive to run.

 

 

Challenges and limitations

While the capabilities of LLMs are impressive, it is important to be aware of their limitations. The most common problem is hallucination: since the model works on a probabilistic basis, it may state completely false facts with full confidence, just because the order of the words seems statistically correct. Therefore, their use in mission-critical systems without human supervision is not recommended.

A further challenge is bias: since the models were trained on data found on the internet, they inevitably absorbed the stereotypes and prejudices found there. Finally, the resource requirements are not negligible either: training and running (inference) large models consumes huge amounts of energy and requires specialized hardware (GPUs), which raises both environmental and economic concerns.