I've been curious about integrating AI agents into my workflow recently, so I started looking at how this could be done using my current equipment. Data sovereignty is important to me, so sending all my data off to a remote AI service doesn't appeal. I was expecting to need to buy a new gaming rig with a couple of high end graphics cards in it, but after some research I found that this wasn't the case.
I found a system called llama.cpp, an efficient LLM inference engine written in C++. The idea behind llama.cpp is that you can host small, efficient models without having to throw thousands at equipment to get them running. As I have a Raspberry Pi 5 with 16GB of RAM in my office, I thought this was a good candidate to get running.
In this article we will look at how to get an LLM running on a Raspberry Pi via a docker container, and what sort of things we might be able to do with it.
Building The Docker Container
The Dockerfile here is pretty simple. We start from a Debian base image, install some dependencies, and then download and compile llama.cpp. Here is my Dockerfile.
FROM debian:trixie
ADD ./models /models
RUN apt update && apt install -y build-essential cmake git libcurl4-openssl-dev
WORKDIR /opt/llama
RUN git clone https://github.com/ggerganov/llama.cpp.git
WORKDIR /opt/llama/llama.cpp
RUN cmake -B build -DLLAMA_CURL=OFF && cmake --build build --config Release
ENTRYPOINT [ "/opt/llama/llama.cpp/build/bin/llama-server" ]
The build option we are sending to llama.cpp is "LLAMA_CURL", which is turned off so that llama.cpp is built without libcurl support and therefore cannot download models itself.
By setting this option to "off" we need to download models ourselves and mount them into the container at runtime, rather than bake them into the image at build time. This is good as it allows us to inject any number of LLMs to perform different actions, or even add multiple models for different tasks. In other words, once the container is built we can reuse it in different ways to accomplish our goals.
Llama.cpp comes with a lot of different flags, and there are a number we could change for this project. We could set the GGML_CUDA flag to off to force the CPU backend instead of using a graphics card, but since the Raspberry Pi doesn't have a CUDA-capable GPU this wouldn't make any difference.
With the Dockerfile in place we can build the container by running the build command.
docker build -t llama-demo:latest .
On a Raspberry Pi, even a top of the line model 5 with an NVMe drive, this takes about 20-30 minutes to complete. Time to make (and perhaps enjoy) a cup of tea whilst it is building.
Once it is finished we need to run the docker container, but first we need some LLMs to interface with.
Loading The LLM
I realise that writing this down means it will be instantly out of date. The LLMs available at the moment are being updated and replaced practically daily. As such, when following these instructions it's probably a good idea to do a little research into the models available. Try to get smaller models that will fit into the memory available on the Raspberry Pi.
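As a rough sanity check, a GGUF model needs at least its own file size in RAM, plus headroom for the context cache and the operating system. This is a rule of thumb rather than an official formula, but it's enough to rule models in or out:

```python
# Rough rule of thumb (an assumption, not an official formula): a GGUF
# model needs at least its file size in RAM, plus headroom for the
# KV cache and the OS. The 2 GB overhead figure is illustrative only.
def fits_in_ram(model_gb: float, ram_gb: float,
                overhead_gb: float = 2.0) -> bool:
    """Return True if the model file plus overhead fits in RAM."""
    return model_gb + overhead_gb <= ram_gb

# A Q8_0 quant of a 2B-parameter model is roughly 2-2.5 GB on disk,
# which fits comfortably on a 16 GB Pi 5.
print(fits_in_ram(2.5, 16.0))   # True
print(fits_in_ram(14.5, 16.0))  # False: too little headroom left
```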
For this project I selected two models.
- The Qwen3-VL-2B-Instruct-Q8_0 model is used for normal text processing.
- The mmproj-F32.gguf file is the multimodal projector used to process image requests.
You can download them from the Hugging Face website using the following commands.
mkdir models && cd models
curl -L -O "https://huggingface.co/unsloth/Qwen3-VL-2B-Instruct-GGUF/resolve/main/Qwen3-VL-2B-Instruct-Q8_0.gguf"
curl -L -O "https://huggingface.co/unsloth/Qwen3-VL-2B-Instruct-GGUF/resolve/main/mmproj-F32.gguf"
They should end up in the models directory next to your Dockerfile.
We are now ready to run the docker container.
Running Docker
Now we have the models in place we can start the docker container.
docker run --rm --device /dev/dri/card1 --device /dev/dri/renderD128 \
--volume ${PWD}/models:/models --publish 8000:8000 llama-demo:latest --no-mmap --no-warmup \
--model /models/Qwen3-VL-2B-Instruct-Q8_0.gguf --mmproj /models/mmproj-F32.gguf \
--port 8000 --host 0.0.0.0 --predict 512 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--presence-penalty 1.5
The parameters we pass to the server are varied, and actually come from both docker and llama.cpp, so let's break them down a bit.
- --device = Docker - Adds a host device to the container.
- --volume = Docker - Mounts the models directory inside the container as a volume.
- --publish = Docker - Publishes port 8000 of the container to port 8000 on the host.
- --no-mmap = Llama.cpp - Disables memory-mapping of the model. The model is slower to load, but this may reduce pageouts if not using mlock (which we aren't in this case).
- --no-warmup = Llama.cpp - Skips the warm-up run with an empty query.
- --model = Llama.cpp - The path to the model.
- --mmproj = Llama.cpp - The path to a multimodal projector file. In other words, this is the path to our vision-capable model.
- --port = Llama.cpp - The port to start the server on. We are using 8000 here and publishing this as port 8000 externally.
- --host = Llama.cpp - The host address the server binds to. 0.0.0.0 allows access from anywhere.
- --predict = Llama.cpp - The number of tokens to predict. The default for this value is -1, which means infinity, so we need to set this to a value that will fit into memory.
- --temp = Llama.cpp - The temperature (default 0.8). This is an indication of how much variation the LLM will inject from one response to the next. A value of 0 means that the response will probably be the same for a given input.
- --top-p = Llama.cpp - Top-p sampling (default 0.9; setting it to 1 disables it). Lowering this gives the model less cumulative probability to sample from, meaning that the output should be slightly more predictable.
- --top-k = Llama.cpp - Top-k sampling (default 40; setting it to 0 disables it). Lowering this gives the model fewer tokens to select from at each step, meaning that the output should be less creative.
- --presence-penalty = Llama.cpp - Repeat alpha presence penalty (default 0.0 = disabled). This discourages our model from reproducing concepts and words that have already been used in the response.
There are lots and lots of parameters to play with, and you can see them all on the llama.cpp server documentation page. I would highly advise playing with these values a bit to see how they affect your output. The values detailed above are key to tweaking your model and make a good starting point.
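To get a feel for what top-k and top-p actually do, here is a toy sketch of the filtering they describe. This is an illustration over a made-up token distribution, not llama.cpp's actual sampler:

```python
# Toy illustration of top-k and top-p filtering, not llama.cpp's actual
# sampler. We start from a made-up next-token distribution.
probs = {"dog": 0.5, "hound": 0.3, "cat": 0.15, "tree": 0.05}

def top_k(dist, k):
    """Keep only the k most probable tokens, then renormalise."""
    kept = dict(sorted(dist.items(), key=lambda kv: -kv[1])[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def top_p(dist, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalise."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

# --top-k 2: only "dog" and "hound" survive.
print(top_k(probs, 2))
# --top-p 0.8: "dog" (0.5) + "hound" (0.3) reach 0.8, so "cat" and
# "tree" are dropped.
print(top_p(probs, 0.8))
```

Smaller values for either flag shrink the candidate pool at each step, which is why lowering them makes the output more predictable.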
Once the docker container is running it will start up the web interface so that we can ask questions. We can now go to the address of the Pi (or localhost if you are on the Pi desktop) with port 8000 to see the interface.
It should look a little like this.

You can use this to interface with the model, and it works quite well. The response is fairly slow, but quick enough that you should be able to keep up with what it is outputting in real time.
Notice the little paperclip icon in the above image? That means we can upload a picture to the interface and ask it to generate alt text for an image. A word of warning here: don't upload a massive image file to the interface, as it will take a long time to process.
As it happens, vision models are often able to extract just as much meaning from a low resolution image as from a high resolution one. This means we can scale the image down to, say, 400 pixels, and the alt text generated for the image is just as good while taking a fraction of the time to generate.
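The actual downscaling can be done with any image tool; the only calculation needed is the target size. A small sketch, assuming we cap the longest edge at 400 pixels:

```python
def capped_size(width: int, height: int, max_edge: int = 400):
    """Return (new_width, new_height) with the longest edge capped at
    max_edge, preserving aspect ratio. Images already small enough are
    returned unchanged."""
    longest = max(width, height)
    if longest <= max_edge:
        return (width, height)
    scale = max_edge / longest
    return (round(width * scale), round(height * scale))

# A 4000x3000 photo becomes 400x300 before we upload it.
print(capped_size(4000, 3000))  # (400, 300)
```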
One thing to watch for in this interface is that it lacks the full context window you might be used to with online agents and bigger models. The first couple of interactions work fine, but you soon fill up the context bucket, at which point the server rejects any more requests. The --ctx-size parameter controls how large this context window is, so increasing it (at the cost of more memory) should push that limit back.
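To guess how quickly a conversation eats through the context, a very rough heuristic can help: English text averages about four characters per token. This is an approximation, not the model's real tokenizer:

```python
# Rough heuristic (an assumption, not the model's real tokenizer):
# English text averages about four characters per token.
def estimate_tokens(text: str) -> int:
    """Very rough token estimate for judging context usage."""
    return max(1, round(len(text) / 4))

prompt = "What is a basset hound?"
print(estimate_tokens(prompt))  # roughly 6 tokens
```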
Using Docker Compose
A better way of getting set up is to use a docker compose file, which means you don't need to enter that whole command on the command line every time you want to start the server.
Create a docker-compose.yml file and add the following content.
services:
  llama:
    image: llama-demo
    devices:
      - "/dev/dri/card1:/dev/dri/card1"
      - "/dev/dri/renderD128:/dev/dri/renderD128"
    build:
      context: .
    restart: unless-stopped
    ports:
      - "8000:8000"
    command: --no-mmap --no-warmup --model /models/Qwen3-VL-2B-Instruct-Q8_0.gguf --mmproj /models/mmproj-F32.gguf --port 8000 --host 0.0.0.0 --predict 512 --temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5
    volumes:
      - ./models:/models:ro
With that in place you just need to run docker compose up and the LLM server will be ready to go.
Making Use Of Llama.cpp
Outside of the web interface, the llama.cpp web server also provides an API, which has a comprehensive documentation page on the GitHub repo. I won't post all of it here, but there are a couple of commands that make for good tests.
To see if the API is running you can go to the /health path of your llama server. This gives a response telling you if everything is ok.
$ curl http://192.168.1.2:8000/health
{"status":"ok"}
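If you are scripting against the server, it can be handy to wait for /health to report ok before sending any prompts. A minimal sketch, assuming the server is on localhost port 8000 as configured above:

```python
import json
import time
import urllib.error
import urllib.request

def is_healthy(body: str) -> bool:
    """Parse a /health response body and report whether status is ok."""
    try:
        return json.loads(body).get("status") == "ok"
    except (json.JSONDecodeError, AttributeError):
        return False

def wait_for_server(url="http://localhost:8000/health", attempts=30):
    """Poll /health once a second until the server is up. The URL is an
    assumption; substitute your Pi's address."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if is_healthy(resp.read().decode()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, keep polling
        time.sleep(1)
    return False
```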
You can send a simple prompt to the API using the /completion endpoint, which will respond to the prompt with some generated text.
$ curl --request POST --url http://192.168.1.2:8000/completion --header "Content-Type: application/json" --data '{"prompt": "What is a basset hound?","n_predict": 128}'
You can also use the endpoint /v1/chat/completions to create more complex interactions with the system.
curl http://192.168.1.2:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{
"messages": [
{
"role": "system",
"content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
"role": "user",
"content": "Write about basset hounds"
}
]}'
I won't post the entire replies to these requests, but they respond in a few seconds with text replies. It's quite easy to build upon these features using some simple API calls.
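As a sketch of how those API calls might be wrapped in code (the address and payload shape follow the curl examples above; adjust the host for your Pi):

```python
import json
import urllib.request

# Assumed server address; substitute your Pi's IP. The /completion and
# /v1/chat/completions endpoints are the ones demonstrated above.
BASE_URL = "http://localhost:8000"

def build_chat_payload(system: str, user: str) -> dict:
    """Assemble the JSON body for /v1/chat/completions."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ]
    }

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the llama.cpp server and decode the reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

# Example usage (requires the server to be running):
# reply = post_json("/v1/chat/completions", build_chat_payload(
#     "You are an AI assistant.", "Write about basset hounds"))
# print(reply["choices"][0]["message"]["content"])
```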
Drupal Integration
Drupal has a really good ecosystem of AI integrations and it's also possible to integrate with llama.cpp agents.
Require and install the AI module and the included AI Content Suggestions module.
Next, require and install the Ollama Provider module. Since the llama.cpp server implements the same protocols as Ollama, we can just point Drupal at our llama.cpp server and the provider module should be able to handle the integration for us. We then need to set up the Ollama provider to point at your llama.cpp server, after which you can use the AI agent to perform certain actions on your content.
Once that is complete, the model you loaded into llama.cpp when you ran the docker container should appear as an option in your Drupal configuration.
As a simple example I allowed the AI module to suggest a title for the content on the page, which added a little button to the page creation form that I could click on and pull back an AI generated title (highlighted in red on the right hand side).

This does take a couple of seconds, but it works very well. I added some of the other options to suggest content or to evaluate the readability of the text, and they all worked well too.
I also installed the AI Agents module and configured the llama.cpp server to be an interactive AI chat bot.

Again, it was a little slow to respond, but it worked well enough for a local test.
I wouldn't recommend using this chatbot system on a live site with lots of users as it won't be able to keep up with the traffic. I would also recommend not using this system without carefully going through the settings to ensure that the chat bot won't go off the rails and start talking about things that aren't related to your site. In fact it's better to use a RAG system to power the chatbot rather than just a raw AI agent like this.
Conclusion
The LLM we create here is quite slow, but still very usable. The speed is about 4.8 tokens/s, which means you will likely be able to read as the text is generated.
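That claim roughly checks out with some back-of-envelope arithmetic. Using the common rule of thumb of about 0.75 words per token (an approximation, not a measurement for this model), 4.8 tokens/s lands in the range of typical silent reading speed (roughly 200-250 words per minute):

```python
# Back-of-envelope check (0.75 words per token is a common rule of
# thumb, not a measured value for this model).
tokens_per_second = 4.8
words_per_token = 0.75
words_per_minute = tokens_per_second * words_per_token * 60
print(words_per_minute)  # 216.0
```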
It does, however, work, and shows that you don't need to have a beefy graphics card or a massive computer to set up a local LLM. Just a little bit of patience.
I would highly suggest playing around with the parameters a little as it can really change the output. The Top-P and Top-K parameters in particular can really influence the output of the LLM and are worth altering to see the outcome.
The small amount of memory that we have here also means that the context window for your requests is quite small. It is still big enough that you can include a fair amount of context before the server starts rejecting any more input.
It was also nice to see that I was able to interface the LLM with a Drupal site by just installing a couple of modules. In fact, the Drupal AI experience was quite nice and comes with a lot of sensible defaults for context and field setup that means you don't have to configure everything from scratch.
The models I have chosen above probably aren't the best solutions here either. If you are just using text output then the non-VL Qwen3 models might produce better results. Also, the Qwen Q4 models will have a smaller memory footprint and be quite close to Q8 output. Thankfully, due to the way we set up the docker container, starting the image with a new model is very easy. I'll certainly be swapping out the models and tweaking the parameters to see how things change and what works.
This is an excellent way of keeping the data you feed your AI model within your control, which isn't always the case for other AI systems. If you have a spare Raspberry Pi then running a system like this might be a good option when testing things out. The fact that everything is in a docker container means you can get set up easily without having to install too many dependencies.