Deploy a Local AI Chatbot on a Single Node

In this article we will see how to easily deploy a local AI chatbot on a single physical server.

The prerequisite is a physical server with a capable GPU (at least 16GB of VRAM to run a decent LLM), a good amount of RAM, and a CPU with at least 16 to 24 threads.

The Config that I am using in this setup is:

CPU: Ryzen 9 5900x
Memory: 80GB
GPU: RTX 5060 Ti.

Once the hardware is ready we can proceed with installing the OS. I am using Ubuntu here for easy access to a lot of community developments.

Once the OS is installed, we start with the steps below:

ON THE OS

First, prepare the base operating system by updating it and installing the necessary NVIDIA drivers. A reboot is required to load the new drivers.

#apt update
#apt install nvidia-driver-575-open --> This may change depending on your OS as well as your GPU
#reboot

After the reboot, check that the OS has recognized the NVIDIA GPU:
#nvidia-smi --> should give output with details of the NVIDIA graphics card

Next, install the NVIDIA Container Toolkit, which allows container runtimes like Docker and containerd to recognize and use NVIDIA GPUs.

#curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
#apt-get update
#apt-get install -y nvidia-container-toolkit
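
The RKE2 containerd config prepared later in this guide points at /usr/bin/nvidia-container-runtime, so a quick sanity check that the binary is now present:

#which nvidia-container-runtime --> should print /usr/bin/nvidia-container-runtime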

Install RKE2

Set up the DNS resolver file that the kubelet will use:

vim /etc/rke2-resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4

Prepare the Config to use Cilium as the CNI and to make the container runtime aware of the NVIDIA runtime for GPU access.

mkdir -p /etc/rancher/rke2/
vim /etc/rancher/rke2/config.yaml
cni: "cilium"
containerd-runtime-config: |
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runtime.v1.linux"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
kubelet-arg:
- "resolv-conf=/etc/rke2-resolv.conf"

Download and start the RKE2 server service:

curl -sfL https://get.rke2.io | sudo sh -
sudo systemctl enable rke2-server.service
sudo systemctl start rke2-server.service
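
On a default RKE2 install, the kubeconfig and the bundled kubectl live under /etc/rancher/rke2 and /var/lib/rancher/rke2/bin, so export them before running the kubectl commands that follow and confirm the node comes up Ready:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin
kubectl get nodes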

Configure Persistent Storage:

Deploy the Local Path Provisioner to enable dynamically provisioned persistent storage using the host's local disk. Its config map can be edited to change the path on the host where volumes are stored, and the storage class is then set as the default.

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
kubectl edit configmap -n local-path-storage local-path-config
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
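
To confirm the provisioner is running and that local-path is now marked as the default storage class:

kubectl get pods -n local-path-storage
kubectl get storageclass --> local-path should show (default)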

Install MetalLB

Install MetalLB to provide Load Balancer services for your cluster, allowing external access to internal services. An IP address pool is configured for MetalLB to assign to services.

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.7/config/manifests/metallb-native.yaml

Create the IP pool configuration file:

vim metallb-config.yaml

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.140-192.168.1.150
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: metallb-system
spec:
  ipAddressPools:
    - first-pool

Apply the configuration:

kubectl apply -f metallb-config.yaml
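
Verify that MetalLB is running and that the pool and advertisement defined above were created:

kubectl get pods -n metallb-system
kubectl get ipaddresspools.metallb.io -n metallb-system
kubectl get l2advertisements.metallb.io -n metallb-system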

Install NVIDIA GPU Operator

The NVIDIA GPU Operator automates the management of NVIDIA GPU resources in a Kubernetes cluster.

Create a ConfigMap to enable GPU time-slicing:

vim slice.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
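
The gpu-operator namespace is only created later by the Helm install, so if it does not exist on the cluster yet, create it before applying the ConfigMap:

kubectl create namespace gpu-operator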

Apply the configuration

kubectl apply -f slice.yaml -n gpu-operator

Add Helm Repos for GPU Operator:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia 
helm repo update

Install the Operator:

helm upgrade --install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator \
--set toolkit.env[0].name=CONTAINERD_SOCKET \
--set toolkit.env[0].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[1].name=CONTAINERD_CONFIG \
--set toolkit.env[1].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set devicePlugin.config.name=time-slicing-config

Sometimes the time-slicing config is not applied correctly, as happened in my case, so I had to patch the cluster policy again using the command below:

kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
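
Once the operator pods settle, the node should advertise the sliced GPU capacity, which with the replicas value above should be 4. A quick way to check:

kubectl get pods -n gpu-operator
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}' --> should print 4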

Next, deploy the AI stack:

Add the Helm repos for Milvus, Ollama, and OpenWebUI:

helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo add open-webui https://helm.openwebui.com/
helm repo add ollama-helm https://jmorganca.github.io/ollama-helm/
helm repo update

Install Milvus

Install Milvus, a vector database essential for Retrieval-Augmented Generation (RAG). It is deployed in standalone mode using the local-path storage class for persistence.

helm upgrade --install milvus milvus/milvus --set cluster.enabled=false --set etcd.replicaCount=1 --set minio.mode=standalone --set pulsarv3.enabled=false --set standalone.enabled=true  --set persistence.storageClass="local-path" --set persistence.size="200Gi" --set log.level=debug --namespace milvus --create-namespace

Note: Since it is a single node, I disabled a lot of components here for an easy deployment; you may want to review and tune the installation depending on your preferences.
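
Check that the standalone Milvus components (standalone, etcd, and MinIO, per the flags above) come up and that their volumes were provisioned before proceeding:

kubectl get pods -n milvus
kubectl get pvc -n milvus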

Install Ollama:

Install Ollama to serve large language models. The configuration specifies using the nvidia runtime to enable GPU acceleration and allocates persistent storage for models.

Create the Ollama values file:

vim ollama_custom_overrides-2.yaml

runtimeClassName: nvidia
ollama:
  image:
    repository: ollama/ollama
    tag: latest
  gpu:
    enabled: true
    type: 'nvidia'
    number: 1
    nvidiaResource: "nvidia.com/gpu"
service:
  type: ClusterIP
persistentVolume:
  enabled: true
  storageClass: "local-path"
  size: 500Gi

Install the Helm chart:

helm install ollama ollama-helm/ollama -f ollama_custom_overrides-2.yaml --namespace ollama --create-namespace

Verify GPU is being used by checking the pod logs and running nvidia-smi inside the container.
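For example, assuming the chart creates a Deployment named ollama in the ollama namespace, the checks and an initial model pull could look like this (llama3.1 is the model referenced later in the OpenWebUI config):

kubectl logs -n ollama deploy/ollama | grep -i gpu
kubectl exec -n ollama deploy/ollama -- nvidia-smi
kubectl exec -n ollama deploy/ollama -- ollama pull llama3.1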

Install OpenWebUI

Finally, install OpenWebUI, a frontend for interacting with LLMs.

The configuration disables its built-in Ollama and connects it to the standalone Ollama and Milvus services deployed earlier.

Create the OpenWebUI values file:

vim open-webui-override-2.yaml
ollama:
  enabled: false
pipelines:
  enabled: false
persistence:
  enabled: true
  storageClass: "local-path"
  size: 50Gi
extraEnv:
  - name: OLLAMA_BASE_URL
    value: "http://ollama.ollama.svc.cluster.local:11434"
  - name: DEFAULT_MODELS
    value: "llama3.1"
  - name: RAG_EMBEDDING_MODEL
    value: "BAAI/bge-large-en-v1.5"
  - name: VECTOR_DB
    value: "milvus"
  - name: MILVUS_URI
    value: "http://milvus.milvus.svc.cluster.local:19530"
service:
  type: LoadBalancer

Install the Helm chart:

helm install open-webui open-webui/open-webui -f open-webui-override-2.yaml --namespace open-webui --create-namespace

In my case a few parameters didn't take effect from the values file, so I manually patched the StatefulSet as below:

kubectl set env statefulset/open-webui -n open-webui ENABLE_OLLAMA_API='True' ENABLE_DIRECT_CONNECTION='True' OLLAMA_BASE_URL="http://ollama.ollama.svc.cluster.local:11434" ENABLE_RAG='True' DEFAULT_MODELS="llama3.1" RAG_EMBEDDING_MODEL="BAAI/bge-large-en-v1.5" VECTOR_DB="milvus" MILVUS_URI="http://milvus.milvus.svc.cluster.local:19530"
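
Once the pod is running, get the external IP that MetalLB assigned to the open-webui service (the service name here assumes the chart default) and open it in a browser:

kubectl get svc -n open-webui open-webui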
