Warning: 🚧 Cortex.cpp is currently in development. The documentation describes the intended functionality, which may not yet be fully implemented.

Setting Up Cortex with Docker

This guide walks you through setting up and running Cortex with Docker.

Prerequisites

  • Docker or Docker Desktop
  • nvidia-container-toolkit (for GPU support)
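
Before building, you can optionally run a quick sanity check to confirm that Docker works and, for GPU support, that the NVIDIA Container Toolkit is wired up. The CUDA image tag below is only an example; use any tag available on Docker Hub.

    # Confirm Docker is installed and the daemon is reachable
    docker --version
    docker info

    # (GPU only) confirm containers can see the GPU; the image tag is just an example
    docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi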

Setup Instructions

  1. Clone the Cortex Repository


    git clone https://github.com/janhq/cortex.cpp.git
    cd cortex.cpp
    git submodule update --init

  2. Build the Docker Image

    • To use the latest versions of cortex.cpp and cortex.llamacpp:

      docker build -t cortex --build-arg CORTEX_CPP_VERSION=$(git rev-parse HEAD) -f docker/Dockerfile .

    • To specify versions:

      docker build --build-arg CORTEX_LLAMACPP_VERSION=0.1.34 --build-arg CORTEX_CPP_VERSION=$(git rev-parse HEAD) -t cortex -f docker/Dockerfile .
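
    • Optionally, confirm the image was built and is available locally:

      docker image ls cortex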

  3. Run the Docker Container

    • Create a Docker volume to store models and data:

      docker volume create cortex_data

    • Run in GPU mode (requires the NVIDIA Container Toolkit):

      docker run --gpus all -it -d --name cortex -v cortex_data:/root/cortexcpp -p 39281:39281 cortex

    • Run in CPU mode:

      docker run -it -d --name cortex -v cortex_data:/root/cortexcpp -p 39281:39281 cortex
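
    • Optionally, verify the container is running and the API is reachable (the /v1/engines endpoint used in the Usage section below doubles as a simple liveness check):

      docker ps --filter name=cortex
      curl http://localhost:39281/v1/engines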

  4. Check Logs (Optional)


    docker logs cortex

  5. Access the Cortex Documentation API

    With the container running, the API reference is served on the same port; open http://localhost:39281 in your browser.

  6. Access the Container and Try Cortex CLI


    docker exec -it cortex bash
    cortex --help

Usage

With the container running, you can use the following commands to interact with Cortex. Ensure curl is installed on your host machine.

1. List Available Engines


curl --request GET --url http://localhost:39281/v1/engines --header "Content-Type: application/json"

  • Example Response

    {
      "data": [
        {
          "description": "This extension enables chat completion API calls using the Onnx engine",
          "format": "ONNX",
          "name": "onnxruntime",
          "status": "Incompatible"
        },
        {
          "description": "This extension enables chat completion API calls using the LlamaCPP engine",
          "format": "GGUF",
          "name": "llama-cpp",
          "status": "Ready",
          "variant": "linux-amd64-avx2",
          "version": "0.1.37"
        }
      ],
      "object": "list",
      "result": "OK"
    }
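
  • If you have jq installed, you can filter this response on the command line, for example to show only engines whose status is "Ready" (this assumes the response shape shown above):

    curl --silent --request GET --url http://localhost:39281/v1/engines | jq '.data[] | select(.status == "Ready")'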

2. Pull Models from Hugging Face

  • Open a terminal and run websocat ws://localhost:39281/events to capture download events. See the websocat repository (https://github.com/vi/websocat) for installation instructions.

  • In another terminal, pull models using the commands below.


    # Pull model from Cortex's Hugging Face hub
    curl --request POST --url http://localhost:39281/v1/models/pull --header 'Content-Type: application/json' --data '{"model": "tinyllama:gguf"}'


    # Pull model directly from a URL
    curl --request POST --url http://localhost:39281/v1/models/pull --header 'Content-Type: application/json' --data '{"model": "https://huggingface.co/afrideva/zephyr-smol_llama-100m-sft-full-GGUF/blob/main/zephyr-smol_llama-100m-sft-full.q2_k.gguf"}'

  • After the models have been pulled successfully, run the command below to list them.


    curl --request GET --url http://localhost:39281/v1/models

3. Start a Model and Send an Inference Request

  • Start the model:


    curl --request POST --url http://localhost:39281/v1/models/start --header 'Content-Type: application/json' --data '{"model": "tinyllama:gguf"}'

  • Send an inference request:


    curl --request POST --url http://localhost:39281/v1/chat/completions --header 'Content-Type: application/json' --data '{
      "frequency_penalty": 0.2,
      "max_tokens": 4096,
      "messages": [{"content": "Tell me a joke", "role": "user"}],
      "model": "tinyllama:gguf",
      "presence_penalty": 0.6,
      "stop": ["End"],
      "stream": true,
      "temperature": 0.8,
      "top_p": 0.95
    }'
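
  • The request above streams the response as server-sent events. For output that is easier to read in a terminal, you can set "stream": false and, assuming the OpenAI-compatible response shape, extract the reply with jq:

    curl --silent --request POST --url http://localhost:39281/v1/chat/completions --header 'Content-Type: application/json' --data '{
      "messages": [{"content": "Tell me a joke", "role": "user"}],
      "model": "tinyllama:gguf",
      "stream": false
    }' | jq -r '.choices[0].message.content'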

4. Stop a Model

  • To stop a running model, use:

    curl --request POST --url http://localhost:39281/v1/models/stop --header 'Content-Type: application/json' --data '{"model": "tinyllama:gguf"}'
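
5. Clean Up (Optional)

  • When you are finished, stop and remove the container. The cortex_data volume keeps your downloaded models, so remove it only if you also want to delete them:

    docker stop cortex
    docker rm cortex
    # Also delete downloaded models and data (optional)
    docker volume rm cortex_data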