vLLM Ray Serve deployment

This guide provides step-by-step instructions to deploy a vLLM-supported model using Ray Serve on a Kubernetes cluster. The deployment exposes an OpenAI-compatible API for inference and supports advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.

Prerequisites

  • A Kubernetes cluster with KubeRay installed.

  • GPU-enabled nodes for efficient inference.

Procedure

  1. Navigate to the inference-service Helm chart under the infra directory and configure your deployment using the values.yaml file. The available options are listed in the tables below, followed by an example configuration sketch:

    Ray Service configuration:

    | Parameter                 | Description                                             | Default |
    |---------------------------|---------------------------------------------------------|---------|
    | inferenceServiceName      | Name of the Ray Service (max 63 chars, DNS-compliant).  | -       |
    | inferenceServiceNamespace | Namespace where the Ray Service is deployed.            | default |

    Application configuration:

    | Parameter                  | Description                                       | Default                         |
    |----------------------------|---------------------------------------------------|---------------------------------|
    | name                       | Application name (e.g., deepseek-v3, llama-3).    | -                               |
    | routePrefix                | Route prefix for serving (e.g., "/").             | /                               |
    | importPath                 | Import path for the model.                        | lumigator.jobs.vllm.serve:model |
    | numReplicas                | Number of replicas.                               | 1                               |
    | numCpus                    | CPUs allocated per Ray actor.                     | -                               |
    | workingDir                 | Model working directory.                          | -                               |
    | pip                        | List of Python packages.                          | ["vllm==0.7.2"]                 |
    | modelID                    | Model path (Hugging Face ID or local directory).  | -                               |
    | servedModelName            | Name of the served model.                         | -                               |
    | tensorParallelism          | Tensor parallelism (GPUs per node).               | -                               |
    | pipelineParallelism        | Pipeline parallelism (number of nodes).           | -                               |
    | dType                      | Data type (float32, float16, bfloat16, etc.).     | -                               |
    | gpuMemoryUtilization       | Fraction of GPU memory allocated for inference.   | 0.80                            |
    | distributedExecutorBackend | Executor backend (ray or mp).                     | ray                             |
    | trustRemoteCode            | Allow custom code from Hugging Face Hub.          | true                            |

    Ray Cluster configuration:

    | Parameter                              | Description                              | Default                         |
    |----------------------------------------|------------------------------------------|---------------------------------|
    | image                                  | Docker image for Ray nodes.              | rayproject/ray:2.41.0-py311-gpu |
    | dashboardHost                          | Ray dashboard host.                      | 0.0.0.0                         |
    | objectStoreMemory                      | Object store memory allocation (bytes).  | 1000000000                      |
    | persistentVolumeClaimName              | (Optional) PVC for model storage.        | -                               |
    | headGroup.resources.limits.cpu         | Head node CPU limit.                     | -                               |
    | headGroup.resources.limits.memory      | Head node memory limit.                  | -                               |
    | headGroup.resources.limits.gpuCount    | (Optional) Head node GPU limit.          | -                               |
    | headGroup.affinity.gpuClass            | (Optional) GPU class for head node.      | -                               |
    | headGroup.affinity.region              | (Optional) Region for head node.         | -                               |
    | workerGroup.groupName                  | Worker group name.                       | worker-group                    |
    | workerGroup.replicas                   | Number of worker group replicas.         | 1                               |
    | workerGroup.resources.limits.cpu       | Worker node CPU limit.                   | -                               |
    | workerGroup.resources.limits.memory    | Worker node memory limit.                | -                               |
    | workerGroup.resources.limits.gpuCount  | (Optional) Worker node GPU limit.        | -                               |
    | workerGroup.affinity.gpuClass          | (Optional) GPU class for workers.        | -                               |
    | workerGroup.affinity.region            | (Optional) Region for workers.           | -                               |
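
    For reference, here's a minimal values.yaml sketch assembled from the tables above. The model ID, resource limits, and parallelism settings are illustrative placeholders, and the exact nesting of the keys is defined by the chart's own values.yaml, so treat this as a sketch rather than a drop-in file:

    # Illustrative sketch; key names follow the tables above, values are placeholders.
    inferenceServiceName: inference-service
    inferenceServiceNamespace: default

    name: llama-3
    routePrefix: /
    importPath: lumigator.jobs.vllm.serve:model
    numReplicas: 1
    pip: ["vllm==0.7.2"]
    modelID: meta-llama/Llama-3.1-8B-Instruct  # hypothetical Hugging Face model ID
    servedModelName: llama-3
    tensorParallelism: 1
    pipelineParallelism: 1
    dType: bfloat16
    gpuMemoryUtilization: 0.80
    distributedExecutorBackend: ray
    trustRemoteCode: true

    image: rayproject/ray:2.41.0-py311-gpu
    dashboardHost: 0.0.0.0
    objectStoreMemory: 1000000000

    workerGroup:
      groupName: worker-group
      replicas: 1
      resources:
        limits:
          cpu: "8"       # placeholder
          memory: 32Gi   # placeholder
          gpuCount: 1    # placeholder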

  2. Install the Helm chart:

    user@host:~/lumigator$ helm install inference-service ./infra/helm/inference-service
    

    Note

    Depending on the model you’re trying to deploy, this step may take a while to complete.
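
    If you want to watch the rollout, you can poll the RayService resource that KubeRay creates; the name below is a placeholder, so substitute the inferenceServiceName you configured:

    user@host:~/lumigator$ kubectl get rayservice inference-service -w
    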

Verify

  1. Port-forward the Ray dashboard:

    user@host:~/lumigator$ kubectl port-forward svc/inference-service-ray-dashboard 8265:8265
    

    Navigate to http://localhost:8265 to access the Ray dashboard. Check that the Ray Serve deployment is running and that the GPU resources are allocated correctly.

    Note

    The name of the service may vary depending on the name you’ve chosen for the Ray Service.
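
    With the dashboard port-forwarded, you can also check the application status from the command line via Ray's Serve REST API instead of the UI:

    user@host:~/lumigator$ curl http://localhost:8265/api/serve/applications/
    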

  2. Port-forward the head node of the Ray cluster:

    user@host:~/lumigator$ kubectl port-forward pod/inference-service-ray-head-0 8080:8000
    

    Note

    The name of the pod may vary depending on the name you’ve chosen for the Ray Service.
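
    If you’re not sure of the head pod’s name, you can look it up by label before port-forwarding; this assumes KubeRay’s standard ray.io/node-type label on Ray pods:

    user@host:~/lumigator$ kubectl get pods -l ray.io/node-type=head -o name
    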

  3. Invoke the model:

    user@host:~/lumigator$ curl "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{"model": "<model-name>", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}]}'
    

    The response should be a JSON object containing the model’s chat completion.

    Note

    Replace <model-name> with the name of the model you’ve deployed.
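
    If you’re unsure which name was registered, vLLM’s OpenAI-compatible API also exposes a model listing endpoint that you can query through the same port-forward:

    user@host:~/lumigator$ curl "http://localhost:8080/v1/models"
    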

Conclusion

You’ve successfully deployed a vLLM-supported model using Ray Serve on a Kubernetes cluster. You can now use the model for inference and explore advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.