vLLM Ray Serve deployment
This guide provides step-by-step instructions to deploy a vLLM-supported model using Ray Serve on a Kubernetes cluster. The deployment exposes an OpenAI-compatible API for inference and supports advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.
Prerequisites
- A Kubernetes cluster with KubeRay installed.
- GPU-enabled nodes for efficient inference (a quick check follows this list).
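Before deploying, you can confirm that the cluster actually advertises GPU resources. This is an optional check that assumes NVIDIA GPUs exposed by the device plugin as the `nvidia.com/gpu` resource:

```console
# Assumes the NVIDIA device plugin; adjust the resource name for other GPU vendors.
user@host:~/lumigator$ kubectl describe nodes | grep -i "nvidia.com/gpu"
```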
Procedure
Navigate to the `inference-service` Helm Chart under the `infra` directory and configure your deployment using the `values.yaml` file. The available options are listed below; a sketch of a possible `values.yaml` follows the tables.

Ray Service configuration:
| Parameter | Description | Default |
|---|---|---|
| `inferenceServiceName` | Name of the Ray Service (max 63 characters, DNS-compliant). | - |
| `inferenceServiceNamespace` | Namespace where the Ray Service is deployed. | `default` |
Application configuration:
| Parameter | Description | Default |
|---|---|---|
| `name` | Application name (e.g., `deepseek-v3`, `llama-3`). | - |
| `routePrefix` | Route prefix for serving (e.g., `/`). | `/` |
| `importPath` | Import path for the model. | `lumigator.jobs.vllm.serve:model` |
| `numReplicas` | Number of replicas. | `1` |
| `numCpus` | CPUs allocated per Ray actor. | - |
| `workingDir` | Model working directory. | - |
| `pip` | List of Python packages. | `["vllm==0.7.2"]` |
| `modelID` | Model path (Hugging Face ID or local directory). | - |
| `servedModelName` | Name of the served model. | - |
| `tensorParallelism` | Tensor parallelism (GPUs per node). | - |
| `pipelineParallelism` | Pipeline parallelism (number of nodes). | - |
| `dType` | Data type (`float32`, `float16`, `bfloat16`, etc.). | - |
| `gpuMemoryUtilization` | Fraction of GPU memory allocated for inference. | `0.80` |
| `distributedExecutorBackend` | Executor backend (`ray` or `mp`). | `ray` |
| `trustRemoteCode` | Allow custom code from Hugging Face Hub. | `true` |
Ray Cluster configuration:
| Parameter | Description | Default |
|---|---|---|
| `image` | Docker image for Ray nodes. | `rayproject/ray:2.41.0-py311-gpu` |
| `dashboardHost` | Ray dashboard host. | `0.0.0.0` |
| `objectStoreMemory` | Object store memory allocation (bytes). | `1000000000` |
| `persistentVolumeClaimName` | (Optional) PVC for model storage. | - |
| `headGroup.resources.limits.cpu` | Head node CPU limit. | - |
| `headGroup.resources.limits.memory` | Head node memory limit. | - |
| `headGroup.resources.limits.gpuCount` | (Optional) Head node GPU limit. | - |
| `headGroup.affinity.gpuClass` | (Optional) GPU class for head node. | - |
| `headGroup.affinity.region` | (Optional) Region for head node. | - |
| `workerGroup.groupName` | Worker group name. | `worker-group` |
| `workerGroup.replicas` | Number of worker group replicas. | `1` |
| `workerGroup.resources.limits.cpu` | Worker node CPU limit. | - |
| `workerGroup.resources.limits.memory` | Worker node memory limit. | - |
| `workerGroup.resources.limits.gpuCount` | (Optional) Worker node GPU limit. | - |
| `workerGroup.affinity.gpuClass` | (Optional) GPU class for workers. | - |
| `workerGroup.affinity.region` | (Optional) Region for workers. | - |
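To see how these options fit together, here is a minimal, hypothetical `values.yaml` sketch for a single-replica deployment. The grouping of keys (`application:`, `rayCluster:`) and the example model ID are assumptions for illustration; the chart's own `values.yaml` is the authoritative reference for the exact structure.

```yaml
# Hypothetical sketch only -- confirm the exact structure against the chart's values.yaml.
inferenceServiceName: inference-service
inferenceServiceNamespace: default

application:                      # assumed grouping for the application options above
  name: llama-3
  routePrefix: /
  importPath: lumigator.jobs.vllm.serve:model
  numReplicas: 1
  numCpus: 4
  pip:
    - vllm==0.7.2
  modelID: meta-llama/Meta-Llama-3-8B-Instruct   # example Hugging Face ID
  servedModelName: llama-3
  tensorParallelism: 1
  pipelineParallelism: 1
  dType: bfloat16
  gpuMemoryUtilization: 0.80
  distributedExecutorBackend: ray
  trustRemoteCode: true

rayCluster:                       # assumed grouping for the cluster options above
  image: rayproject/ray:2.41.0-py311-gpu
  dashboardHost: "0.0.0.0"
  objectStoreMemory: 1000000000
  workerGroup:
    groupName: worker-group
    replicas: 1
    resources:
      limits:
        cpu: "8"
        memory: 32Gi
        gpuCount: 1
```

As described in the tables above, set `tensorParallelism` to the number of GPUs per node and `pipelineParallelism` to the number of nodes when a model does not fit on a single GPU.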
Install the Helm Chart:

```console
user@host:~/lumigator$ helm install inference-service ./infra/helm/inference-service
```
Note
Depending on the model you're trying to deploy, this step may take a while to complete.
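If you prefer to keep overrides outside the chart, or to deploy into a dedicated namespace, the usual Helm flags apply; the namespace and override file names below are illustrative:

```console
# "inference" and my-values.yaml are placeholders for your own namespace and override file.
user@host:~/lumigator$ helm install inference-service ./infra/helm/inference-service \
    --namespace inference --create-namespace \
    --values my-values.yaml
```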
Verify
Port-forward the Ray dashboard:

```console
user@host:~/lumigator$ kubectl port-forward svc/inference-service-ray-dashboard 8265:8265
```
Navigate to http://localhost:8265 to access the Ray dashboard. Check that the Ray Serve deployment is running and that GPU resources are allocated correctly.

Note

The name of the service may vary depending on the name you've chosen for the Ray Service.
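You can also check the RayService and its pods from the command line before opening the dashboard; the resource and namespace names depend on the `inferenceServiceName` and namespace you configured:

```console
# Resource and namespace names below are examples; substitute your own.
user@host:~/lumigator$ kubectl get rayservice -n default
user@host:~/lumigator$ kubectl get pods -n default
```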
Port-forward the head node of the Ray cluster:

```console
user@host:~/lumigator$ kubectl port-forward pod/inference-service-ray-head-0 8080:8000
```
Note
The name of the pod may vary depending on the name you’ve chosen for the Ray Service.
Invoke the model:

```console
user@host:~/lumigator$ curl "http://localhost:8080/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "<model-name>",
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
          ]
        }'
```
The response should be a JSON object with the model’s prediction.
Note

Replace `<model-name>` with the name of the model you've deployed (typically the `servedModelName` value you configured).
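If the deployed Serve application also exposes the standard OpenAI model-listing route (vLLM's OpenAI-compatible server does, but whether it is routed here depends on the application), you can use it to confirm the exact model name to pass in the request:

```console
# Only works if the /v1/models route is exposed by the deployed application.
user@host:~/lumigator$ curl "http://localhost:8080/v1/models"
```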
Conclusion
You’ve successfully deployed a vLLM-supported model using Ray Serve on a Kubernetes cluster. You can now use the model for inference and explore advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.