vLLM Ray Serve deployment
This guide provides step-by-step instructions to deploy a vLLM-supported model using Ray Serve on a Kubernetes cluster. The deployment exposes an OpenAI-compatible API for inference and supports advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.
Prerequisites
- A Kubernetes cluster with KubeRay installed.
- GPU-enabled nodes for efficient inference.
Procedure
Navigate to the inference-service Helm Chart under the infra directory and configure your deployment using the values.yaml file. Here's a list of the options (an illustrative values.yaml sketch follows the tables):

Ray Service configuration:
| Parameter | Description | Default |
|---|---|---|
| inferenceServiceName | Name of the Ray Service (max 63 chars, DNS-compliant). | - |
| inferenceServiceNamespace | Namespace where the Ray Service is deployed. | default |

Application configuration:
| Parameter | Description | Default |
|---|---|---|
| name | Application name (e.g., deepseek-v3, llama-3). | - |
| routePrefix | Route prefix for serving (e.g., "/"). | / |
| importPath | Import path for the model. | lumigator.jobs.vllm.serve:model |
| numReplicas | Number of replicas. | 1 |
| numCpus | CPUs allocated per Ray actor. | - |
| workingDir | Model working directory. | - |
| pip | List of Python packages. | ["vllm==0.7.2"] |
| modelID | Model path (Hugging Face ID or local directory). | - |
| servedModelName | Name of the served model. | - |
| tensorParallelism | Tensor parallelism (GPUs per node). | - |
| pipelineParallelism | Pipeline parallelism (number of nodes). | - |
| dType | Data type (float32, float16, bfloat16, etc.). | - |
| gpuMemoryUtilization | Fraction of GPU memory allocated for inference. | 0.80 |
| distributedExecutorBackend | Executor backend (ray or mp). | ray |
| trustRemoteCode | Allow custom code from the Hugging Face Hub. | true |

Ray Cluster configuration:
| Parameter | Description | Default |
|---|---|---|
| image | Docker image for Ray nodes. | rayproject/ray:2.41.0-py311-gpu |
| dashboardHost | Ray dashboard host. | 0.0.0.0 |
| objectStoreMemory | Object store memory allocation (bytes). | 1000000000 |
| persistentVolumeClaimName | (Optional) PVC for model storage. | - |
| headGroup.resources.limits.cpu | Head node CPU limit. | - |
| headGroup.resources.limits.memory | Head node memory limit. | - |
| headGroup.resources.limits.gpuCount | (Optional) Head node GPU limit. | - |
| headGroup.affinity.gpuClass | (Optional) GPU class for the head node. | - |
| headGroup.affinity.region | (Optional) Region for the head node. | - |
| workerGroup.groupName | Worker group name. | worker-group |
| workerGroup.replicas | Number of worker group replicas. | 1 |
| workerGroup.resources.limits.cpu | Worker node CPU limit. | - |
| workerGroup.resources.limits.memory | Worker node memory limit. | - |
| workerGroup.resources.limits.gpuCount | (Optional) Worker node GPU limit. | - |
| workerGroup.affinity.gpuClass | (Optional) GPU class for workers. | - |
| workerGroup.affinity.region | (Optional) Region for workers. | - |
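For illustration, the options above might be combined in values.yaml roughly as follows. This is a sketch only: the key nesting shown here (application, rayCluster, and so on) is assumed, and the model ID and resource figures are placeholders, so treat the chart's own values.yaml as the authoritative schema.

inferenceServiceName: inference-service
inferenceServiceNamespace: default
application:                         # assumed grouping of the application options
  name: llama-3
  routePrefix: /
  importPath: lumigator.jobs.vllm.serve:model
  numReplicas: 1
  pip: ["vllm==0.7.2"]
  modelID: meta-llama/Meta-Llama-3-8B-Instruct   # placeholder model
  servedModelName: llama-3
  tensorParallelism: 1               # GPUs per node
  pipelineParallelism: 1             # number of nodes
  dType: bfloat16
  gpuMemoryUtilization: 0.80
  distributedExecutorBackend: ray
  trustRemoteCode: true
rayCluster:                          # assumed grouping of the cluster options
  image: rayproject/ray:2.41.0-py311-gpu
  dashboardHost: 0.0.0.0
  objectStoreMemory: 1000000000
  workerGroup:
    groupName: worker-group
    replicas: 1
    resources:
      limits:
        cpu: "8"                     # placeholder resource limits
        memory: 32Gi
        gpuCount: 1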
Install the Helm Chart:
user@host:~/lumigator$ helm install inference-service ./infra/helm/inference-service
Note
Depending on the model you're trying to deploy, this step may take a while to complete.
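If you keep your overrides in a separate file, Helm can apply them at install time, and kubectl can be used to watch the deployment come up. The file name my-values.yaml and the default namespace below are placeholders; if the chart deploys a KubeRay RayService resource, the second command shows its status.

user@host:~/lumigator$ helm install inference-service ./infra/helm/inference-service -f my-values.yaml
user@host:~/lumigator$ kubectl get rayservice -n default
user@host:~/lumigator$ kubectl get pods -n default -w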
Verify
Port-forward the Ray dashboard:
user@host:~/lumigator$ kubectl port-forward svc/inference-service-ray-dashboard 8265:8265
Navigate to http://localhost:8265 to access the Ray dashboard. Check that the Ray Serve deployment is running and the GPU resources are allocated correctly.

Note
The name of the service may vary depending on the name you’ve chosen for the Ray Service.
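As an alternative to the UI, recent Ray releases expose a Serve REST API on the dashboard port that reports application status; with the port-forward from the previous step still active, a quick check could look like this:

user@host:~/lumigator$ curl http://localhost:8265/api/serve/applications/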
Port-forward the head node of the Ray cluster:
user@host:~/lumigator$ kubectl port-forward pod/inference-service-ray-head-0 8080:8000
Note
The name of the pod may vary depending on the name you’ve chosen for the Ray Service.
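If the chart deploys a KubeRay RayService resource, KubeRay typically also creates a dedicated Serve service (named after your Ray Service), and port-forwarding that service instead of the head pod works as well. The service name below is illustrative:

user@host:~/lumigator$ kubectl port-forward svc/inference-service-serve-svc 8080:8000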
Invoke the model:
user@host:~/lumigator$ curl "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{"model": "<model-name>", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}]}'
The response should be a JSON object with the model’s prediction.
Note
Replace <model-name> with the name of the model you've deployed.
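If you're not sure which name the model was served under, OpenAI-compatible servers usually also expose a model listing endpoint that you can query through the same port-forward:

user@host:~/lumigator$ curl "http://localhost:8080/v1/models"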
Conclusion
You’ve successfully deployed a vLLM-supported model using Ray Serve on a Kubernetes cluster. You can now use the model for inference and explore advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.