vLLM Ray Serve deployment
This guide provides step-by-step instructions to deploy a vLLM-supported model using Ray Serve on a Kubernetes cluster. The deployment exposes an OpenAI-compatible API for inference and supports advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.
Prerequisites
- A Kubernetes cluster with KubeRay installed.
- GPU-enabled nodes for efficient inference (a quick check follows this list).
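Before deploying, you can confirm that the cluster actually advertises GPU resources. This is an optional check that assumes NVIDIA GPUs exposed by the device plugin as the `nvidia.com/gpu` resource:

```console
# Assumes the NVIDIA device plugin; adjust the resource name for other GPU vendors.
user@host:~/lumigator$ kubectl describe nodes | grep -i "nvidia.com/gpu"
```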
Procedure
Navigate to the `inference-service` Helm Chart under the `infra` directory and configure your deployment using the `values.yaml` file. The available options are listed below; a sketch of a possible `values.yaml` follows the tables.

Ray Service configuration:
| Parameter | Description | Default |
|---|---|---|
| `inferenceServiceName` | Name of the Ray Service (max 63 characters, DNS-compliant). | - |
| `inferenceServiceNamespace` | Namespace where the Ray Service is deployed. | `default` |
Application configuration:
| Parameter | Description | Default |
|---|---|---|
| `name` | Application name (e.g., `deepseek-v3`, `llama-3`). | - |
| `routePrefix` | Route prefix for serving (e.g., `/`). | `/` |
| `importPath` | Import path for the model. | `lumigator.jobs.vllm.serve:model` |
| `numReplicas` | Number of replicas. | `1` |
| `numCpus` | CPUs allocated per Ray actor. | - |
| `workingDir` | Model working directory. | - |
| `pip` | List of Python packages. | `["vllm==0.7.2"]` |
| `modelID` | Model path (Hugging Face ID or local directory). | - |
| `servedModelName` | Name of the served model. | - |
| `tensorParallelism` | Tensor parallelism (GPUs per node). | - |
| `pipelineParallelism` | Pipeline parallelism (number of nodes). | - |
| `dType` | Data type (`float32`, `float16`, `bfloat16`, etc.). | - |
| `gpuMemoryUtilization` | Fraction of GPU memory allocated for inference. | `0.80` |
| `distributedExecutorBackend` | Executor backend (`ray` or `mp`). | `ray` |
| `trustRemoteCode` | Allow custom code from Hugging Face Hub. | `true` |
Ray Cluster configuration:
| Parameter | Description | Default |
|---|---|---|
| `image` | Docker image for Ray nodes. | `rayproject/ray:2.41.0-py311-gpu` |
| `dashboardHost` | Ray dashboard host. | `0.0.0.0` |
| `objectStoreMemory` | Object store memory allocation (bytes). | `1000000000` |
| `persistentVolumeClaimName` | (Optional) PVC for model storage. | - |
| `headGroup.resources.limits.cpu` | Head node CPU limit. | - |
| `headGroup.resources.limits.memory` | Head node memory limit. | - |
| `headGroup.resources.limits.gpuCount` | (Optional) Head node GPU limit. | - |
| `headGroup.affinity.gpuClass` | (Optional) GPU class for head node. | - |
| `headGroup.affinity.region` | (Optional) Region for head node. | - |
| `workerGroup.groupName` | Worker group name. | `worker-group` |
| `workerGroup.replicas` | Number of worker group replicas. | `1` |
| `workerGroup.resources.limits.cpu` | Worker node CPU limit. | - |
| `workerGroup.resources.limits.memory` | Worker node memory limit. | - |
| `workerGroup.resources.limits.gpuCount` | (Optional) Worker node GPU limit. | - |
| `workerGroup.affinity.gpuClass` | (Optional) GPU class for workers. | - |
| `workerGroup.affinity.region` | (Optional) Region for workers. | - |
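To see how these options fit together, here is a minimal, hypothetical `values.yaml` sketch for a single-replica deployment. The grouping of keys (`application:`, `rayCluster:`) and the example model ID are assumptions for illustration; the chart's own `values.yaml` is the authoritative reference for the exact structure.

```yaml
# Hypothetical sketch only -- confirm the exact structure against the chart's values.yaml.
inferenceServiceName: inference-service
inferenceServiceNamespace: default

application:                      # assumed grouping for the application options above
  name: llama-3
  routePrefix: /
  importPath: lumigator.jobs.vllm.serve:model
  numReplicas: 1
  numCpus: 4
  pip:
    - vllm==0.7.2
  modelID: meta-llama/Meta-Llama-3-8B-Instruct   # example Hugging Face ID
  servedModelName: llama-3
  tensorParallelism: 1
  pipelineParallelism: 1
  dType: bfloat16
  gpuMemoryUtilization: 0.80
  distributedExecutorBackend: ray
  trustRemoteCode: true

rayCluster:                       # assumed grouping for the cluster options above
  image: rayproject/ray:2.41.0-py311-gpu
  dashboardHost: "0.0.0.0"
  objectStoreMemory: 1000000000
  workerGroup:
    groupName: worker-group
    replicas: 1
    resources:
      limits:
        cpu: "8"
        memory: 32Gi
        gpuCount: 1
```

As described in the tables above, set `tensorParallelism` to the number of GPUs per node and `pipelineParallelism` to the number of nodes when a model does not fit on a single GPU.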
Install the Helm Chart:

```console
user@host:~/lumigator$ helm install inference-service ./infra/helm/inference-service
```
Note
Depending on the model you're trying to deploy, this step may take a while to complete.
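If you prefer to keep overrides outside the chart, or to deploy into a dedicated namespace, the usual Helm flags apply; the namespace and override file names below are illustrative:

```console
# "inference" and my-values.yaml are placeholders for your own namespace and override file.
user@host:~/lumigator$ helm install inference-service ./infra/helm/inference-service \
    --namespace inference --create-namespace \
    --values my-values.yaml
```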
Verify
Port-forward the Ray dashboard:

```console
user@host:~/lumigator$ kubectl port-forward svc/inference-service-ray-dashboard 8265:8265
```
Navigate to http://localhost:8265 to access the Ray dashboard. Check that the Ray Serve deployment is running and that GPU resources are allocated correctly.

Note

The name of the service may vary depending on the name you've chosen for the Ray Service.
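You can also check the RayService and its pods from the command line before opening the dashboard; the resource and namespace names depend on the `inferenceServiceName` and namespace you configured:

```console
# Resource and namespace names below are examples; substitute your own.
user@host:~/lumigator$ kubectl get rayservice -n default
user@host:~/lumigator$ kubectl get pods -n default
```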
Port-forward the head node of the Ray cluster:

```console
user@host:~/lumigator$ kubectl port-forward pod/inference-service-ray-head-0 8080:8000
```
Note
The name of the pod may vary depending on the name you’ve chosen for the Ray Service.
Invoke the model:

```console
user@host:~/lumigator$ curl "http://localhost:8080/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "<model-name>",
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
          ]
        }'
```
The response should be a JSON object with the model’s prediction.
Note

Replace `<model-name>` with the name of the model you've deployed (typically the `servedModelName` value you configured).
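If the deployed Serve application also exposes the standard OpenAI model-listing route (vLLM's OpenAI-compatible server does, but whether it is routed here depends on the application), you can use it to confirm the exact model name to pass in the request:

```console
# Only works if the /v1/models route is exposed by the deployed application.
user@host:~/lumigator$ curl "http://localhost:8080/v1/models"
```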
Conclusion
You’ve successfully deployed a vLLM-supported model using Ray Serve on a Kubernetes cluster. You can now use the model for inference and explore advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.