# vLLM Ray Serve deployment

This guide provides step-by-step instructions to deploy a [vLLM](https://docs.vllm.ai/en/latest/)-supported model using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) on a Kubernetes cluster. The deployment exposes an OpenAI-compatible API for inference and supports advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.

## Prerequisites

- A Kubernetes cluster with [KubeRay](https://github.com/ray-project/kuberay) installed.
- GPU-enabled nodes for efficient inference.

## Procedure

1. Navigate to the `inference-service` Helm Chart under the `infra` directory and configure your deployment using the `values.yaml` file. The available options are listed below; a sample configuration sketch follows the tables.

   Ray Service configuration:

   | Parameter                   | Description                                            | Default   |
   |-----------------------------|--------------------------------------------------------|-----------|
   | `inferenceServiceName`      | Name of the Ray Service (max 63 chars, DNS-compliant). | -         |
   | `inferenceServiceNamespace` | Namespace where the Ray Service is deployed.           | `default` |

   Application configuration:

   | Parameter                    | Description                                         | Default                           |
   |------------------------------|-----------------------------------------------------|-----------------------------------|
   | `name`                       | Application name (e.g., deepseek-v3, llama-3).      | -                                 |
   | `routePrefix`                | Route prefix for serving (e.g., "/").               | `/`                               |
   | `importPath`                 | Import path for the model.                          | `lumigator.jobs.vllm.serve:model` |
   | `numReplicas`                | Number of replicas.                                 | `1`                               |
   | `numCpus`                    | CPUs allocated per Ray actor.                       | -                                 |
   | `workingDir`                 | Model working directory.                            | -                                 |
   | `pip`                        | List of Python packages.                            | `["vllm==0.7.2"]`                 |
   | `modelID`                    | Model path (Hugging Face ID or local directory).    | -                                 |
   | `servedModelName`            | Name of the served model.                           | -                                 |
   | `tensorParallelism`          | Tensor parallelism (GPUs per node).                 | -                                 |
   | `pipelineParallelism`        | Pipeline parallelism (number of nodes).             | -                                 |
   | `dType`                      | Data type (`float32`, `float16`, `bfloat16`, etc.). | -                                 |
   | `gpuMemoryUtilization`       | Fraction of GPU memory allocated for inference.     | `0.80`                            |
   | `distributedExecutorBackend` | Executor backend (`ray` or `mp`).                   | `ray`                             |
   | `trustRemoteCode`            | Allow custom code from Hugging Face Hub.            | `true`                            |

   Ray Cluster configuration:

   | Parameter                                | Description                             | Default                           |
   |------------------------------------------|-----------------------------------------|-----------------------------------|
   | `image`                                  | Docker image for Ray nodes.             | `rayproject/ray:2.41.0-py311-gpu` |
   | `dashboardHost`                          | Ray dashboard host.                     | `0.0.0.0`                         |
   | `objectStoreMemory`                      | Object store memory allocation (bytes). | `1000000000`                      |
   | `persistentVolumeClaimName`              | (Optional) PVC for model storage.       | -                                 |
   | `headGroup.resources.limits.cpu`         | Head node CPU limit.                    | -                                 |
   | `headGroup.resources.limits.memory`      | Head node memory limit.                 | -                                 |
   | `headGroup.resources.limits.gpuCount`    | (Optional) Head node GPU limit.         | -                                 |
   | `headGroup.affinity.gpuClass`            | (Optional) GPU class for head node.     | -                                 |
   | `headGroup.affinity.region`              | (Optional) Region for head node.        | -                                 |
   | `workerGroup.groupName`                  | Worker group name.                      | `worker-group`                    |
   | `workerGroup.replicas`                   | Number of worker group replicas.        | `1`                               |
   | `workerGroup.resources.limits.cpu`       | Worker node CPU limit.                  | -                                 |
   | `workerGroup.resources.limits.memory`    | Worker node memory limit.               | -                                 |
   | `workerGroup.resources.limits.gpuCount`  | (Optional) Worker node GPU limit.       | -                                 |
   | `workerGroup.affinity.gpuClass`          | (Optional) GPU class for workers.       | -                                 |
   | `workerGroup.affinity.region`            | (Optional) Region for workers.          | -                                 |
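   For orientation, here is a sketch of what a filled-in `values.yaml` could look like. The key layout simply mirrors the tables above and should be checked against the chart's shipped `values.yaml`; the model, image, and resource values are placeholder assumptions to adapt to your cluster and the model you want to serve.

   ```yaml
   # Sketch only: verify the exact key nesting against the chart's own values.yaml.
   # All concrete values below are illustrative, not recommended defaults.
   inferenceServiceName: llama-inference
   inferenceServiceNamespace: default

   # Application configuration
   name: llama-3
   routePrefix: /
   importPath: lumigator.jobs.vllm.serve:model
   numReplicas: 1
   numCpus: 8
   workingDir: /models/llama-3
   pip: ["vllm==0.7.2"]
   modelID: meta-llama/Meta-Llama-3-8B-Instruct  # Hugging Face ID or local directory
   servedModelName: llama-3
   tensorParallelism: 1      # GPUs per node
   pipelineParallelism: 1    # number of nodes
   dType: bfloat16
   gpuMemoryUtilization: 0.80
   distributedExecutorBackend: ray
   trustRemoteCode: true

   # Ray Cluster configuration
   image: rayproject/ray:2.41.0-py311-gpu
   dashboardHost: 0.0.0.0
   objectStoreMemory: 1000000000
   headGroup:
     resources:
       limits:
         cpu: "8"
         memory: 32Gi
         gpuCount: 1
   workerGroup:
     groupName: worker-group
     replicas: 1
     resources:
       limits:
         cpu: "8"
         memory: 32Gi
         gpuCount: 1
   ```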
1. Install the Helm Chart:

   ```console
   user@host:~/lumigator$ helm install inference-service ./infra/helm/inference-service
   ```

   ```{note}
   Depending on the model you're trying to deploy, this step may take a while to complete.
   ```

## Verify

1. Port-forward the Ray dashboard:

   ```console
   user@host:~/lumigator$ kubectl port-forward svc/inference-service-ray-dashboard 8265:8265
   ```

   Navigate to `http://localhost:8265` to access the Ray dashboard. Check that the Ray Serve deployment is running and that the GPU resources are allocated correctly.

   ```{note}
   The name of the service may vary depending on the name you've chosen for the Ray Service.
   ```

1. Port-forward the head node of the Ray cluster:

   ```console
   user@host:~/lumigator$ kubectl port-forward pod/inference-service-ray-head-0 8080:8000
   ```

   ```{note}
   The name of the pod may vary depending on the name you've chosen for the Ray Service.
   ```

1. Invoke the model:

   ```console
   user@host:~/lumigator$ curl "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{"model": "", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}]}'
   ```

   The response should be a JSON object with the model's prediction.

   ```{note}
   Set the empty `model` field in the request to the name of the model you've deployed (your `servedModelName`).
   ```

## Conclusion

You've successfully deployed a vLLM-supported model using Ray Serve on a Kubernetes cluster. You can now use the model for inference and explore advanced configurations like tensor parallelism, pipeline parallelism, and LoRA modules.
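Because the API is OpenAI-compatible, any OpenAI-compatible client can call it as well. Below is a minimal sketch using the `openai` Python package; it assumes the head-node port-forward from the Verify section is still active, uses `llama-3` as a placeholder for your `servedModelName`, and passes a dummy API key on the assumption that your deployment does not enforce one.

```python
# Minimal sketch: query the deployed model through its OpenAI-compatible API.
# Assumes `pip install openai` and that localhost:8080 is port-forwarded to the
# Ray head node (port 8000), as in the Verify section.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused",  # placeholder; supply a real key only if your deployment requires one
)

response = client.chat.completions.create(
    model="llama-3",  # replace with your servedModelName
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)
```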