Ray Serve is a framework for deploying and serving machine learning and large language model (LLM) inference workloads. It is designed to scale, to support complex multi-model workflows, and to make efficient use of costly resources such as GPUs. Phi-3, released by Microsoft, is a capable small language model that is optimized to run on CPUs and in low-memory environments.
Learning how to use Ray Serve to deploy a large language model will benefit anyone who works with machine learning models and wants to deploy them in a production environment.
In this hands-on lab, you will use a development environment to implement a Ray Serve deployment, and you will run your deployment on a virtual machine.
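To give a sense of what you will build, here is a minimal sketch of a Ray Serve deployment that serves a Phi-3 model. It assumes the Hugging Face transformers library and the microsoft/Phi-3-mini-4k-instruct checkpoint; the deployment you implement in the lab may load and invoke the model differently.

```python
from ray import serve
from starlette.requests import Request
from transformers import pipeline


# A deployment is a Python class decorated with @serve.deployment; Ray Serve
# runs one or more replicas of it behind an HTTP endpoint.
@serve.deployment(num_replicas=1)
class PhiDeployment:
    def __init__(self):
        # Load the model once per replica; device=-1 keeps inference on the CPU.
        self.generator = pipeline(
            "text-generation",
            model="microsoft/Phi-3-mini-4k-instruct",
            device=-1,
        )

    async def __call__(self, request: Request) -> dict:
        # Expect a JSON body such as {"prompt": "..."}.
        body = await request.json()
        output = self.generator(body["prompt"], max_new_tokens=128)
        return {"response": output[0]["generated_text"]}


# Bind the deployment into an application and run it on the local Ray cluster.
app = PhiDeployment.bind()
serve.run(app)
```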
Learning objectives
Upon completion of this beginner-level lab, you will be able to:
- Implement a Ray Serve deployment that allows you to interact with a large language model
- Test your deployment on a virtual machine (a sample request is sketched after this list)
- Deploy your Ray Serve deployment to a Ray cluster
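Once a deployment like the sketch above is running, you can test it from the virtual machine by sending an HTTP request to Serve's endpoint, which listens on port 8000 by default. This example assumes the JSON request shape used in the sketch:

```python
import requests

# Serve exposes HTTP applications on port 8000 by default.
resp = requests.post(
    "http://localhost:8000/",
    json={"prompt": "Explain Ray Serve in one sentence."},
)
print(resp.json()["response"])
```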
Intended audience
- Anyone looking to learn about deploying machine learning models
- Cloud Architects
- Data Engineers
- DevOps Engineers
- Machine Learning Engineers
- Software Engineers
Prerequisites
Familiarity with the following will be beneficial but is not required:
- Large Language Models
- The Python programming language
Andrew is a Labs Developer with previous experience in the internet service provider, audio streaming, and cryptocurrency industries. He has also been a DevOps Engineer and enjoys working with CI/CD and Kubernetes.
He holds multiple AWS certifications, including Solutions Architect Associate and Professional.