Author

Kais Laribi

Senior Data Scientist at Artefact


This article is the second part of a series in which we go through the process of logging models with MLflow, serving them as an API endpoint, and finally scaling them up according to our application needs. We encourage you to read our previous article, in which we show how to deploy a tracking instance on Kubernetes (k8s), and to check the hands-on prerequisites (secrets, environment variables…), as we will continue to build upon them here.
In the following, we show how to serve a machine learning model that is already registered in MLflow and expose it as an API endpoint on k8s.

Part 2: How to serve a model as an API on Kubernetes?

Introduction

Tracking and optimizing models’ performance is an important part of creating ML models. Once that is done, the next challenge is to integrate them into an application or a product in order to use their predictions. This is what we call model serving, or inference. There are different frameworks and techniques that allow us to do it. Here we focus on MLflow and show how efficient and straightforward it can be.

Build and deploy the serving image

The different configuration files used here are part of the hands-on repository. Basically, we need to:

1. Prepare the MLflow serving Docker image and push it to the container registry on GCP.

cd mlflow-serving-example
docker build --tag ${GCR_REPO}/mlflow_serving:v1 --file docker_mlflow_serving .
docker push ${GCR_REPO}/mlflow_serving:v1

2. Prepare the Kubernetes deployment file

Modify the container section so that it points to the Docker image previously pushed to GCR, the model path, and the serving port.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-serving
  labels:
    app: serve-ML-model-mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-serving
  template:
    metadata:
      labels:
        app: mlflow-serving
    spec:
      containers:
        - name: mlflow-serving
          image: GCR_REPO/mlflow_serving:v1 #change here
          env:
            - name: MODEL_URI
              value: "gs://../artifacts/../../artifacts/.." #change here
            - name: SERVING_PORT
              value: "8082"
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: "/etc/secrets/keyfile.json"
          volumeMounts:
            - name: gcsfs-creds
              mountPath: "/etc/secrets"
              readOnly: true
          resources:
            requests:
              cpu: "1000m"
      volumes:
        - name: gcsfs-creds
          secret:
            secretName: gcsfs-creds
            items:
              - key: keyfile.json
                path: keyfile.json
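
The deployment file passes MODEL_URI and SERVING_PORT to the container as environment variables. The Dockerfile itself lives in the hands-on repository and is not reproduced here, so the Python sketch below only illustrates how an entrypoint could turn those variables into an `mlflow models serve` command. The names `build_serve_command` and `main`, as well as the choice of host, are assumptions made for illustration and are not taken from the repository.

```python
import os
import subprocess


def build_serve_command(env):
    """Build an `mlflow models serve` command from the deployment's env vars."""
    model_uri = env["MODEL_URI"]            # set in the deployment file
    port = env.get("SERVING_PORT", "8082")  # same port the deployment declares
    return [
        "mlflow", "models", "serve",
        "--model-uri", model_uri,
        "--host", "0.0.0.0",  # assumption: listen on all interfaces inside the pod
        "--port", str(port),
    ]


def main():
    # Hypothetical entrypoint: requires mlflow to be installed in the image.
    subprocess.run(build_serve_command(os.environ), check=True)
```

Calling `main()` as the container's entrypoint would start the scoring server on the port that the deployment and the later `kubectl expose` command both refer to.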

3. Run deployment commands

kubectl create -f deployments/mlflow-serving/mlflow_serving.yaml

4. Expose the deployment for external access

The following command creates a new Service resource of type LoadBalancer that redirects external traffic to our API.

kubectl expose deployment mlflow-serving --port 8082 --type="LoadBalancer"

5. Check the deployment & query the endpoint

If the deployment is successful, mlflow-serving should be up and one pod should be available. You can check this by running kubectl get pods.

The final step is to run kubectl get services to find the external IP address assigned to the load balancer that redirects traffic to our API container, and then test the response to a few queries.
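
If you prefer to read the external IP programmatically rather than from the kubectl table, a minimal Python sketch could look like the following. It assumes that kubectl is on the PATH and configured for the cluster, and that the Service is named mlflow-serving (the default name given by the `kubectl expose` command above). The helper names are hypothetical.

```python
import json
import subprocess


def extract_external_ip(service):
    """Return the load balancer IP from a Service object, or None if still pending."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        if entry.get("ip"):
            return entry["ip"]
    return None


def get_external_ip(service_name="mlflow-serving"):
    """Ask kubectl for the Service as JSON and extract its external IP."""
    output = subprocess.run(
        ["kubectl", "get", "service", service_name, "--output", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return extract_external_ip(json.loads(output))
```

`get_external_ip()` returns None while the load balancer is still being provisioned, so it may need to be called again after a short wait.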

Example code for these queries can be found in the accompanying notebook, in which we load a few rows of data, select features, convert them to JSON format, and send them in a POST request to the API.
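
The notebook is not reproduced here, but the query steps can be sketched with the Python standard library alone. The sketch assumes the MLflow scoring server's `/invocations` route and a split-oriented JSON payload. MLflow 2.x expects that payload wrapped under a `dataframe_split` key, while 1.x accepted the bare split dictionary, so pass `wrap_key=None` for older servers. The function names and the feature columns are placeholders, not the notebook's actual code.

```python
import json
import urllib.request


def build_payload(columns, rows, wrap_key="dataframe_split"):
    """Serialize feature rows as split-oriented JSON for the scoring server."""
    split = {"columns": list(columns), "data": [list(row) for row in rows]}
    body = {wrap_key: split} if wrap_key else split
    return json.dumps(body).encode("utf-8")


def query_model(base_url, payload, timeout=10.0):
    """POST the payload to the /invocations route and return the parsed response."""
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/invocations",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the external IP of the load balancer at hand, a call would take the form `query_model("http://<EXTERNAL_IP>:8082", build_payload(columns, rows))`, where `columns` and `rows` hold the selected features.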
Combining the steps from our previous article and this one, our final architecture looks like the following:

Conclusion

In this article, we showed that we can easily deploy machine learning models as an API endpoint using the MLflow serving module.

As you may have noticed, our current deployment creates only one pod to serve the model. Although this works well for small applications where we don’t expect many parallel queries, it can quickly show its limits elsewhere, since a single pod has limited resources. Moreover, with a single pod the application cannot use the computing power of more than one node. In the next and last article of this series, we will address this scalability issue. We will first highlight the bottlenecks and then try to solve them in order to get a scalable application that takes advantage of the power of our Kubernetes cluster.

This article was initially published on Medium.com.

Serving ML models at scale using Mlflow on Kubernetes – Part 2