This article is the third part of a series in which we go through the process of logging models using Mlflow, serving them on Kubernetes Engine and finally scaling them up according to our application needs. Although this article can be used on its own to load test any API, we recommend reading our two previous articles (part 1 and part 2) on how to deploy a tracking instance and serve a model as an API with Mlflow. In what follows, we focus on the scalability question and address it with a few experiments to understand the behavior of a k8s cluster and give recommendations on how to handle high loads.

Part 3 – How to handle high loads and make our application scalable?

Introduction

In a classic scenario where a machine learning model is deployed behind an application or a product, multiple users could interact with it simultaneously to generate predictions. Therefore, it is essential to analyze our infrastructure capabilities and dimension it accordingly. This becomes particularly interesting as far as Kubernetes is concerned, as it can impact decisions such as whether or not to use autoscaling, the maximum number of nodes to provision, and so on.

In this context, load tests allow us to simulate multiple simultaneous or incrementally increasing numbers of requests and monitor the infrastructure’s behavior (response time, CPU usage, memory usage, etc.) in order to correctly dimension resources and avoid bottlenecks. These tests will be performed here using a tool called Locust.

Environment preparation

The requirements for this hands-on are detailed in the first article of this series. As a summary, here are the main elements we need specifically for this part, assuming that our model is already deployed as an API on a Kubernetes cluster (mlflow-k8s).

For this part of the hands-on, we will need:

  • A GKE cluster to deploy Locust (here we will name it load_testing)
  • A configured local workstation (gcloud, kubectl)
  • The following environment variable exported

    export GCR_REPO=eu.gcr.io/mlflow-on-k8s/repo
  • The repository where the hands-on code lives

Deployment

1. Build the Locust Docker image and push it to GCR

cd mlflow-serving-example
docker build --tag ${GCR_REPO}/locust-tasks:v1 --file dockerfile_locust .
docker push ${GCR_REPO}/locust-tasks:v1

2. Prepare the test task

Tasks are Python functions that Locust executes on its workers as part of the load test. In the example code provided under locust-tasks/tasks.py, we simply send a POST request to the API with a data row to get predictions.
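As a reference, here is a minimal sketch of what such a task file can look like. The class name, dataset path and payload format are illustrative assumptions, and the exact JSON schema expected by the /invocations endpoint depends on the Mlflow version.

import pandas as pd
from locust import HttpUser, task, between


class ServingUser(HttpUser):
    wait_time = between(1, 2)  # pause between two tasks of a simulated user

    def on_start(self):
        # Executed once per simulated user: load the test dataset.
        # "data/test_sample.csv" is a placeholder for the dataset used in this series.
        self.data = pd.read_csv("data/test_sample.csv")

    @task
    def post_metrics(self):
        # Send one random row to the model scoring endpoint.
        row = self.data.sample(1)
        payload = {"columns": row.columns.tolist(), "data": row.values.tolist()}
        self.client.post(
            "/invocations",
            json=payload,
            headers={"Content-Type": "application/json; format=pandas-split"},
        )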

In this code snippet:

  • on_start: executed only once, when a simulated user is started, to download the dataset.

  • post_metrics: the core of our testing task; here we have a single function that sends one row to the /invocations endpoint.

We can create as many task functions as tests we want to perform. For example, we can add one that sends data batches. We can also use the @task() decorator to give different priorities to the tasks, as sketched below.
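For illustration, a hypothetical variant with two weighted tasks could look like this; the argument passed to @task controls how often each task is picked relative to the others:

from locust import HttpUser, task, between


class WeightedServingUser(HttpUser):
    wait_time = between(1, 2)

    @task(3)
    def post_single_row(self):
        # Chosen on average 3 times out of 4: send a single data row.
        ...

    @task(1)
    def post_batch(self):
        # Chosen on average 1 time out of 4: send a batch of rows.
        ...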

3. Deploy to Kubernetes

Now it’s time to deploy the image and run Locust on its dedicated cluster. First, make sure that the context is set to the load_testing cluster by running:

kubectl config get-contexts
kubectl config use-context NAME

Next, we can update our deployment file deployments/locust_load_test.yaml by specifying the image path on GCR and pointing TARGET_HOST to the API address.

kind: ReplicationController
apiVersion: v1
metadata:
  name: locust-master
  labels:
    name: locust
    role: master
spec:
  replicas: 1
  selector:
    name: locust
    role: master
  template:
    metadata:
      labels:
        name: locust
        role: master
    spec:
      containers:
        - name: locust
          image: GCR_REPO/locust-tasks:v1 # Change here
          env:
            - name: LOCUST_MODE
              value: master
            - name: TARGET_HOST
              value: 'http://SERVING_IP:SERVING_PORT' # Change here
          ports:
            - name: loc-master-web
              containerPort: 8089
              protocol: TCP
            - name: loc-master-p1
              containerPort: 5557
              protocol: TCP
            - name: loc-master-p2
              containerPort: 5558
              protocol: TCP
---
kind: ReplicationController
apiVersion: v1
metadata:
  name: locust-worker
  labels:
    name: locust
    role: worker
spec:
  replicas: 30
  selector:
    name: locust
    role: worker
  template:
    metadata:
      labels:
        name: locust
        role: worker
    spec:
      containers:
        - name: locust
          image: GCR_REPO/locust-tasks:v1 # Change here
          env:
            - name: LOCUST_MODE
              value: worker
            - name: LOCUST_MASTER
              value: locust-master
            - name: TARGET_HOST
              value: 'http://SERVING_IP:SERVING_PORT' # Change here
---
kind: Service
apiVersion: v1
metadata:
  name: locust-master
  labels:
    name: locust
    role: master
spec:
  ports:
    - port: 8089
      targetPort: loc-master-web
      protocol: TCP
      name: loc-master-web
    - port: 5557
      targetPort: loc-master-p1
      protocol: TCP
      name: loc-master-p1
    - port: 5558
      targetPort: loc-master-p2
      protocol: TCP
      name: loc-master-p2
  selector:
    name: locust
    role: master
  type: LoadBalancer

Finally, let’s deploy it using the following command.

kubectl create -f deployments/locust_load_test.yaml

The Locust instance should now be up and a new load balancer should have been created. We can find its IP by typing kubectl get services and access the web interface at LoadBalancerIP:8089.
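For instance, assuming the service keeps the locust-master name from the manifest above, the external IP can be extracted directly once it has been assigned:

kubectl get service locust-master -o jsonpath='{.status.loadBalancer.ingress[0].ip}'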

Experimentation

The idea is to use Locust to simulate parallel queries on our serving API and analyze the cluster’s behavior and response time (median in green, 95th percentile in orange). This is done for educational purposes, to highlight two features that Kubernetes offers: horizontal and vertical (auto)scaling.

1. Manual scaling

In the first experiment, we try to understand the effect of having more pods serving our model. We start with one pod and progressively increase the number of requests. In the graph below, we can distinguish four phases with different configurations and load levels.

As a general takeaway, we can see that it’s important to always monitor resource metrics (CPU, RAM, etc.) to identify bottlenecks and configuration issues. In our case, having just one pod didn’t allow us to take advantage of the available processing power. So, when deploying an application, it’s essential to set a suitable number of pods and enough resources per pod to maximize machine usage, taking into consideration the system services running in the background. We also recommend not pushing the nodes’ CPU usage above 80–90%.
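To make the per-pod resources explicit, requests and limits can be declared on the serving containers. The snippet below is only a sketch: the deployment and container names, the image placeholder and the CPU/memory values are assumptions to adapt to your own nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-serving
spec:
  replicas: 4
  selector:
    matchLabels:
      app: mlflow-serving
  template:
    metadata:
      labels:
        app: mlflow-serving
    spec:
      containers:
        - name: mlflow-serving
          image: SERVING_IMAGE # placeholder
          resources:
            requests:
              cpu: "500m"    # guaranteed share used by the scheduler to place the pod
              memory: "512Mi"
            limits:
              cpu: "1"       # hard cap, keeps one pod from starving the node
              memory: "1Gi"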

2. Horizontal auto-scaling

Fortunately, Kubernetes has a horizontal auto-scaling feature that automatically monitors CPU usage and creates new pods when necessary to distribute the load. It can be activated simply with the following command.

kubectl autoscale deployment mlflow-serving --cpu-percent=80 --min=1 --max=12
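The same autoscaler can also be declared as a manifest and applied with kubectl apply -f. Here is a minimal sketch equivalent to the command above, assuming the serving deployment from the previous articles is named mlflow-serving:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: mlflow-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mlflow-serving
  minReplicas: 1
  maxReplicas: 12
  targetCPUUtilizationPercentage: 80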

We can then monitor the number and state of the pods using kubectl get hpa mlflow-serving, and analyze the cluster’s response time and resource consumption.
The goal of the following experiment is to observe how Kubernetes can automatically add pods to optimize resource usage and improve response time. We can split this experiment into three phases, as shown in the graph below.

In this second experiment, we noticed that horizontal auto-scaling enabled us to decrease the response time by creating new pods and allocating more cluster resources. However, when reaching the cluster’s capacity (phase 3), new pods remain in a pending state and the response time increases again.

3. Vertical auto-scaling

In such a situation, we can explore another Kubernetes feature known as vertical auto-scaling, which consists in allocating more nodes whenever needed (on GKE, this is handled by the cluster autoscaler at the node-pool level). This feature can be activated with the following command, specifying the minimum and maximum number of nodes Kubernetes can allocate.

gcloud container clusters update mlflow-k8s \
  --enable-autoscaling --min-nodes 3 --max-nodes 5 --node-pool POOL_NAME

Finally, in this last experiment, summarized in the graph below, enabling the vertical auto-scaling feature allowed Kubernetes to automatically add two new nodes and create new pods to dispatch the load and ensure a lower response time. It took around 1 min for Kubernetes to detect the need and create the resources (phase 2). Moreover, with a lower load (phase 3), Kubernetes managed to free up the two new nodes by killing pods and scaled the cluster back down to its minimum of three nodes in around 15 min.

4. Cluster size estimation

Now that we have understood how Kubernetes behaves in response to different load levels using the vertical and horizontal auto-scaling features, the final step is to run performance tests with different resources, taking into consideration the requirements of our application and an estimate of its number of users. Let’s imagine that, to meet our SLA requirements, our 95th percentile response time should stay under 1 sec. In this case, we can plot the graph below, showing the API response time for different numbers of cores, and get an idea of the performance of our application under different conditions.

In particular, for our ML model served with Mlflow, we can handle around 120 simultaneous users on a 12-core Kubernetes cluster while guaranteeing a response time under 1 sec.

Conclusion

In this series of articles, we went through the whole process of deploying an Mlflow tracking instance and serving a model as an API on Kubernetes Engine, taking advantage of its ability to easily scale up and handle high loads. We also experimented with two interesting features that Kubernetes offers, horizontal and vertical auto-scaling, and showed that it’s always worthwhile to monitor our resources to make sure we are using them efficiently. Finally, we showed how to test our application and make infrastructure decisions based on its response to different test scenarios.

This article was initially published on Medium.com.