30 March 2021
In this second article of a two-part series, I dive into deploying and serving our models at scale. If you missed the first one, about training a fastai model at scale on AI Platform Training, here is the link.


Serving a deep learning model raises several challenges, including:

  • scaling resources on instances with or without accelerators (NVIDIA GPUs)

  • cost efficiency

In this article, I will explain how I served a deep learning text classifier trained with the FastAI library following 2 main steps:

  • Deploy fastai model using TorchServe

  • Host serving on GCP AI Platform Prediction

All materials can be found in the GitHub repository. This repository was inspired by another project, which deploys a fastai image classifier on an AWS SageMaker Inference Endpoint (here).

1- Deploy fastai model using TorchServe

TorchServe makes it easy to deploy PyTorch models at scale in production environments and removes the heavy lifting of developing your own client-server architecture. Because the FastAI library is built on top of PyTorch, you can serve a fastai model with TorchServe by loading it as a pure PyTorch object, that is, by stripping away the fastai abstraction.

1–1 Export Model Weights from FastAI

To do that, you need to restore the FastAI learner from the export pickle from the last post, and save its model weights with PyTorch.

import torch
from fastai.text.all import load_learner, awd_lstm_clas_config
from fastai.text.learner import get_c, _get_text_vocab

# Restore the learner exported in the previous post
learn = load_learner("fastai_cls.pkl")

# dls is the DataLoaders object you used for training
vocab_sz = len(_get_text_vocab(dls))
n_class = get_c(dls)
config = awd_lstm_clas_config.copy()

# Save only the PyTorch weights of the underlying model
torch.save(learn.model.state_dict(), "fastai_cls_weights.pth")

1–2 PyTorch Model from FastAI

Once you’ve exported your PyTorch weights, you need to rebuild the model architecture so that the weights can be loaded into it. You might have to dig a little into the fastai source code to find the right implementation; luckily, in a Jupyter notebook you can inspect the source of a function by adding ?? in front of its name.

For a text classifier, you can build a pure PyTorch model with the fastai get_text_classifier function:

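from fastai.text.learner import get_text_classifier
from fastai.text.all import AWD_LSTM
# Rebuild the architecture, then load the exported weights into it
torch_pure_model = get_text_classifier(AWD_LSTM, vocab_sz, n_class, config=config)
torch_pure_model.load_state_dict(torch.load("fastai_cls_weights.pth", map_location="cpu"))
torch_pure_model.eval()  # inference mode: disables dropout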

1–3 Reproduce fastai preprocessing steps

Once you have your pure PyTorch model, you need to apply the same preprocessing that was used for training. FastAI has a very handy .predict method that takes a plain string, reproduces the training preprocessing automatically, and therefore removes the risk of training-serving skew.

text = "This was a very good movie"
pred_fastai = learn.predict(text)
>>> (Category tensor(1), tensor(1), tensor([0.0036, 0.9964]))

In our case, we have to take this responsibility ourselves, since we need to get rid of fastai abstraction and work directly with PyTorch objects.

In my example, I used a spaCy tokenizer, so I reproduced the fastai preprocessing as shown below:

import torch
from fastai.text.core import Tokenizer, SpacyTokenizer
from fastai.text.data import Numericalize

example = "Hello, this is a test."
# Assumption: English spaCy tokenizer with fastai's default rules
tokenizer = Tokenizer(tok=SpacyTokenizer("en"))
# vocab is the vocabulary of the DataLoaders used for training
numericalizer = Numericalize(vocab=vocab)
example_processed = numericalizer(tokenizer(example))
>>> tensor([ 4, 7, 26, 29, 16, 72, 69, 31])

# Add a batch dimension: (seq_len,) -> (1, seq_len)
inputs = example_processed.unsqueeze(0)
outputs = torch_pure_model.forward(inputs)[0]
preds = torch.softmax(outputs, dim=-1)  # You can use any activation function you need
>>> tensor([[0.0036, 0.9964]], grad_fn=<SoftmaxBackward>)

As you can see, the results obtained with plain torch functions match those of learn.predict, because the preprocessing steps are the same.
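To make the two operations behind this preprocessing concrete without depending on fastai, here is a minimal pure-Python sketch: mapping tokens to vocabulary ids, then turning raw scores into probabilities and comparing two prediction vectors. The toy vocabulary, the whitespace tokenizer and every helper name are illustrative assumptions; fastai's real tokenizer applies extra rules and special tokens that this sketch ignores.

```python
import math

# Hypothetical toy vocabulary; a real one comes from the training DataLoaders.
TOY_VOCAB = ["xxunk", "xxpad", "this", "was", "a", "good", "bad", "movie"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB)}
UNK_ID = TOKEN_TO_ID["xxunk"]


def numericalize(text):
    """Lowercase, split on whitespace, and map each token to its vocabulary id."""
    return [TOKEN_TO_ID.get(tok, UNK_ID) for tok in text.lower().split()]


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def same_predictions(a, b, tol=1e-4):
    """True when two probability vectors agree element-wise within tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b)
    )


# "great" is not in the toy vocabulary, so it falls back to the unknown id.
ids = numericalize("This was a great movie")
probs = softmax([-2.0, 3.6])
print(ids, probs)
```

A check like same_predictions is a cheap way to confirm, on a handful of sample texts, that the serving path and learn.predict still agree after any change to the preprocessing.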

1–4 Deploy your model via torchserve

In this section we deploy the PyTorch model to TorchServe. For installation, please refer to the TorchServe GitHub repository.
Overall, there are 3 main steps to use TorchServe:

  1. Archive the model into *.mar.
  2. Start TorchServe.
  3. Call the API and get the response.

In order to archive the model, at least 2 files are needed in our case:

  1. PyTorch model weights fastai_cls_weights.pth.
  2. TorchServe custom handler.

Custom Handler

As shown in /deployment/handler.py, the TorchServe handler accepts data and context. In our example, we define a helper Python class with 4 instance methods: initialize, preprocess, inference and postprocess.
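The layout of such a handler can be sketched as follows. This is not the code from /deployment/handler.py: the class name, the toy vocabulary and the token-counting "model" are hypothetical stand-ins so that the example runs without torch or TorchServe installed. Only the overall shape (a module-level handle(data, context) entry point delegating to the four methods) follows the usual TorchServe custom-handler pattern.

```python
import json
import math


class ToyTextHandler:
    """Hypothetical stand-in that mirrors the four-method handler layout."""

    def __init__(self):
        self.initialized = False
        self.vocab = {}

    def initialize(self, context):
        # A real handler would use `context` to locate the model directory,
        # then load vocab.json and the .pth weights. Here: a toy vocabulary.
        self.vocab = {"xxunk": 0, "good": 1, "bad": 2, "movie": 3}
        self.initialized = True

    def preprocess(self, data):
        # TorchServe hands over a list of requests; each payload sits under
        # the "body" or "data" key and may be bytes, str, or parsed JSON.
        batch = []
        for row in data:
            payload = row.get("body") or row.get("data")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            if isinstance(payload, str):
                payload = json.loads(payload)
            # The article's requests carry a one-element JSON list of text.
            text = payload[0] if isinstance(payload, list) else payload
            batch.append([self.vocab.get(tok, 0) for tok in text.lower().split()])
        return batch

    def inference(self, batch):
        # Placeholder for the model forward pass: [count of "bad", count of "good"].
        return [[float(ids.count(2)), float(ids.count(1))] for ids in batch]

    def postprocess(self, scores):
        # Softmax each score pair; return one JSON-serializable dict per request.
        results = []
        for row in scores:
            highest = max(row)
            exps = [math.exp(s - highest) for s in row]
            total = sum(exps)
            probs = [e / total for e in exps]
            results.append(
                {"Categories": str(probs.index(max(probs))), "Tensor": probs}
            )
        return results


_service = ToyTextHandler()


def handle(data, context):
    """Module-level entry point called for every request batch."""
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    batch = _service.preprocess(data)
    return _service.postprocess(_service.inference(batch))


result = handle([{"body": ["this was a bad movie"]}], None)
print(result)
```

Loading everything in initialize and guarding it with the initialized flag means the expensive setup happens once per worker rather than once per request.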

Everything is now ready to set up and launch TorchServe.

TorchServe in Action

Step 1: Archive the PyTorch model
# Assumption: --export-path writes the archive into the model_store directory used in Step 2
torch-model-archiver \
--model-name=fastai_model \
--version=1.0 \
--serialized-file=/home/model-server/fastai_cls_weights.pth \
--extra-files=/home/model-server/config.py,/home/model-server/vocab.json \
--handler=/home/model-server/handler.py \
--export-path=model_store
Step 2: Serve the Model
torchserve --start --ncs --model-store model_store --models fastai_model.mar
Step 3: Call API and Get the Response (here we use curl).
# Assumes TorchServe's default inference address (port 8080)
curl -X POST -H "Content-Type: application/json" -d '["this was a bad movie"]' http://localhost:8080/predictions/fastai_model
{ "Categories": "1",
"Tensor": [0.0036, 0.9964] }

The first call has a higher latency because the model weights are loaded in initialize; subsequent calls do not pay that cost.
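The same call can be made from Python with the standard library only. The sketch below assumes TorchServe is listening on its default inference address (localhost, port 8080) and that the model was registered as fastai_model; the function names are illustrative. Building the request is separated from sending it so that the payload can be inspected without a running server.

```python
import json
import urllib.request

# Assumed default TorchServe inference address plus the model name from Step 1.
TORCHSERVE_URL = "http://localhost:8080/predictions/fastai_model"


def build_prediction_request(texts, url=TORCHSERVE_URL):
    """Build a JSON POST request carrying a list of texts."""
    body = json.dumps(texts).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(texts, url=TORCHSERVE_URL, timeout=30):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_prediction_request(texts, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_prediction_request(["this was a bad movie"])
print(request.get_method(), request.full_url, request.data)
```

Calling predict(["this was a bad movie"]) once right after startup is also a simple way to warm the worker up before real traffic arrives.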

2- Deployment to AI Platform Prediction

In this section we deploy the FastAI trained model with TorchServe in GCP AI Platform Prediction using a customized Docker image. For more details about GCP AI Platform Prediction routines using custom containers please refer to this article. Note that this option is only available if you use AI Platform Prediction with regional endpoints.

Steps to deploy a fastai model on AI Platform Prediction:

First, create an AI Platform Prediction model on a regional endpoint:

# MODEL_NAME, e.g. fastai_text_clf; REGION, e.g. europe-west1
gcloud beta ai-platform models create MODEL_NAME \
--region=REGION \
--enable-logging

2–1 Build your docker image that will be used by your version

  • Create a folder model/ in the root of the repository

  • Place your fastai model weights in model/text/ and name the file fastai_cls_weights.pth

  • Create an artifact repository

# ARTIFACT_REGISTRY_NAME, e.g. getting-started-fastai; REGION, e.g. europe-west1
gcloud beta artifacts repositories create ARTIFACT_REGISTRY_NAME \
--repository-format=docker \
--location=REGION

  • Build your docker image

docker build -f TextDockerfile -t REGION-docker.pkg.dev/PROJECT_ID/ARTIFACT_REGISTRY_NAME/fastai_text_cls:v0 .

2–2 (Optional) Check that your docker image runs fine

  • Run your docker image locally and test it

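docker run -it -p 8080:8080 REGION-docker.pkg.dev/PROJECT_ID/ARTIFACT_REGISTRY_NAME/fastai_text_cls:v0
# Assumes the container exposes TorchServe's standard /predictions/<model_name> route on port 8080
curl -X POST -H "Content-Type: application/json" -d '["this was a bad movie"]' http://localhost:8080/predictions/fastai_model
{ "Categories": "1",
"Tensor": [0.0036, 0.9964] }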

2–3 Push your docker image to a container registry in your GCP project

You need the appropriate IAM permissions to do that. Once you’ve ensured you have them, run the following:

gcloud auth configure-docker REGION-docker.pkg.dev
docker push REGION-docker.pkg.dev/PROJECT_ID/ARTIFACT_REGISTRY_NAME/fastai_text_cls:v0

2–4 Create a model version using your docker image

gcloud beta ai-platform versions create VERSION_NAME \

2–5 Test your model version

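curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '["this was a bad movie"]' \
https://REGION-ml.googleapis.com/v1/projects/PROJECT_ID/models/MODEL_NAME/versions/VERSION_NAME:predict
{ "Categories": "1",
"Tensor": [0.0036, 0.9964] }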

Your fastai model is now deployed in a serverless architecture on AI Platform Prediction. You can make online predictions by sending requests to your model through its REST API. All the ways to request predictions are described in the Google documentation.
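For programmatic access, the request can be assembled in Python. This is a minimal sketch with hypothetical helper names: it builds the regional endpoint URL in the REGION-ml.googleapis.com form used by AI Platform Prediction and attaches a bearer token. The token is passed in as a plain string (for instance the output of gcloud auth print-access-token), and the example values are placeholders.

```python
import json
import urllib.request


def regional_predict_url(project_id, region, model_name, version_name):
    """URL of the predict method for a model version on a regional endpoint."""
    return (
        f"https://{region}-ml.googleapis.com/v1/projects/{project_id}"
        f"/models/{model_name}/versions/{version_name}:predict"
    )


def build_authorized_request(url, texts, access_token):
    """JSON POST request carrying the texts and an OAuth bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(texts).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace them with your own project settings.
url = regional_predict_url("my-project", "europe-west1", "fastai_text_clf", "v0")
request = build_authorized_request(url, ["this was a bad movie"], "ACCESS_TOKEN")
print(request.full_url)
```

Sending it is then a matter of passing request to urllib.request.urlopen and decoding the JSON body of the response.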


Using AI Platform Prediction to serve any type of model can be very useful. This article aimed to show how to take a deep learning model built with a heavy framework (PyTorch) and serve it in a cost-effective way.

Some limitations to keep in mind:

  • Even with autoscaling, it is not possible to scale down to 0 instances when you use AI Platform models deployed on regional endpoints. Since regional endpoints are the only option for custom containers, you’ll always have at least one instance up.

  • Another option we explored was to use custom routines rather than custom containers, but this is only possible if your model and packaged code stay below the 500 MB size limit, which was not achievable in our case.

You can find more about us and our projects on our Medium blog
