In this second article of a two-part series, I dive into deploying and serving our models at scale. If you missed the first one, about training a fastai model at scale on AI Platform Training, here is the link.
Serving a deep learning model raises several challenges, including:
- scaling resources on instances with or without accelerators (NVIDIA GPUs)
In this article, I explain how I served a deep learning text classifier trained with the fastai library, in two main steps:
- Deploy the fastai model using TorchServe
- Host the serving on GCP AI Platform Prediction
1- Deploy fastai model using TorchServe
TorchServe makes it easy to deploy PyTorch models at scale in production environments, and it removes the heavy lifting of developing your own client-server architecture. Because the fastai library is built on top of PyTorch, we can use TorchServe to serve fastai models too: we only need to load the fastai model as a pure PyTorch object, stripping away the fastai abstractions.
1–1 Export Model Weights from FastAI
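To do that, restore the fastai learner from the exported pickle of the previous post and save its model weights with PyTorch. We also retrieve the vocabulary, the number of classes and the model configuration, which are needed to rebuild the model later.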
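import torch
from fastai.text.all import load_learner, awd_lstm_clas_config
from fastai.text.learner import get_c, _get_text_vocab
learn = load_learner("fastai_cls.pkl")
torch.save(learn.model.state_dict(), "fastai_cls_weights.pth")  # pure PyTorch weights
vocab = _get_text_vocab(dls)  # dls is the DataLoaders you used for training
vocab_sz = len(vocab)
n_class = get_c(dls)
config = awd_lstm_clas_config.copy()  # must match the config used for training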
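The archiving step later in this article also ships a vocab.json file next to the weights. A minimal sketch of how such a file could be written and read back with the standard library, assuming the vocabulary is a plain list of token strings; the helper names save_vocab and load_vocab are hypothetical.

```python
import json
from pathlib import Path


def save_vocab(vocab, path):
    """Write the token list to JSON; the list order encodes each token's id."""
    tokens = [str(tok) for tok in vocab]
    Path(path).write_text(json.dumps(tokens), encoding="utf-8")


def load_vocab(path):
    """Read the token list back; tokens[i] is the token whose id is i."""
    tokens = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(tokens, list):
        raise ValueError("vocab.json must contain a JSON list of tokens")
    return tokens


# Usage, with vocab taken from the training DataLoaders:
# save_vocab(vocab, "vocab.json")
```

Keeping the order intact matters: Numericalize maps each token to its position in the list, so a reordered file would silently change every token id.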
1–2 PyTorch Model from FastAI
Once you have exported your PyTorch weights, you need to rebuild the model structure so you can load the weights into it. You might have to dig a little into the fastai source code to find the right implementation but, luckily, in a Jupyter notebook you can inspect the source code of a function by adding ?? in front of its name.
For a text classifier, you can rebuild the pure PyTorch model with fastai's get_text_classifier function and then load the exported weights into it:
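import torch
from fastai.text.all import AWD_LSTM, get_text_classifier
torch_pure_model = get_text_classifier(AWD_LSTM, vocab_sz, n_class, config=config)
torch_pure_model.load_state_dict(torch.load("fastai_cls_weights.pth", map_location="cpu"))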
1–3 Reproduce fastai preprocessing steps
Once you have your pure PyTorch model, you need to apply the same preprocessing that was used during training. fastai's handy .predict method, applied to a raw text (a plain string), reproduces the training preprocessing automatically and therefore removes the risk of training-serving skew.
pred_fastai = learn.predict(text)  # text is a plain string
>>> (Category tensor(1), tensor(1), tensor([0.0036, 0.9964]))
When serving with TorchServe, we have to take on this responsibility ourselves, since we drop the fastai abstractions and work directly with PyTorch objects.
In my example, I used a spaCy tokenizer, so I reproduced the fastai preprocessing as follows:
import torch
from fastai.text.core import Tokenizer, SpacyTokenizer
from fastai.text.data import Numericalize
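example = "Hello, this is a test."
tokenizer = Tokenizer(SpacyTokenizer())  # use the same tokenizer and rules as during training
numericalizer = Numericalize(vocab=vocab)  # vocab from the training DataLoaders
example_processed = numericalizer(tokenizer(example))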
>>> tensor([ 4, 7, 26, 29, 16, 72, 69, 31])
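inputs = example_processed.unsqueeze(0)  # add a batch dimension: shape (1, seq_len)
torch_pure_model.eval()  # switch off dropout for inference
outputs = torch_pure_model(inputs)[0]  # fastai's classifier returns a tuple; logits come first
preds = torch.softmax(outputs, dim=-1)  # You can use any activation function you need
>>> tensor([[0.0036, 0.9964]], grad_fn=<SoftmaxBackward>)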
As you can see, the probabilities obtained with pure PyTorch match those returned by learn.predict, because the preprocessing steps were reproduced exactly.
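To guard against training-serving skew beyond a single eyeballed example, one option is to compare both pipelines on several texts, for instance in a unit test. A minimal standard-library sketch, assuming both outputs were first converted to plain Python lists (e.g. pred_fastai[2].tolist() and preds[0].tolist()); the function name and the default tolerance are hypothetical choices.

```python
import math


def predictions_match(fastai_probs, torch_probs, abs_tol=1e-4):
    """Return True when two probability vectors agree element-wise within abs_tol."""
    if len(fastai_probs) != len(torch_probs):
        return False
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
        for a, b in zip(fastai_probs, torch_probs)
    )


# Values taken from the two outputs above.
print(predictions_match([0.0036, 0.9964], [0.0036, 0.9964]))  # True
```

Running it over a sample of real inputs is more telling than one sentence, since some tokenizer rules only fire on certain texts.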
1–4 Deploy your model via torchserve
In this section, we deploy the PyTorch model with TorchServe. For installation, please refer to the TorchServe GitHub repository.
Overall, using TorchServe takes three main steps:
- Archive the model into a *.mar file.
- Start TorchServe.
- Call the API and get the response.
To archive the model, at least two files are needed in our case:
- PyTorch model weights fastai_cls_weights.pth.
- TorchServe custom handler.
As shown in /deployment/handler.py, the TorchServe handler receives data and context. In our example, it relies on a helper Python class that implements four instance methods: initialize, preprocess, inference and postprocess.
Now we are ready to set up and launch TorchServe.
TorchServe in Action
Step 1: Archive the PyTorch Model
--extra-files=/home/model-server/config.py,/home/model-server/vocab.json \
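The archiving call can also be scripted from Python. A hedged sketch that builds the torch-model-archiver argument list for subprocess.run; --model-name, --version, --serialized-file, --handler, --extra-files and --export-path are standard torch-model-archiver options, but the model name, version, paths and export directory in the usage comment are assumptions rather than the author's exact command.

```python
def archiver_command(model_name, weights, handler, extra_files, export_path, version="1.0"):
    """Build the torch-model-archiver argument list (all paths are caller-supplied)."""
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", weights,
        "--handler", handler,
        "--extra-files", ",".join(extra_files),  # comma-separated list of files
        "--export-path", export_path,
    ]


# Usage (assumed paths; produces fastai_model.mar in the export directory):
# import subprocess
# subprocess.run(archiver_command(
#     "fastai_model",
#     "/home/model-server/fastai_cls_weights.pth",
#     "/home/model-server/handler.py",
#     ["/home/model-server/config.py", "/home/model-server/vocab.json"],
#     "/home/model-server/model-store",
# ), check=True)
```

Passing a list rather than a single shell string avoids quoting issues with paths and keeps each flag and value explicit.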
Step 2: Serve the Model
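Serving the archive boils down to starting TorchServe with a model store and registering the .mar file, then waiting until the server reports itself healthy. A minimal sketch, assuming the default inference address http://127.0.0.1:8080 and TorchServe's /ping health endpoint; the store directory and archive name in the usage comment are hypothetical.

```python
import json
import time
import urllib.request


def torchserve_start_command(model_store, model_name, mar_file):
    """Build the `torchserve --start` argument list for a single model archive."""
    return [
        "torchserve", "--start",
        "--ncs",  # no config snapshots: start from a clean state
        "--model-store", model_store,
        "--models", f"{model_name}={mar_file}",
    ]


def wait_until_healthy(base_url="http://127.0.0.1:8080", timeout=60.0, interval=1.0):
    """Poll /ping until TorchServe answers {"status": "Healthy"} or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/ping", timeout=interval) as resp:
                if json.load(resp).get("status") == "Healthy":
                    return True
        except (OSError, ValueError):
            pass  # server not reachable yet, or response is not JSON
        time.sleep(interval)
    return False


# Usage (assumed paths):
# import subprocess
# subprocess.run(torchserve_start_command("model-store", "fastai_model", "fastai_model.mar"), check=True)
# assert wait_until_healthy()
```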
Step 3: Call the API and Get the Response (here we use curl).
{ "Tensor": [0.0036, 0.9964] }
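The same call can be made from Python with only the standard library. A minimal sketch, assuming TorchServe listens on its default inference port 8080 and the model is registered as fastai_model (the name used in the local Docker test later in this article); the JSON-list body mirrors the curl examples, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_prediction_request(texts, model_name="fastai_model", base_url="http://127.0.0.1:8080"):
    """Prepare a POST to TorchServe's /predictions/<model_name> route with a JSON body."""
    body = json.dumps(list(texts)).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predictions/{model_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(texts, **kwargs):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_prediction_request(texts, **kwargs), timeout=30) as resp:
        return json.load(resp)


# Usage, with TorchServe running locally:
# print(predict(["this was a bad movie"]))
```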
The first call has a longer latency because the model weights are loaded in initialize; from the second call onward, this overhead is gone.
2- Deployment to AI Platform Prediction
In this section, we deploy the fastai-trained model, served by TorchServe, on GCP AI Platform Prediction using a custom Docker image. For more details about using custom containers on AI Platform Prediction, please refer to this article. Note that custom containers are only available when you use AI Platform Prediction with regional endpoints.
Steps to deploy a fastai model on AI Platform Prediction:
First, create an AI Platform Prediction model on a regional endpoint (REGION is, for example, europe-west1):
--region=REGION \
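If you prefer scripting the gcloud steps, model creation can be expressed as an argument list for subprocess.run. A hedged sketch mirroring the `gcloud beta ai-platform models create` form used in Google's custom-container documentation; the model name and the logging flags are assumptions to adapt to your project.

```python
def create_model_command(model_name, region):
    """Argument list creating an AI Platform Prediction model on a regional endpoint."""
    return [
        "gcloud", "beta", "ai-platform", "models", "create", model_name,
        f"--region={region}",  # custom containers require a regional endpoint
        "--enable-logging",
        "--enable-console-logging",
    ]


# Usage (requires an authenticated gcloud CLI; model name is hypothetical):
# import subprocess
# subprocess.run(create_model_command("fastai_text_cls", "europe-west1"), check=True)
```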
2–1 Build the docker image used by your model version
Create a folder model/ in the root of the repository
Place your fastai model weights in model/text/ and name the file fastai_cls_weights.pth
Create an Artifact Registry repository
--location=REGION  # e.g. europe-west1
Build your docker image
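The repository and image steps can be scripted the same way. A hedged sketch: the image path follows the REGION-docker.pkg.dev/PROJECT_ID/ARTIFACT_REGISTRY_NAME/fastai_text_cls:v0 pattern of the push command in 2–3, while the build context (the repository root, containing the Dockerfile and model/) and the helper names are assumptions.

```python
def image_uri(region, project_id, repository, image="fastai_text_cls", tag="v0"):
    """Artifact Registry path: REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG."""
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


def create_repository_command(repository, region):
    """Argument list creating a Docker-format Artifact Registry repository."""
    return [
        "gcloud", "artifacts", "repositories", "create", repository,
        "--repository-format=docker",
        f"--location={region}",
    ]


def docker_build_command(uri, context_dir="."):
    """Argument list building the serving image, tagged with its registry path."""
    return ["docker", "build", "-t", uri, context_dir]


# Usage (hypothetical project and repository names):
# import subprocess
# uri = image_uri("europe-west1", "my-project", "my-repo")
# subprocess.run(create_repository_command("my-repo", "europe-west1"), check=True)
# subprocess.run(docker_build_command(uri), check=True)
```

Tagging the image with its full registry path at build time means it can be pushed as-is, without a separate docker tag step.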
2–2 (Optional) Check that your docker image runs fine
Run your docker image locally and test it
curl -X POST -H "Content-Type: application/json" -d '["this was a bad movie"]' 127.0.0.1:8080/predictions/fastai_model
{ "Tensor": [0.0036, 0.9964] }
2–3 Push your docker image to a container registry in your GCP project
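You need the appropriate IAM permissions to push to Artifact Registry, and Docker must be configured to authenticate against it (for example with gcloud auth configure-docker REGION-docker.pkg.dev). Once that is in place, run the following: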
docker push REGION-docker.pkg.dev/PROJECT_ID/ARTIFACT_REGISTRY_NAME/fastai_text_cls:v0
2–4 Create a model version using your docker image
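Creating the version is where the container is wired to AI Platform: it needs the image, the port TorchServe listens on, a health route and a prediction route. A hedged sketch following the `gcloud beta ai-platform versions create` flags shown in Google's custom-container documentation; the machine type, the version and model names, and the TorchServe model name fastai_model are assumptions.

```python
def create_version_command(version, model, region, image,
                           model_name="fastai_model", machine_type="n1-standard-4"):
    """Argument list creating a model version served by a custom TorchServe container."""
    return [
        "gcloud", "beta", "ai-platform", "versions", "create", version,
        f"--region={region}",
        f"--model={model}",
        f"--machine-type={machine_type}",
        f"--image={image}",
        "--ports=8080",  # TorchServe's default inference port
        "--health-route=/ping",  # TorchServe health endpoint
        f"--predict-route=/predictions/{model_name}",
    ]


# Usage (hypothetical names):
# import subprocess
# image = "europe-west1-docker.pkg.dev/my-project/my-repo/fastai_text_cls:v0"
# subprocess.run(create_version_command("v0", "fastai_text_cls", "europe-west1", image), check=True)
```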
2–5 Test your model version
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '["this was a bad movie"]'
{ "Tensor": [0.0036, 0.9964] }
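The request to the deployed version can also be sent from Python. A minimal sketch, assuming the regional REST endpoint format https://REGION-ml.googleapis.com/v1/projects/PROJECT_ID/models/MODEL/versions/VERSION:predict and an access token obtained through the gcloud CLI, as in the curl call; the body and headers mirror that call, and the helper names are hypothetical.

```python
import json
import subprocess
import urllib.request


def regional_predict_url(region, project_id, model, version):
    """REST endpoint for online prediction on an AI Platform regional endpoint."""
    return (f"https://{region}-ml.googleapis.com/v1/projects/{project_id}"
            f"/models/{model}/versions/{version}:predict")


def build_request(url, texts, access_token):
    """POST request with the same JSON body and headers as the curl call."""
    return urllib.request.Request(
        url,
        data=json.dumps(list(texts)).encode("utf-8"),
        headers={"Authorization": f"Bearer {access_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def gcloud_access_token():
    """Fetch a token the same way the curl example does (needs the gcloud CLI)."""
    result = subprocess.run(["gcloud", "auth", "print-access-token"],
                            check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Usage:
# url = regional_predict_url("europe-west1", "PROJECT_ID", "MODEL_NAME", "VERSION_NAME")
# req = build_request(url, ["this was a bad movie"], gcloud_access_token())
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```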
Your fastai model is now deployed in a serverless architecture on AI Platform Prediction. You can make online predictions by sending requests to its REST API; all the ways to request predictions are described in the Google documentation.
Using AI Platform Prediction to serve any type of model can be very useful. This article aimed to show how to take a deep learning model built with a heavy framework (PyTorch) and serve it in a cost-effective way.
Some limitations to keep in mind:
- Even with autoscaling, AI Platform models deployed on regional endpoints cannot scale down to 0 instances. Since regional endpoints are the only option for custom containers, you will always have at least one instance running.
- We also explored custom prediction routines instead of custom containers, but they require the model and packaged code to stay under a 500 MB size limit, which was not achievable in our case.