
This article is a guide to deploying the Stable Diffusion model, a popular image generation model, on Google Cloud using Vertex AI. It covers setup and weights download, serving the model with TorchServe, deploying the TorchServe server on a Vertex AI endpoint, and automatically saving the image history on Google Cloud Storage (GCS).


Stable Diffusion is an image generation model. It was open-sourced in 2022 and has gained popularity for its ability to generate high-quality images from text descriptions. Like other text-to-image models such as DALL-E, Stable Diffusion takes a text prompt as input and generates a matching image.

Moët Hennessy, the wine and spirits division of luxury conglomerate LVMH, manages a portfolio of more than 26 iconic brands such as Moët & Chandon, Hennessy, and Veuve Clicquot. Moët Hennessy has collaborated with Artefact to investigate the potential uses of cutting-edge technology in marketing content generation. With a focus on privacy and security, the team decided to explore the deployment of Stable Diffusion on Google Cloud Platform (GCP) to allow Moët Hennessy to fine-tune and run the model within their own infrastructure, providing a seamless experience from model fine-tuning to API exposure.

Before you begin, note that this article assumes prior knowledge of Google Cloud Platform (GCP), and specifically of Vertex AI concepts such as the model registry and Vertex endpoints. Some steps also require prior experience with Docker. If you are not familiar with these concepts, we recommend getting acquainted with them before proceeding.
To download the Stable Diffusion weights, you also need a Hugging Face account. If you don't have one already, you can easily create one on the Hugging Face website.
With that being said, let's begin!

Setup and weights download

I recommend cloning the GitHub repository I prepared to follow the steps of this article.

It is good practice to create a virtual environment before installing the packages. I use Anaconda to create one, then install every dependency listed in requirements.txt:

conda create -n stable_diffusion --no-default-packages python=3.8 -y
conda activate stable_diffusion
pip install -r src/requirements.txt

You are now ready to download the Stable Diffusion weights; in this article we use Stable Diffusion 1.5. You first need to accept the license on the model page, otherwise you will get errors when downloading the model weights.
Then create an access token: go to your account → Settings → Access Tokens → New token (read access).

You can add the token as an environment variable. I recommend storing it in a .env file and loading it with the python-dotenv library.
Then navigate to `src/stable_diffusion` and run:


This will download the weights inside
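For reference, loading the token and downloading the weights can be sketched in Python with python-dotenv and huggingface_hub. This is not the repository's script: the environment variable name `HUGGINGFACE_TOKEN` and the target folder `model` are hypothetical, and `runwayml/stable-diffusion-v1-5` is assumed to be the Hub repository for Stable Diffusion 1.5 as it was named when this article was written. The `token` and `local_dir` arguments of `snapshot_download` require a recent huggingface_hub version.

```python
import os
from pathlib import Path

# Assumptions: hypothetical env var name and the SD 1.5 repo id at the time of writing.
TOKEN_ENV_VAR = "HUGGINGFACE_TOKEN"
REPO_ID = "runwayml/stable-diffusion-v1-5"


def load_hf_token(env_var: str = TOKEN_ENV_VAR) -> str:
    """Read the Hugging Face token, loading a local .env file first if python-dotenv is installed."""
    try:
        from dotenv import load_dotenv  # pip install python-dotenv
    except ImportError:
        load_dotenv = None
    if load_dotenv is not None:
        load_dotenv()  # does not overwrite variables already set in the environment
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} in your environment or in a .env file")
    return token


def download_weights(target_dir: Path, token: str, repo_id: str = REPO_ID) -> Path:
    """Download every file of the model repository into target_dir."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return Path(snapshot_download(repo_id=repo_id, local_dir=target_dir, token=token))


if __name__ == "__main__":
    download_weights(Path("model"), load_hf_token())
```

Keeping the token out of the code and in a .env file means it never ends up in the Git repository or the Docker build context by accident, as long as .env is ignored.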


TorchServe framework

TorchServe is a framework for serving PyTorch models in production. It exposes your models through REST APIs for inference and management, and provides features such as model versioning and multi-model serving, so that you can focus on your models rather than on the serving infrastructure.

1. Creating a handler in the TorchServe format

A custom handler is a Python class that defines how to pre-process input data, how to run the model, and how to post-process the output. To create a custom handler for your model, you will need to create a Python class that follows the TorchServe format.
A custom handler for Stable Diffusion is already provided in the TorchServe repository. However, Vertex endpoints expect requests in a specific format (a JSON body of the form {"instances": [...]}), so we need to adapt the handler's preprocess() method to accept it. You can use the modified version of the handler named `` provided in the GitHub repository of this article.
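To make the handler structure concrete, here is a minimal sketch of a class-level TorchServe handler (TorchServe accepts any class exposing `initialize` and `handle`) whose `preprocess()` accepts both a plain-text body and a Vertex-style `{"instances": [{"data": ...}]}` body. It is not the handler from the repository: the class name `VertexPromptHandler` is hypothetical, the Stable Diffusion pipeline is replaced by a stand-in so the logic runs without a GPU, and it assumes one prompt per request. Shaping the response for Vertex is left out.

```python
import json
from typing import Any, Callable, Dict, List, Optional


class VertexPromptHandler:
    """Hypothetical class-level TorchServe handler: TorchServe calls initialize() then handle()."""

    def __init__(self) -> None:
        self.initialized = False
        self.pipeline: Optional[Callable[[List[str]], List[Any]]] = None

    def initialize(self, context: Any) -> None:
        # A real handler would load the Stable Diffusion pipeline here, for example from
        # context.system_properties["model_dir"]. This stand-in only echoes the prompts.
        self.pipeline = lambda prompts: [f"<image for: {p}>" for p in prompts]
        self.initialized = True

    def preprocess(self, requests: List[Dict[str, Any]]) -> List[str]:
        """Return one prompt per request, from a plain-text body or a Vertex-style body."""
        prompts: List[str] = []
        for request in requests:
            payload = request.get("data") or request.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            if isinstance(payload, str):
                try:
                    parsed = json.loads(payload)
                except json.JSONDecodeError:
                    parsed = None
                if not isinstance(parsed, dict):
                    prompts.append(payload)  # plain-text prompt, as in the local test
                    continue
                payload = parsed
            if isinstance(payload, dict) and "instances" in payload:
                instances = payload["instances"]
                if len(instances) != 1:
                    raise ValueError("this sketch expects exactly one instance per request")
                prompts.append(instances[0]["data"])
            elif isinstance(payload, dict) and "data" in payload:
                prompts.append(payload["data"])  # a single instance, e.g. {"data": "..."}
            else:
                raise ValueError(f"unsupported request payload: {payload!r}")
        return prompts

    def inference(self, prompts: List[str]) -> List[Any]:
        return self.pipeline(prompts)

    def postprocess(self, outputs: List[Any]) -> List[Any]:
        # TorchServe expects exactly one response item per request in the batch.
        return outputs

    def handle(self, data: List[Dict[str, Any]], context: Any) -> List[Any]:
        if not self.initialized:
            self.initialize(context)
        return self.postprocess(self.inference(self.preprocess(data)))
```

Accepting both shapes lets the same handler serve the plain-text requests used for local testing and the JSON bodies sent through a Vertex endpoint.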

2. Creating the .mar file

Once you have created your custom handler, you need to package it, along with the model itself and any dependencies, into a .mar file. This is done with torch-model-archiver, a command-line tool that bundles your model, handler, and dependencies into a single file.

The command below creates a .mar file named stable-diffusion.mar (the archive is named after --model-name) inside the export path; it contains your model, handler, and dependencies.

You might need to edit the paths based on where the files are on your machine:

torch-model-archiver \
--model-name stable-diffusion \
--version 1.0 \
--handler stable_diffusion/ \
--export-path stable_diffusion/model-store \
--extra-files stable_diffusion/external_files
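A .mar file produced with the default archive format is a zip archive whose `MAR-INF/MANIFEST.json` records the model name, handler, and other metadata. Before starting TorchServe, you can sanity-check the archive with this standard-library sketch; the path in the `__main__` block assumes the `--export-path` and `--model-name` values above, and archives built with a non-zip format (such as `no-archive`) are not covered.

```python
import json
import zipfile
from pathlib import Path
from typing import Any, Dict, List


def read_mar_manifest(mar_path: Path) -> Dict[str, Any]:
    """Return the parsed MAR-INF/MANIFEST.json of a .mar archive (a zip file)."""
    with zipfile.ZipFile(mar_path) as archive:
        return json.loads(archive.read("MAR-INF/MANIFEST.json"))


def list_mar_files(mar_path: Path) -> List[str]:
    """List every file packaged in the archive (handler, extra files, weights)."""
    with zipfile.ZipFile(mar_path) as archive:
        return archive.namelist()


if __name__ == "__main__":
    mar = Path("stable_diffusion/model-store/stable-diffusion.mar")
    model_info = read_mar_manifest(mar)["model"]
    print(model_info.get("modelName"), model_info.get("handler"))
    for name in list_mar_files(mar):
        print(name)
```

Listing the archive is a quick way to confirm that the weights and extra files were actually packaged before spending minutes waiting for a worker to fail.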

3. Running your TorchServe server

Once you have created your .mar file, start the TorchServe server with the torchserve command:

torchserve \
--start \
— \
--models=stable-diffusion.mar \
--model-store=stable_diffusion/model-store

The let you specify the configuration of your TorchServe server, such as the inference port, health checks, and the number of workers.
WARNING: All the scripts need to be run from the directory where they are located to avoid path errors.
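TorchServe reads server settings from a config.properties file passed with `torchserve --ts-config`. The sketch below writes a minimal one from Python; only the 7080 inference port comes from this article, while the management and metrics ports and the worker count are illustrative assumptions.

```python
from pathlib import Path
from typing import Dict

# Assumed values: 7080 matches the inference port used in this article;
# the other ports and the worker count are illustrative choices.
TORCHSERVE_CONFIG: Dict[str, str] = {
    "inference_address": "http://0.0.0.0:7080",
    "management_address": "http://0.0.0.0:7081",
    "metrics_address": "http://0.0.0.0:7082",
    "default_workers_per_model": "1",  # each worker loads its own copy of the model
}


def render_properties(config: Dict[str, str]) -> str:
    """Render a dict as key=value lines, the format of config.properties."""
    return "".join(f"{key}={value}\n" for key, value in config.items())


def write_config(path: Path = Path("config.properties"), config: Dict[str, str] = TORCHSERVE_CONFIG) -> Path:
    path.write_text(render_properties(config))
    return path


if __name__ == "__main__":
    print(f"Pass this file to TorchServe with: --ts-config {write_config()}")
```

Keeping a single worker matters for Stable Diffusion: every extra worker loads another copy of the model into GPU memory.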

Run the TorchServe server locally

It is important to test the code locally before dockerizing the deployment. I have already prepared a bash script that creates the .mar file and starts the TorchServe server.

You can run it with:


You need to wait a few minutes for the server to initialize and for the worker to load the model. You can start running inference once you see the following log:

2023-01-05T15:34:52,842 [DEBUG] W-9000-stable-diffusion_1.0 \
org.pytorch.serve.wlm.WorkerThread - W-9000-stable-diffusion_1.0 \
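Instead of watching the logs, you can poll TorchServe's health-check API (GET /ping on the inference port) from Python. This is a minimal standard-library sketch; it assumes the inference port is 7080, as in the request below, and that /ping reports "Healthy" once the worker is ready, which is worth confirming against your TorchServe version.

```python
import json
import time
import urllib.request


def wait_until_healthy(base_url: str = "http://localhost:7080", timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll TorchServe's GET /ping until it reports "Healthy"; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/ping", timeout=10) as response:
                body = json.load(response)
            if isinstance(body, dict) and body.get("status") == "Healthy":
                return True
        except (OSError, ValueError):
            pass  # server not reachable yet, an HTTP error status, or a non-JSON body
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_healthy() else "TorchServe did not become healthy in time")
```

The same check works unchanged against the Docker container later on, since it listens on the same port.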

You can then use the following code to make inference requests to your model:

import requests

prompt = "a photo of an astronaut riding a horse on mars"
URL = "http://localhost:7080/predictions/stable-diffusion"
response = requests.post(URL, data=prompt)

You can check that the server received your request by looking at the server logs:

2023-01-05T15:35:43,765 [INFO ] W-9000-stable-diffusion_1.0-stdout \
MODEL_LOG - Backend received inference at: 1672929343

Stable Diffusion needs a GPU to run in a reasonable time, so on a machine without one you won't get any output for the moment. You can stop the TorchServe server with `torchserve --stop`.

Dockerize TorchServe

Your TorchServe server works locally. To deploy Stable Diffusion on Vertex AI, you now need to dockerize it, that is, create a Docker image containing the model, the custom handler, and all the necessary dependencies. The Dockerfile simply reproduces all the steps above.
Luckily, it is already prepared and ready to use here: Dockerfile.

It's important to run the container locally to check that it works properly. I build it locally, but you can also build it with Cloud Build and pull it to your machine.

Build the image locally (you need Docker daemon running):

docker build -t serve_sd .

The image build takes 20 to 30 minutes. It is long because the model weights must be copied into the image before being packaged by the model archiver.

You can run a Docker container listening on port 7080 with:

docker run -p 7080:7080 serve_sd

To check that everything works, wait until the worker has loaded the model, then run the same inference code as before, since the container also listens on port 7080.
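You can also send the request body in the shape Vertex AI uses, {"instances": [...]}, to exercise the Vertex-specific preprocess() path before deploying. This sketch assumes that Vertex forwards that body unchanged to the container's predict route and that the handler reads the prompt from the "data" key, as in the endpoint query code used later in this article; the `vertex_request_body` helper is hypothetical.

```python
from typing import Any, Dict

URL = "http://localhost:7080/predictions/stable-diffusion"


def vertex_request_body(prompt: str) -> Dict[str, Any]:
    """Build the JSON body a Vertex AI endpoint would forward to the container."""
    return {"instances": [{"data": prompt}]}


if __name__ == "__main__":
    import requests  # same library as the local test above

    body = vertex_request_body("a photo of an astronaut riding a horse on mars")
    # Generation can take a while, especially without a GPU, hence the generous timeout.
    response = requests.post(URL, json=body, timeout=600)
    print(response.status_code)
```

Catching a format mismatch here costs a few seconds, whereas discovering it after the roughly one-hour Vertex deployment is much more expensive.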

Deployment on Vertex AI

Now that the Dockerfile is ready and working, we need to:

  • Build the image with Cloud Build and push it to Google Container Registry (GCR)
  • Upload our custom model, which uses this image, to the Vertex AI Model Registry
  • Create a Vertex AI endpoint
  • Attach (deploy) the model to the endpoint

This is exactly what the bash script does; you can run it with:


The deployment takes around 1 hour with a good internet connection:

  • Sending the 8 GB of model weights to Cloud Build can take from a few minutes to several hours, depending on your internet speed
  • Building the image takes around 20 minutes
  • Uploading the model takes around 5 minutes
  • Creating the endpoint takes around 5 minutes
  • Attaching the model to the endpoint takes 30 to 40 minutes
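If you prefer Python to the bash script, the upload, endpoint creation, and deployment steps can be sketched with the Vertex AI SDK (google-cloud-aiplatform). This is not the article's script: the project, region, image URI, machine type, and display names are hypothetical placeholders, and the health route /ping, predict route /predictions/stable-diffusion, and port 7080 are assumptions matching the local TorchServe setup. Building the image with Cloud Build is not covered.

```python
from typing import Any, Dict

# Hypothetical placeholders: replace with your own values.
PROJECT_NAME = "my-gcp-project"
REGION = "europe-west4"
IMAGE_URI = f"gcr.io/{PROJECT_NAME}/serve_sd:latest"
MODEL_NAME = "stable-diffusion"
PORT = 7080  # assumed to match the TorchServe inference port used locally


def model_upload_kwargs(image_uri: str, model_name: str = MODEL_NAME, port: int = PORT) -> Dict[str, Any]:
    """Arguments for aiplatform.Model.upload that route Vertex traffic to TorchServe."""
    return {
        "display_name": model_name,
        "serving_container_image_uri": image_uri,
        "serving_container_predict_route": f"/predictions/{model_name}",
        "serving_container_health_route": "/ping",
        "serving_container_ports": [port],
    }


def deploy(accelerator_type: str = "NVIDIA_TESLA_T4"):
    """Upload the model, create an endpoint, and deploy the model on one GPU replica."""
    from google.cloud import aiplatform as aip  # pip install google-cloud-aiplatform

    aip.init(project=PROJECT_NAME, location=REGION)
    model = aip.Model.upload(**model_upload_kwargs(IMAGE_URI))
    endpoint = aip.Endpoint.create(display_name=f"{MODEL_NAME}-endpoint")
    model.deploy(
        endpoint=endpoint,
        machine_type="n1-standard-8",  # hypothetical; T4 GPUs attach to N1 machine types
        accelerator_type=accelerator_type,
        accelerator_count=1,
        min_replica_count=1,  # one replica, and its GPU, stays up even without traffic
        max_replica_count=1,
    )
    return endpoint


if __name__ == "__main__":
    print(deploy().resource_name)
```

Keeping the upload arguments in a separate function makes the routing (health route, predict route, port) easy to review, since a mismatch there is a common reason for a deployment that never becomes healthy.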

After the model is successfully attached to the endpoint, you can query the endpoint with the following code:

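from google.cloud import aiplatform as aip

# Replace with your own project, region, and endpoint ID
PROJECT_NAME = "your-project-id"
REGION = "your-region"
ENDPOINT_ID = "your-endpoint-id"

aip.init(project=PROJECT_NAME, location=REGION)
endpoint = aip.Endpoint(endpoint_name=ENDPOINT_ID)
text_input = """A bottle of aged and exclusive cognac
stands on a reflective surface, in front of a vibrant bar,
hyper detailed, 4K, bokeh"""

def query_endpoint(endpoint, text_input):
    # Vertex expects a list of instances; each instance carries the prompt under "data"
    payload = {"data": text_input}
    response = endpoint.predict(instances=[payload])
    return response

response = query_endpoint(endpoint, text_input)
# response.predictions holds the handler's output for each instance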

Stable Diffusion 1.5 output for the prompt: A bottle of aged and exclusive cognac stands on a reflective surface, in front of a vibrant bar, hyper detailed, 4K, bokeh

Inference takes between 10 and 15 seconds on a T4 GPU. You can speed it up by choosing a more powerful GPU: change the ACCELERATOR_TYPE variable inside

Optional: if the image is too large for the endpoint, or if you want to store the image history

You might want to keep track of the history of generated images, or you might be getting errors because the endpoint response exceeds the 1.5 MB size limit. In either case, I recommend using the handler's postprocess() method to save the image to GCS and return only its GCS path.

Luckily, I have already prepared a handler that does exactly this, so switching from to should do the trick.

WARNING: Before running the deployment with the new handler, you need to:

  • Create a GCS bucket that will store the images
  • Edit the name of the new GCS bucket and folder inside src/stable_diffusion/external_files/
  • Change the service account used by the endpoint inside the file. You need a service account with GCS OWNER permissions.
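As an illustration of this post-processing approach, the sketch below uploads a generated image to GCS with the google-cloud-storage client and returns its gs:// URI. It is not the repository's handler: `BUCKET_NAME`, `FOLDER`, and the timestamped object naming scheme are hypothetical, and it assumes the pipeline returns PIL images (the default output type of diffusers pipelines).

```python
import io
import uuid
from datetime import datetime, timezone
from typing import Any, Optional

# Hypothetical names: use the bucket and folder you configured for the handler.
BUCKET_NAME = "my-stable-diffusion-images"
FOLDER = "history"


def image_blob_name(folder: str, timestamp: Optional[datetime] = None, image_id: Optional[str] = None) -> str:
    """Build a unique, time-sortable object name such as history/20230105T153443Z_<id>.png."""
    timestamp = timestamp or datetime.now(timezone.utc)
    image_id = image_id or uuid.uuid4().hex
    return f"{folder.strip('/')}/{timestamp:%Y%m%dT%H%M%SZ}_{image_id}.png"


def save_image_to_gcs(image: Any, bucket_name: str = BUCKET_NAME, folder: str = FOLDER) -> str:
    """Upload a PIL image as PNG and return its gs:// URI instead of the image bytes."""
    from google.cloud import storage  # pip install google-cloud-storage

    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    blob_name = image_blob_name(folder)
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(buffer.getvalue(), content_type="image/png")
    return f"gs://{bucket_name}/{blob_name}"
```

In a handler, postprocess() would then return `[save_image_to_gcs(image) for image in images]`, so the endpoint response only carries short paths and stays well under the size limit, while the timestamped names keep the history sorted chronologically in the bucket.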


This article provided a step-by-step guide to deploying the Stable Diffusion model on Google Cloud Platform using Vertex AI.

The guide covered essential steps such as:

  • Using TorchServe for deployment
  • Creating and modifying a custom handler for Stable Diffusion
  • Deploying the model using Vertex model registry and Vertex Endpoints

Keep in mind that, while Vertex endpoints can be an effective solution, they do not support scaling down to 0 instances, which can increase costs because the GPU keeps running even when there is no traffic.

Now that the Stable Diffusion model is deployed, we are exploring further possibilities, such as fine-tuning the model on specific Moët Hennessy products to enhance its capabilities even further.

This article was initially published on the Medium Blog by Artefact.