Many of us developers know the feeling of struggling to balance time spent with users, trying to understand their needs, against time spent actually developing software. This is even more apparent in data science, because building an effective system requires a lot of domain knowledge about that system. Over the last couple of years working as an ML engineer with different data science teams, I have often asked myself how to separate the responsibility for optimising model accuracy from the responsibility for building all the software required to make that model functional. My humble opinion is that data scientists should prioritise the accuracy of their models, while ML Engineers should prioritise making sure that those models can be used by the wider business.
As a general rule of thumb in data science projects, the more iterations that you complete, the better. So let’s look at why including an ML engineer from day 1 will help you to iterate, and therefore increase your chances of building a successful system. To cover all aspects we’ll break down each reason into three major topics of data science projects: data, models and infrastructure.
Before getting into it, let me define what I mean by iteration. In this article I am referring to end-to-end iterations of the complete product, typically including steps such as data ingestion, preprocessing, model training & evaluation, provisioning infrastructure, etc. What I do not mean is a quick model iteration in a notebook with the tweak of a hyperparameter. If you are used to working in an agile framework, you can equally think of these iterations as project sprints.
Reason 1: Accelerate the initial POC delivery
Building a skeleton on which you can iterate is the first priority and can be a lengthy process. This skeleton is usually a POC that contains your initial baseline model and a demo of an application, or some other way of exploiting the output of the model.
An ML Engineer will help with:
Infrastructure: selecting compatible cloud resources (VMs, connections to various data sources) and designing the cloud architecture are some initial considerations for the ML Engineer.
Data: getting the necessary data to start building a model and ensuring data availability from various sources with the option to develop new flows if necessary.
Models: ensuring that models being tested are in fact compatible with proposed cloud architecture for model deployment and technical requirements (e.g. latency, compute required, production environment requirements).
The ML Engineer can also help out in this phase by defining software engineering best practices: version control, linting, code architecture, tests, etc.
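One of the technical requirements mentioned above, latency, can be checked against candidate models very early on. The following is a minimal sketch, assuming the model is exposed as a plain Python callable; `measure_latency_ms`, `meets_latency_budget` and the budget value are hypothetical names chosen for illustration, not part of any particular framework.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def measure_latency_ms(predict: Callable[[object], object],
                       samples: Sequence[object],
                       warmup: int = 3) -> dict:
    """Time single-record predictions and summarise them in milliseconds."""
    if not samples:
        raise ValueError("samples must not be empty")

    # Warm-up calls so one-off costs (lazy loading, caches) do not skew the numbers.
    for sample in samples[:warmup]:
        predict(sample)

    timings = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    # Nearest-rank 95th percentile of the sorted timings.
    p95_index = max(0, math.ceil(0.95 * len(timings)) - 1)
    return {
        "n": len(timings),
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
    }


def meets_latency_budget(predict: Callable[[object], object],
                         samples: Sequence[object],
                         budget_ms: float) -> bool:
    """True if the 95th percentile latency stays within the budget."""
    return measure_latency_ms(predict, samples)["p95_ms"] <= budget_ms
```

Running a check like this on each candidate model in the POC gives an early signal when a model that looks accurate in a notebook would not fit the serving constraints. Timings measured on a laptop are only indicative, so the same check is worth repeating on the target infrastructure.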
Reason 2: Accelerate each iteration
Once you have achieved that initial build, the first couple of iterations are often difficult and slow. Speeding up iterations makes it practical to run smaller iterations with a single feature change, which is a much more effective way of developing than changing many things in a model before getting feedback.
Infrastructure: time can be saved optimising storage and compute infrastructure. During these iterations an ML engineer can look to version the infrastructure itself, with Infrastructure as Code (IaC) tooling like Terraform. Using IaC allows for automation of infrastructure deployment directly with CI/CD pipelines, accelerating integration of any changes that need to be made to the existing infrastructure, and the creation of different cloud environments (dev, staging, production). Also using specific cloud components can speed up your workflow, for example building images remotely using GCP’s Cloud Build.
Data: preprocessing pipelines may be built rapidly by data science teams in order to get to modelling quickly. An ML Engineer can help in this phase to streamline your processing queries, whether they are written in SQL, pandas, PySpark, etc. Doing this early on can save a lot of time on iterations in the long run, as this code is run a lot.
Models: complex model architectures can make for a lengthy training process. Also, when a data scientist refers to a “model” they may in fact be referring to a group of 100 models trained on different slices of the data, each with a SHAP explainer for deriving feature importance. An ML engineer can focus on how to parallelise the training pipeline, whether that be on a single VM with multiprocessing in Python, or by distributing the workload across several nodes in the cloud. This itself can be iterated on, but major gains can be made here with surprisingly little effort. Automating deployment of your model with a CI/CD/CT pipeline also greatly speeds up your iterations and ensures repeatability.
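To make the parallel training idea concrete, here is a minimal sketch of training one model per data slice on a single machine with Python's standard `concurrent.futures` module. The "model" is a toy least-squares line fit standing in for a real training routine, and `train_slice`, `train_all` and the `Rows` type are hypothetical names used only for this illustration.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from statistics import fmean
from typing import Callable, Dict, List, Tuple

Rows = List[Tuple[float, float]]  # (feature, target) pairs for one slice


def train_slice(item: Tuple[str, Rows]) -> Tuple[str, Dict[str, float]]:
    """Fit y = slope * x + intercept on one slice by ordinary least squares."""
    name, rows = item
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_mean, y_mean = fmean(xs), fmean(ys)
    var_x = sum((x - x_mean) ** 2 for x in xs)
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in rows)
    slope = cov_xy / var_x if var_x else 0.0
    return name, {"slope": slope, "intercept": y_mean - slope * x_mean}


def train_all(slices: Dict[str, Rows],
              executor_factory: Callable[[], Executor] = ProcessPoolExecutor,
              ) -> Dict[str, Dict[str, float]]:
    """Train one model per slice, spreading the slices across workers.

    ProcessPoolExecutor (the default) uses separate processes, which suits
    CPU-bound training. ThreadPoolExecutor can be passed instead for quick
    local checks where process start-up is not worth it.
    """
    with executor_factory() as executor:
        # train_slice is a module-level function so it can be pickled and
        # sent to worker processes.
        return dict(executor.map(train_slice, slices.items()))
```

When using the default process pool from a script, call `train_all(slices)` under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The same map-over-slices shape carries over to distributed frameworks when a single VM is no longer enough.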
Reason 3: Reduce the cost of each iteration
Having an engineer to monitor your project’s cloud budget is critical, especially for data intensive applications.
Infrastructure: cost is a major variable in the infrastructure selection equation. Once infrastructure is chosen, budget alerts can be put into place to ensure that costly components are monitored closely.
Data: intelligent queries and data storage can also significantly reduce the cost of each iteration. For example, expensive aggregations should be run sparingly during model iterations, ideally computed once and reused rather than recomputed on every run.
Models: parallelising your training pipeline can also save you money by reducing the uptime of costly machines or the runtime of serverless components.
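As an illustration of running aggregations sparingly, the sketch below caches the result of an expensive aggregation on disk, keyed by the query parameters, so that repeated model iterations reuse it instead of paying for the query again. It assumes the result is JSON-serialisable; `cached_aggregate`, `compute` and `cache_dir` are hypothetical names, and in a real project the cache might live in cloud storage or a materialised table rather than local files.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict


def cached_aggregate(params: Dict[str, object],
                     compute: Callable[[Dict[str, object]], dict],
                     cache_dir: Path) -> dict:
    """Return the aggregate for `params`, computing it only on a cache miss."""
    # Serialise with sorted keys so identical parameters always map to the
    # same cache file, whatever order they were passed in.
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(payload).hexdigest()[:16]
    cache_file = cache_dir / f"aggregate_{key}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    result = compute(params)  # the expensive query only runs here
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

The cache has no expiry in this sketch, so it only suits aggregates over data that does not change between iterations, for example a fixed historical training window.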
Reason 4: Ensure repeatability & interpretability of each iteration
Achieving fast iterations with a quality feedback loop in your project is great, but if you can’t replay each of those scenarios, it’s not much use. Having a repeatable pipeline implicitly means that you should have some way of monitoring pipeline runs, so that you can identify runs by specific parameters or performance metrics and roll back to them if necessary. Setting this up robustly during development helps data scientists to experiment freely (without the need for the infamous Untitled12.ipynb) and prepares the pipeline for production monitoring.
Infrastructure: linking the training code version to the infrastructure code version is the “extra mile” here, but is necessary to provide full rollback capabilities to a previous run. Ensuring repeatability and monitoring on a pipeline run basis is for me the essential first level of ML Ops that teams should strive for. Cloud platforms have services (GCP’s Vertex AI for example) that can be quick to set up, but you may also consider taking the “best of breed” approach using open source tooling. The tradeoff here is balancing greater functionality of specific open source tooling vs the increased complexity of the system’s overall infrastructure.
Data: save all data objects at each stage in the pipeline. Depending on volumes, the priority is to save the train / test sets of each run.
Models: as above, save all models for each run, with all the necessary parameters and metrics. Another tip is to log a comment on each run describing what has been changed for that specific run, as you would with a git commit message, so that all experiments during development are documented.
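The run tracking described above can be sketched with nothing more than a JSON Lines file. This is a minimal illustration rather than a replacement for a managed or open source tracking tool: `log_run`, `load_runs`, `best_run` and `fingerprint` are hypothetical helpers, and the code version is assumed to be passed in by the caller (for example a git commit hash).

```python
import hashlib
import json
import time
import uuid
from pathlib import Path
from typing import Dict, List, Optional


def fingerprint(path: Path) -> str:
    """Short SHA-256 of a file, tying a run to its exact train / test data."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:16]


def log_run(log_file: Path, params: Dict[str, object],
            metrics: Dict[str, float], comment: str, code_version: str,
            data_files: Optional[Dict[str, Path]] = None) -> str:
    """Append one run record to a JSON Lines file and return its run id."""
    record = {
        "run_id": uuid.uuid4().hex,
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "code_version": code_version,  # supplied by the caller, e.g. a commit hash
        "comment": comment,            # what changed in this run
        "params": params,
        "metrics": metrics,
        "data": {name: fingerprint(p) for name, p in (data_files or {}).items()},
    }
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record["run_id"]


def load_runs(log_file: Path) -> List[dict]:
    """Read every logged run, oldest first."""
    with log_file.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def best_run(log_file: Path, metric: str, higher_is_better: bool = True) -> dict:
    """Return the run to roll back to, judged by a single metric."""
    runs = [run for run in load_runs(log_file) if metric in run["metrics"]]
    if not runs:
        raise ValueError(f"no run has logged the metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(runs, key=lambda run: run["metrics"][metric])
```

Because each record carries the code version, the data fingerprints and the comment, a run can be identified by its parameters or metrics and then reproduced from the matching commit and saved data sets.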
Reason 5: Avoid at all costs the hectic “industrialisation” iterations
When data science emerged it was very exploratory, and deploying any model that had demonstrated good performance on historical data required a massive effort from a group of software engineers. This “industrialisation” phase is a very painful experience, as the development environment (flat files & Python notebooks) is very different from the production environment (automated data flows with production data & a production coding environment). The most successful projects that I have worked on are those where we were able to copy the production environment as closely as possible in dev early on. This reduces time to production and allows you to iterate safely in dev, deploying to prod when you are happy with an iteration.
Infrastructure: emulating the necessary production infrastructure in dev is not always easy and can be costly. This is where infrastructure as code is useful, as it allows you to swap easily between environments.
Data: something that separates data science development from traditional software engineering, or even data engineering, is that data scientists require production data in dev. Sandbox data (excluding some data or including some synthetic test data) is good practice during development for regular data engineering, but it can be a big waste of time for data science and can have a large impact on the whole data science pipeline. Having read-only access to production tables is therefore something to start negotiating with your data team from day 1.
Models: from the start of the project, only one model (or modelling approach) should be present in your production code. All experiments should stay in notebooks or separate temporary scripts in another folder. This stops you from accumulating dead code in your production code base, and makes the code easier to maintain and new devs easier to onboard.
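One lightweight way to keep dev close to prod on the code side is to run exactly the same pipeline in every environment and change only a small configuration object. The sketch below assumes the environment is selected through an `APP_ENV` variable; that variable, the `EnvConfig` fields and all table and path names are hypothetical placeholders.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnvConfig:
    """Everything that is allowed to differ between environments."""
    name: str
    source_table: str    # read-only input data
    output_table: str    # where predictions are written
    model_location: str  # where trained models are stored


# Placeholder values: dev reads the same production table (read-only access)
# but writes its outputs and models to dev locations.
ENVIRONMENTS = {
    "dev": EnvConfig(
        name="dev",
        source_table="prod_db.transactions",
        output_table="dev_db.predictions",
        model_location="models/dev",
    ),
    "prod": EnvConfig(
        name="prod",
        source_table="prod_db.transactions",
        output_table="prod_db.predictions",
        model_location="models/prod",
    ),
}


def load_config(env: Optional[str] = None) -> EnvConfig:
    """Pick the configuration from an explicit name or the APP_ENV variable."""
    env = env or os.environ.get("APP_ENV", "dev")
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment {env!r}, expected one of {sorted(ENVIRONMENTS)}")
    return ENVIRONMENTS[env]
```

The pipeline code then only ever refers to `config.source_table` or `config.output_table`, so promoting an iteration from dev to prod becomes a configuration change and not a rewrite.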
In summary, building models and building the software surrounding those models should be two priorities from the start of each project. Having separate streams with different responsibilities can therefore help teams concentrate on both in parallel. The role of the ML Engineer is evolving day by day, and I’d love to hear your views on anything that I’ve missed!