1 — Taming the indecency of foundation models
2021 had its share of new big models. After GPT-3 (Brown et al., 2020) the year before, you may have heard about CLIP or, more recently, Gopher. These “foundation models”, as Bommasani et al. (2021) call them, because their architecture is often re-used, slightly adapted to a specific machine learning task, or further fine-tuned via transfer learning, are continuing their journey, and there seems to be no limit to the number of parameters optimised or the amount of data leveraged to train them. What is interesting is that these models bring large productivity gains with them, leveraging, as Bommasani and co-authors note, the combination of emergence and homogenization.
Let’s start with homogenization: not only are most models you see in the literature adapted from these generic architectures (think of BERT, which is ubiquitous these days), but practitioners also often leave the architecture untouched and simply fine-tune an available “big” model on a downstream task using transfer learning. This architectural “invariance” means that improvements to one foundation model can easily flow down to all its child models.
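To make the idea concrete, here is a minimal, NumPy-only sketch of transfer learning. A frozen stand-in “backbone” plays the role of the foundation model (in practice it would be a real pretrained encoder such as BERT, and the data would not be synthetic); only a small task-specific head is trained on the downstream data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained foundation model: a frozen feature extractor.
# In practice this would be e.g. a BERT encoder whose weights stay fixed.
W_backbone = rng.normal(size=(4, 8)) * 0.3  # frozen, never updated below

def backbone(x):
    """Frozen feature extractor: its parameters are not fine-tuned."""
    return np.tanh(x @ W_backbone)

# Toy downstream task (hypothetical): binary classification in 4-d space.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Task head: the only trainable part (this is the "transfer learning" step).
w_head = np.zeros(8)
b_head = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(300):
    feats = backbone(X)                  # backbone output, weights frozen
    p = sigmoid(feats @ w_head + b_head)
    grad = p - y                         # gradient of binary cross-entropy
    w_head -= lr * feats.T @ grad / len(X)
    b_head -= lr * grad.mean()

acc = ((sigmoid(backbone(X) @ w_head + b_head) > 0.5) == y).mean()
print(f"train accuracy with frozen backbone: {acc:.2f}")
```

The point of the sketch is the division of labour: any improvement to `W_backbone` (the parent model) would benefit every head trained on top of it, without touching the fine-tuning code.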
Emergence, next, comes from the way these models handle training data. Trained under self-supervision, relying on raw data that has not been labelled for a specific purpose, they are starting to show that they can answer needs they were not designed for in the first place (a “zero-shot” capability). Complex machine learning tasks that suffer from very poor data availability may be better solved by leveraging the “knowledge” these models extract from large chunks of data. We are still at an early stage and results are often more disturbing than successful, but GPT-3, for instance, can learn to solve a task directly from a prompt it has not seen during training (at least, theoretically should not have seen…). This emergence of unplanned capabilities means that we might be moving towards more capable, general-purpose machine learning.
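As an illustration only, here is what “learning from a prompt” can look like in code. The helper and the translation examples are hypothetical, and the actual call to a large model (typically behind an API) is deliberately left out:

```python
# In-context ("few-shot") learning: instead of fine-tuning any weights, we
# pack a handful of demonstrations into the prompt and let the model infer
# the pattern. The task below (EN -> FR translation) is purely illustrative.
def build_few_shot_prompt(examples, query):
    """Format (input, output) demonstrations followed by the new query."""
    lines = [f"Input: {x}\nOutput: {y}\n" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    examples=[("cheese", "fromage"), ("car", "voiture")],
    query="house",
)
# A real completion API would now be called with `prompt`; the model is
# expected to continue the pattern, even though it was never trained on
# this exact task framing.
print(prompt)
```

Nothing task-specific is trained here; the “programming” lives entirely in the prompt, which is what makes the zero/few-shot behaviour so striking.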
These benefits do not come without structural changes. Since these models are large by nature, the list of organisations and companies able to create them is restricted. This should definitely boost the usage of machine learning through proprietary AI APIs or prompt interfaces, which abstract the training and maintenance of foundation models away from engineers. On the other hand, as more models come to depend on a single parent, we can expect more regulatory, ethical and social scrutiny of these models (as child models inherit the biases of their foundation model). There will definitely be more and more value in working with people who know the capabilities, limits and biases hidden behind these interfaces, in one way or another… starting with their carbon footprint.
2 — Making AI sustainable
It is no surprise that these new forms of AI come with a high cost in terms of carbon emissions: Strubell et al. estimate that a single training of BERT on GPUs emits roughly as much CO2 as a New York to San Francisco flight, while Taddeo et al. evaluate that a single GPT-3 training emits as much CO2 as 49 cars do in a year.
AI was first seen as a valuable tool for solving climate-change-related problems (see the many ideas from the 2019 NeurIPS “Tackling climate change with machine learning” workshop), but many experts are also pointing at its carbon footprint. “Sustainable AI”, as Aimee van Wynsberghe puts it, should encompass not only AI for sustainability but also the sustainability of AI (which should not be limited to ecological concerns either).
As Abhishek Gupta recommends, working towards sustainable AI means exploring new ways of working. TinyML could help us avoid the energy cost of shipping data over the network for remote computation, while carbon awareness should help us decide in which geographical location to best train and deploy our machine learning models. A more sensible usage of existing hardware and services should, quite simply, also be everybody’s concern.
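Carbon awareness can be surprisingly simple to express. The sketch below uses made-up grid intensities (the region names and numbers are illustrative, not real data); the idea is just to prefer the region, or time window, where each kWh emits the least CO2e:

```python
# Hypothetical average grid carbon intensities, in gCO2e per kWh.
# Illustrative values only -- real figures vary by country and by hour.
GRID_INTENSITY_GCO2_PER_KWH = {
    "region-hydro": 30,
    "region-mixed": 250,
    "region-coal": 700,
}

def training_footprint_kg(energy_kwh: float, region: str) -> float:
    """Estimate the CO2e footprint (kg) of a job drawing `energy_kwh`."""
    return energy_kwh * GRID_INTENSITY_GCO2_PER_KWH[region] / 1000

def greenest_region(regions) -> str:
    """Pick the region where one kWh emits the least CO2e."""
    return min(regions, key=GRID_INTENSITY_GCO2_PER_KWH.get)

job_kwh = 120  # hypothetical energy draw of one training run
best = greenest_region(GRID_INTENSITY_GCO2_PER_KWH)
print(best, training_footprint_kg(job_kwh, best), "kg CO2e")
```

The same arithmetic (energy × grid intensity) is what tools like codecarbon automate, using measured power draw and real regional intensity data instead of these placeholders.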
Whatever the solutions used to embrace sustainable AI, we expect decision makers to be more likely to think twice before launching AI projects. This raises the challenge of measuring machine learning’s environmental impact.
In 2022, machine learning development should be paced by more systematic reporting of CO2e alongside performance metrics (see for instance codecarbon), more transparency from cloud providers (see the GCP carbon footprint dashboard) and, above all, a deeper reflection on the benefits and costs of leveraging AI. The most convincing projects will be the ones adopting a holistic approach: not only quantifying the carbon footprint of computation but weighing it against the efficiency brought by these new products, without forgetting a potential rebound effect. Measuring the carbon footprint of the big models alone is not enough: we should take into account the whole end-to-end pipeline (training, deployment, monitoring) and its impact on people’s ways of working.
3 — Adding a touch of Zen to your MLOps
This matters because the production side of machine learning is getting more and more intricate and sophisticated. MLOps in particular continued to boom this year and had its fair share of innovations and buzzing concepts, as Matt Turck explains. Think simply of feature stores, streaming capabilities and all the DataOps initiatives we will cover just below.
While 2021 was yet again a booming year for MLOps, we have also started to see thoughtful criticism of its very buzz. And the arguments are fair: the MLOps landscape is barely legible, encompassing hundreds of concepts and tools, often overkill, and one could reasonably argue that an average project will not need them all. The majority of “reasonable scale” companies that are not FAANG (i.e. no huge technical teams, no infinite ROI generated by AI, reasonable data volumes) should keep it simple.
It remains difficult to predict how this landscape will evolve: without any doubt we should expect more startups to appear, along with some homogenization and consolidation behind big players. No-code and low-code tools will certainly continue to grow and make these features available to everyone. However things turn out, we also really believe in the emergence, in the next few years, of open standards and a “canonical ML stack” such as the one the AI Infrastructure Alliance intends to build (disclaimer: Artefact is part of the Alliance).
So we wish you a touch of Zen in your MLOps in 2022. That means, first, taking a step back and pruning your stack down to what really matters, the efficiency of your machine learning models and the productivity of your data scientists, for example with the “aggressively helpful” mentality the Stitch Fix platform team has adopted. Then, as most antipatterns in a machine learning project seem to come from the data side, it means consolidating the foundations of your project: how you source and process the data itself. As Ciro Greco puts it, data should indeed become a “first-class citizen” of your production stack.
4 — Making data a product rather than a simple input
2021, with its renewed interest in data, could well be summed up as “it has always been about data”, as evidenced, of course, by the Data-Centric AI movement launched by Andrew Ng. Not only is data the fuel of your machine learning model’s performance, it is also where issues creep in, as unbalanced, biased or poorly labelled data will definitely have a detrimental impact on downstream algorithms. For a given, fixed model, we should thus be able to gain quality just by working on its input: the data.
What is interesting is that this movement should reconcile everyone along the value chain, from the data engineering side and its recent calls to nurture DataOps practices (we ourselves took real pleasure this year in including tools such as Great Expectations in all our projects) to the data scientists and analysts, who will not lack sophisticated methodologies to refine the data at hand (augmentation, labelling, bias correction, sampling…). Of course, we think this won’t be possible without a clear investment from upper management and explicit data governance processes, to first identify, then structure, the different domains and their owners within the organisation.
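In the spirit of tools like Great Expectations, here is a minimal, library-free sketch of declarative data checks. The column names and rules are hypothetical, and this is not the Great Expectations API itself, just the underlying idea of expectations run as a validation suite:

```python
from typing import Any, Callable

Row = dict[str, Any]

def expect_column_not_null(column: str) -> Callable[[list[Row]], bool]:
    """Expectation: no row has a null value in `column`."""
    return lambda rows: all(r.get(column) is not None for r in rows)

def expect_column_between(column: str, lo: float, hi: float) -> Callable[[list[Row]], bool]:
    """Expectation: every value in `column` lies in [lo, hi]."""
    return lambda rows: all(lo <= r[column] <= hi for r in rows)

# Declarative suite: readable names mapped to checks (columns are made up).
EXPECTATIONS = {
    "customer_id is never null": expect_column_not_null("customer_id"),
    "basket_value in [0, 10000]": expect_column_between("basket_value", 0, 10_000),
}

def validate(rows: list[Row]) -> dict[str, bool]:
    """Run every expectation and report pass/fail, like a validation suite."""
    return {name: check(rows) for name, check in EXPECTATIONS.items()}

rows = [
    {"customer_id": "a1", "basket_value": 42.0},
    {"customer_id": "b2", "basket_value": 12_500.0},  # out of range -> fails
]
print(validate(rows))
```

Running such a suite on every ingestion, and failing the pipeline when an expectation breaks, is what turns data from a passive input into something with a tested contract.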
This, combined with data becoming easier and easier to move around thanks to initiatives such as Airbyte’s and the continuous improvement of data-sharing technologies in the modern data stack, should allow companies to find new perspectives in the data itself, in parallel with what AI already brings in terms of automation and insights.
That’s it! In this period of New Year resolutions, we wish you to tame the indecency of foundation models, make AI sustainable, add a touch of Zen to your MLOps and, finally, nurture your data as a product rather than simply considering it an input. And you? What surprised you the most last year? What do you expect will happen this year?