dbt Coalesce 2022 recap

Author

Benoît Goujon

Data engineer at Artefact France

This year’s edition took place in New Orleans. As in past editions, we learned a ton about the analytics engineering landscape.

The event, organized by dbt Labs, was back this year. You could attend in person in New Orleans or watch the talks online.

With dbt adoption on the rise, we expected a lot from this conference. The sessions were not limited to the use of dbt itself. For instance, some of them covered career tracks for data teams.

Without further delay, here are, in my opinion, the key lessons from this edition:

you can now write your models in Python
the dbt Cloud UI and IDE have been revamped for a much better developer experience
dbt introduced its own version of the semantic layer
dbt aims to be at the heart of the modern data ecosystem

Let’s dive into the details.

Python models, finally!

It was certainly the most anticipated feature. You can now run Python models, and they behave much like SQL models.

This feature is a game changer. I think many of us have hit the same issue: a workflow that can’t run end to end in dbt because one or two operations are very tricky to write in SQL. This is painful because it requires an extra layer, and nobody wants to manage the back and forth between dbt and another component.

This was particularly the case for advanced statistics, text manipulation, and everything ML-related (feature engineering, data enrichment …). Those edge cases are the target use cases of Python models. The product managers were very clear during the keynote: Python models are meant for basic use cases that involve data transformations. Calling external APIs is not recommended.

So, how does it work?

First, similar to SQL models, the code will be executed on your cloud data platform.

Second, in the same way as SQL models, you must adapt your syntax to the underlying cloud platform. In SQL, you need to use the appropriate SQL dialect. In Python, the set of available libraries differs from one platform to another.
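
To make the parallel with SQL models concrete, here is a minimal sketch of what a Python model file can look like. The `model(dbt, session)` entry point, `dbt.config` and `dbt.ref` follow dbt's documented Python model interface. The file name and the upstream model `stg_orders` are hypothetical, and the kind of DataFrame returned by `dbt.ref` depends on your platform.

```python
# models/orders_passthrough.py  (hypothetical file name)


def model(dbt, session):
    # Configuration can be set from inside the model,
    # much like a config() block in a SQL model.
    dbt.config(materialized="table")

    # dbt.ref() plays the role of ref() in SQL: it declares the dependency
    # and returns a DataFrame whose type depends on the platform.
    orders = dbt.ref("stg_orders")  # hypothetical upstream model

    # Platform-specific transformations would go here.

    # dbt materializes whatever DataFrame the function returns.
    return orders
```

dbt calls `model` for you during a run, so you never invoke it yourself.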

The feature is available on three data platforms as of today:

Snowflake
BigQuery
Databricks

For example, if you use Snowflake, you can leverage Snowpark for your transformations. Note that the feature is still in its early days, as mentioned by Eda Johnson and Venkatesh Sekar in their talk “Empowering pythonistas with dbt and snowpark”. Snowpark is still in public preview.
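
As an illustration of the text manipulation use case on Snowflake, here is a hedged sketch, not the speakers' code. It assumes that `dbt.ref` returns a Snowpark DataFrame (so `to_pandas()` is available) and that returning a pandas DataFrame is accepted, as described in dbt's documentation. The model `stg_reviews`, the columns `COMMENT` and `COMMENT_CLEAN`, and the helper `normalize_text` are all hypothetical.

```python
import re
import unicodedata


def normalize_text(value):
    """Strip accents, collapse whitespace and lowercase a string."""
    if not isinstance(value, str):
        return value  # leave NULLs and non-text values untouched
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"\s+", " ", without_accents).strip().lower()


def model(dbt, session):
    dbt.config(materialized="table")

    # Assumption: on Snowflake, dbt.ref() returns a Snowpark DataFrame,
    # which to_pandas() converts to a pandas DataFrame.
    reviews = dbt.ref("stg_reviews").to_pandas()  # hypothetical model

    # Hypothetical column names, uppercase as unquoted Snowflake identifiers are.
    reviews["COMMENT_CLEAN"] = reviews["COMMENT"].apply(normalize_text)

    return reviews
```

Converting to pandas loads the whole table in memory, so this pattern is better suited to modest volumes. For larger tables, staying with Snowpark DataFrame operations is probably the safer choice.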

As stated during the keynote, there is room for improvement to get closer to the experience of a Python software engineer (facilitate code reuse across models, provide test capabilities, and use docstrings for documentation …).

A lot of improvements for dbt Cloud

A few months ago, a blog post entitled “We need to talk about dbt”, written by Pedram Navid, made waves. Tristan Handy, the CEO of dbt Labs, replied to Pedram’s concerns, especially the ones about dbt Cloud. In the original blog post, the long-time dbt practitioner pointed out the poor experience he had had with dbt Cloud. Tristan agreed that dbt Labs should work hard to improve the developer experience.

And they did! This week, dbt Labs announced a complete revamp of the cloud IDE, UI improvements, and a reduction of the latency for common operations such as saving a file.

This is good news for dbt Cloud adopters!

The semantic layer is a structural shift in the way you manage your data

This is a hot topic!

During the keynote, the speakers defined the semantic layer as the “platform for compiling and accessing dbt assets in downstream tools”.

The semantic layer aims to solve common data governance challenges:

the lack of proper access management
the duplication of data assets, which results in technical debt and inconsistency between your KPIs
the lack of documentation of your data assets, which is coupled with discoverability issues

The goal here is to extend the scope of dbt. For now, that scope is limited to the transformation layer. The semantic layer would sit on top of it.

This makes sense. Metrics were introduced in version 1.0, and that was the first step toward the vision of a semantic layer.

dbt at the heart of the modern data stack ecosystem

What struck me during this conference was the number of partnerships announced. Also, a majority of the talks were given by partners.

Software vendors like Atlan, Collibra, or Monte Carlo need to integrate with dbt because their customers ask for it. dbt is slowly becoming the standard for data transformation. You want to see your transformations in your global data lineage, which might be managed with an external tool like Collibra. You also want to monitor the results of your dbt tests with your preferred tool, and so on. In short, you need your tools to integrate with one another.

I have the feeling that dbt Labs, unlike Dataform (the only competitor to dbt as of today), wants to remain cloud-neutral. They offer a lot of integrations with niche solutions, for example to better manage your data quality or your metadata.

Conclusion

That’s a wrap! This edition was very rich, and we are ending the week with a lot of discussions about the announcements. That’s what makes this job exciting!

Speaking of which, we are hiring at Artefact! I’m sure you didn’t see it coming 😉

This article was initially published on Medium.com.