Ready-to-use SQL template explained
Let’s go to the heart of the matter with the expected query:
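The original query itself is not reproduced here, so below is a sketch reconstructed from the explanation that follows; the CTE bodies are placeholders to be replaced with your own queries, and the `flag, *` projection may need adapting to your SQL dialect (e.g. an explicit column list or a table alias before `*`):

```sql
-- Template: compare two datasets and keep only the mismatching rows
WITH dataset_1 AS (
    SELECT ...  -- your query before evolution
),
dataset_2 AS (
    SELECT ...  -- your query after evolution
),
dataset_1_only AS (
    SELECT * FROM dataset_1
    EXCEPT
    SELECT * FROM dataset_2
),
dataset_2_only AS (
    SELECT * FROM dataset_2
    EXCEPT
    SELECT * FROM dataset_1
)
SELECT 'dataset_1_only' AS flag, * FROM dataset_1_only
UNION ALL
SELECT 'dataset_2_only' AS flag, * FROM dataset_2_only;
```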
Now, let’s explain it:
First, we use CTEs — Common Table Expressions (WITH … AS statements) — to identify the two datasets to be compared. This way, the rest of the query can stay unchanged.
Then, we build two similar subqueries, aliased dataset_1_only and dataset_2_only, to keep the rows that are only present in dataset 1 (compared with dataset 2) and vice versa, thanks to the EXCEPT set operator.
Finally, a complementary flag field is defined to identify the source of mismatching rows from both subqueries. These are then gathered via the UNION ALL set operator.
Note: obviously, having two datasets with comparable fields is a prerequisite for using the above query.
Enough with theory, now to practice! Suppose a supermarket wants to retrieve, in a single table, all products eligible for a promotion according to the promotion conditions of the week.
As a starting point, we consider the following products in stock:
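The stock table is not reproduced here; the following is a hypothetical reconstruction for illustration. The product IDs and shelves are taken from the discussion later in the article, but the table name, column names, prices, and the rows for the fruits shelf are assumptions:

```sql
-- Hypothetical products_in_stock table (illustrative names and values)
CREATE TABLE products_in_stock (
    product_id INT,
    shelf      VARCHAR(20),
    price      DECIMAL(10, 2)
);

INSERT INTO products_in_stock VALUES
    (10, 'fruits',     2.00),
    (20, 'fruits',     3.00),
    (30, 'vegetables', 1.50),
    (40, 'vegetables', 2.50),
    (50, 'herbs',      1.00),
    (60, 'ice-creams', 4.00);
```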
According to the promotion conditions in week 1:
10% discount on fruits, vegetables and herbs shelves
the dataset_1 is:
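As a sketch, dataset_1 could be produced by a query like the one below, assuming a hypothetical products_in_stock table with product_id, shelf, and price columns (these names are illustrative, not from the original):

```sql
-- Week 1: 10% discount on the fruits, vegetables and herbs shelves
SELECT
    product_id,
    shelf,
    price,
    ROUND(price * 0.9, 2) AS reduced_price
FROM products_in_stock
WHERE shelf IN ('fruits', 'vegetables', 'herbs');
```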
In week 2, the promotion conditions evolve to:
10% discount on fruits and ice-creams shelves
20% discount on vegetables shelf
no discount on herbs shelf
The data pipeline evolution generates the dataset_2:
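Under the same assumed products_in_stock structure (hypothetical names), the evolved query producing dataset_2 could look like:

```sql
-- Week 2: 10% on fruits and ice-creams, 20% on vegetables, herbs dropped
SELECT
    product_id,
    shelf,
    price,
    ROUND(price * CASE shelf
                      WHEN 'vegetables' THEN 0.8  -- 20% discount
                      ELSE 0.9                    -- 10% discount
                  END, 2) AS reduced_price
FROM products_in_stock
WHERE shelf IN ('fruits', 'vegetables', 'ice-creams');
```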
But… are we really confident in the results? Typical questions are:
are there any missing or extra rows?
are implemented evolutions correct on impacted fields?
is there any regression on non-impacted fields?
Note: even if these questions seem easy to answer intuitively here, thanks to a trivial example on a small dataset, real use cases usually involve many rows and columns coming from complex queries (transformations, joins, aggregates, window functions, …), which is where this query takes on its full meaning.
Let’s try to answer the 3 questions interpreting the comparison query results:
No row with products from the fruits shelf: fine, as the “10% discount on fruits” remained unchanged from week 1 to week 2
Product_ID #50 (herbs shelf) is missing from dataset_2 (only in dataset_1): fine, as the “discount on herbs” was stopped in week 2
Product_ID #60 (ice-creams shelf) is missing from dataset_1 (only in dataset_2): fine, as the “discount on ice-creams” was introduced in week 2
Rows concerning Product_ID #30 and #40 (vegetables shelf) diverge on the reduced_price field: fine, as the “discount on vegetables” increased from 10% in week 1 to 20% in week 2
Finally, everything looks good:
are there any missing or extra rows? No!
are implemented evolutions correct on impacted fields? Yes!
is there any regression on non-impacted fields? No!
Tips and tricks
In the CTEs, even if you can copy-paste your ingestion SQL query before/after the evolution, it can be a good idea to store the results in temporary tables to simplify the comparison, improve query performance, and benefit from caching if you run the comparison multiple times.
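A minimal sketch of this materialization step, keeping in mind that the exact CREATE TEMPORARY TABLE syntax and lifetime vary by engine:

```sql
-- Materialize each version once, then point the comparison CTEs at the temp tables
CREATE TEMPORARY TABLE dataset_1 AS
SELECT ...;  -- your query before evolution

CREATE TEMPORARY TABLE dataset_2 AS
SELECT ...;  -- your query after evolution
```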
In the real world, divergences between two datasets can be messy. Sorting the results by key (the granularity of the datasets) and then by flag can greatly help in interpreting and comparing equivalent rows coming from both datasets.
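For instance, the final SELECT of the template could be sorted as follows, where product_id stands for whatever key column(s) define the granularity of your datasets:

```sql
-- Sort mismatches so equivalent rows from both datasets sit next to each other
SELECT 'dataset_1_only' AS flag, * FROM dataset_1_only
UNION ALL
SELECT 'dataset_2_only' AS flag, * FROM dataset_2_only
ORDER BY product_id, flag;  -- key column(s) first, then flag
```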
To ease the investigation and identify the origin of a gap, suspicious common fields can be removed from the comparison (i.e., commented out in the CTEs): if there is no result, it means all remaining compared fields are equal. You can then focus on the remaining suspicious fields only for the next comparison, and proceed step by step.
Note: an equivalent analysis could have been performed with a LEFT JOIN strategy, but it would have been much more difficult to maintain (NULL handling and field-by-field comparison) and less efficient, as set operators compare whole rows natively, which joins do not.
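To illustrate why the join-based alternative is harder to maintain, here is one direction of such a comparison as a sketch (column names are illustrative; IS NOT DISTINCT FROM is the NULL-safe equality of PostgreSQL and some other engines, and is not available everywhere, which is precisely the kind of portability and NULL-handling burden the note refers to):

```sql
-- Rows of dataset_1 with no exact counterpart in dataset_2
-- (every compared column must be listed explicitly, NULL-safely)
SELECT d1.*
FROM dataset_1 AS d1
LEFT JOIN dataset_2 AS d2
    ON  d1.product_id    IS NOT DISTINCT FROM d2.product_id
    AND d1.shelf         IS NOT DISTINCT FROM d2.shelf
    AND d1.reduced_price IS NOT DISTINCT FROM d2.reduced_price
WHERE d2.product_id IS NULL;
```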
This templated query therefore makes it easier for developers to quickly validate changes on complex data pipelines. It is useful as a complement to more traditional unit tests, and can even be considered in a more general way when comparing any two datasets with similar structures. Last but not least, the logic of this query is even easy to learn by heart!
Thank you for reading, I hope it was clear and I would be glad to hear your feedback :)