String filters in pandas: you’re doing it wrong

Author

Mael Deschamps

Senior Machine Learning Engineer at Artefact

Read our article on

Filtering data using ID == ‘string’ in Pandas is something you should avoid as the scalar_compare operator leads to performance bottlenecks. There is many way to bypass it, for example by partitioning your Dataframe into a dictionary using the ID in question as a key.

Introduction | String filters in pandas

The simplification of hardware management, brought by cloud solutions, pushes us more and more to turn away from the problems of code optimisation.

But scaling up and increasing computing power is not always the solution as it leads to an escalation of costs and computing power is not infinite.

By giving myself the challenge to fit all my data preparation on a simple container, I quickly realised that with the knowledge of a few tricks, optimising your code can sometimes be just as simple as instancing a bigger instance.

In this article I’d like to come back to a single change of code that allowed me to reduce drastically the time spent on the features calculation during the development phase of a propensity model. This change is common enough to be applied to many other situations.

It doesn’t aim to be the most optimal solution, but seek to be a quick option to decrease the computation time in an efficient way, in the spirit of the Pareto principle.

Context

During this mission, I was in charge of automating the data preparation process and the prediction of models developed by our Data Scientist team. To simplify the data-flow, transactional data was uploaded everyday and had to go through a first Preprocessing step, followed by a second step of Feature Computation before reaching the last step of Model Prediction using the trained model.

The Feature computation phase is the one that took the longest time to execute: indeed many features were computed at the customer level, which resulted in the recurrent execution of a surprisingly time-consuming line in the code :

which resulted in the recurrent execution of a surprisingly time-consuming line in the code :

This single line was timed 18ms, which meant that with my 33.717 clients to evaluate daily, I was spending around 10 min of raw computation time per customer level features, reduced to 1 min and 16 second per feature thanks to the parallelisation of the operation on my 8 available CPUs.

As we were working with B2B data, it was necessary to compute features at the customer level, as one customer was representing in fact one business with sometimes multiple orders per day.

Experimentation | String filters in pandas

After investigating a little using the %%prun magic command, I was able to identify the source of this processing bottleneck : the pandas._libs.ops.scalar_compare operator which was under-optimised in my version of Pandas (1.3.1).

By simply replacing this “==” operator by “isin”, which is not very intuitive as I was comparing a single string, I already divided the computing time by 2.5 times, going from 18ms by operation to 7.95ms.

Still looking for optimisation, I came across a Stackoverflow post pushing the use of Categorical type to improve further the operation.

This last implementation allowed me to divide the computing time more than 36 times. I could however observe a nuance to this trick, as the category type does not behave like a classic str during all operations (see example bellow when using .groupby() in pandas) so I had to convert it back to str at one point.

Hypothesis

But why is that ? How can a simple == operation between two string takes more time than an isin() operation comparing lists, or a categorical one ?
Well, to answer the first question we would definitely need to dig out the code behind the execution of scalar_compare which uses mostly Cython and compare it to the code behind the isin() method.

Fortunately, the answer to the second part seems to be more intuitive: when comparing two string values, we are comparing an infinite number of possibilities together, whereas when comparing two categories, the number of options is set by the different unique categories that exist. It seems much easier to compare two entities when our number of options is fixed.

My thirst for optimisation was still not quenched, I decided to take a step back from the current method. I came up with a new approach: partitioning my Dataframe into a dictionary that I will then use to filter my customers when computing my features.
In terms of code, this simply translated into the following lines:

customers_list = list(df.ID_Customer.unique()) df_dict = {elem: df[df.ID_Customer == elem] for elem in customers_list}

Building that dictionary cost me 32 seconds of computing time, but using this partitioned Dataframe I was now able to filter my data in a few nanoseconds.

Conclusion | String filters in pandas

After spending a couple of hours in the experimentation phase, I was happy with the result :

The initial computing time per customer filtering was now divided 348 000 times, going from 18ms to 51.7ns, or from 10min to 2.65ms per feature computed in my case, taking into account the time spend on the partitioning.

Immediately, the impact of this small change allowed me to reduce the calculation time of my complete Feature computation phase by 90%, from 40’49” to 7’27”. Using a CO2eq estimation method that I will detail in my next article, this modification saved at least 170$/year + 22kgCO2/year and potentially way more with the growing customer list and the roll-out of the project in other countries.

Medium Blog by Artefact.

This article was initially published on Medium.com.
Follow us on our Medium Blog !

Read Our Article