The power of modulo in data analysis
The so-called modulo operator exists in different programming languages. But how can you use this operator, and what for? In this article, I will show you three practical use cases.
by Salvatore Cagliari
What is modulo?
In short, the modulo operator returns the remainder of a division. The modulo operation is written as "A modulo B", "A mod B" or "A % B" (A being the dividend and B the divisor). For example, the result of "5 mod 2" is 1, because 5 divided by 2 has a quotient of 2 and a remainder of 1. "9 mod 3" evaluates to 0, because 9 divided by 3 has a quotient of 3 and a remainder of 0.
Many programming languages have an operator or a function to calculate modulo. For example, T-SQL has the % operator, and DAX has the MOD() function.
If you don’t have a way to calculate modulo directly, you can calculate A modulo B with the following formula, where A / B is an integer division (the fractional part is discarded):
A - A / B * B → A - ((A / B) * B)
In T-SQL, the following expressions both return the same result: 11
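As a minimal sketch of this equivalence, the two statements below compare the built-in operator with the manual formula. The values 27 and 16 are assumptions chosen only so that the result is 11; they are not taken from the original example.

```sql
-- Built-in modulo operator
SELECT 27 % 16 AS [Modulo];                      -- 11

-- Manual calculation: 27 / 16 is an integer division in T-SQL and yields 1
SELECT 27 - ((27 / 16) * 16) AS [ManualModulo];  -- 11
```

The manual variant only works because both operands are integers. With decimal operands, the division would keep its fractional part and the formula would return 0.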
But, what is so interesting about modulo? When you have a sequential series of numbers, and you calculate modulo 4 for each number, the result starts from 0, goes up to 3 and starts over again:
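The repeating pattern can be reproduced with a short query. This is only an illustrative sketch over a hard-coded list of numbers; the alias names are made up for the example.

```sql
-- Modulo 4 over a short, hard-coded sequence of integers
SELECT [n]
      ,[n] % 4 AS [Mod4]
FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS [Numbers]([n])
ORDER BY [n];
-- Mod4 for n = 0..9: 0, 1, 2, 3, 0, 1, 2, 3, 0, 1
```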
This effect opens up a lot of possibilities. Let’s look at three examples from my work with my clients, where modulo offered an easy solution for a problem which would usually be much more complex to solve.
One of my clients asked me the following question: “How can I arbitrarily segment my customers?”
With modulo, the solution was easy: Each customer had a numeric Customer ID. By calculating modulo for each customer, I was able to easily assign each of them to an arbitrary segment. In case you don’t have a numeric ID, you can add an index column ordered by whatever column(s) you want and calculate a modulo based on this to get the segmentation:
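A minimal sketch of both variants could look like the following. The customer list, the column names and the choice of three segments are assumptions for illustration, not the client's actual data model.

```sql
-- Hypothetical customer list; names and IDs are illustrative only
WITH [Customer] AS (
    SELECT *
    FROM (VALUES
         (1001, N'Alpha')
        ,(1002, N'Bravo')
        ,(1003, N'Charlie')
        ,(1004, N'Delta')
        ,(1005, N'Echo')
    ) AS [c]([CustomerID], [CustomerName])
)
SELECT [CustomerID]
      ,[CustomerName]
      -- Variant 1: segment derived directly from the numeric ID (values 0 to 2)
      ,[CustomerID] % 3 AS [SegmentByID]
      -- Variant 2: segment derived from a generated index, for data without a numeric ID
      ,(ROW_NUMBER() OVER (ORDER BY [CustomerName]) - 1) % 3 AS [SegmentByIndex]
FROM [Customer];
```

Changing the divisor changes the number of segments. With an index that has no gaps, the segments differ in size by at most one customer.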
Another client asked me: “How can I aggregate data based on time to four segments per hour?”
The target mapping of minutes to slices was:
Minute 00–14 → 00
Minute 15–29 → 15
Minute 30–44 → 30
Minute 45–59 → 45
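One way to produce this mapping is to subtract the remainder of "minute modulo 15" from the minute itself. The query below is a sketch under that assumption; the timestamps are hard-coded sample values, and only the column name 15min_Slice comes from the original solution.

```sql
-- Hypothetical event timestamps, hard-coded for illustration
SELECT [EventTime]
      ,DATEPART(MINUTE, [EventTime]) AS [Minute]
      -- Subtracting the remainder snaps the minute down to 0, 15, 30 or 45
      ,DATEPART(MINUTE, [EventTime])
         - (DATEPART(MINUTE, [EventTime]) % 15) AS [15min_Slice]
FROM (VALUES
     (CAST('2024-01-01T10:07:00' AS datetime2(0)))
    ,(CAST('2024-01-01T10:15:00' AS datetime2(0)))
    ,(CAST('2024-01-01T10:44:00' AS datetime2(0)))
    ,(CAST('2024-01-01T10:59:00' AS datetime2(0)))
) AS [Events]([EventTime])
ORDER BY [EventTime];
-- 15min_Slice for the four rows: 0, 15, 30, 45
```

Grouping by date, hour and this slice column then yields one row per 15-minute interval.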
With each row assigned to a slice in this way, I could aggregate the data, which was delivered in real time, to 15-minute intervals by grouping on the 15min_Slice column.
By grouping the data, my client was able to reduce the amount of stored data to the 15-minute level, which was the finest granularity they needed.
When you work with Power BI, it can help to keep the number of distinct values in each column of your tables as low as possible.
Consider a list of events with a column for the values.
With a very large number of events, you may have billions or trillions of different values.
Now, you can reduce the number of distinct values with the help of modulo:
1. Define the maximum number of distinct values (for example 10’000).
2. Divide the sales amount by 10’000 and store the result as an integer.
3. Calculate the modulo by 10’000 and store the result as a decimal number.
The result is a list with a lower number of distinct values, which can be compressed much more efficiently than the original billions of different values.
To retrieve the original value, multiply the result of the division by 10’000 and add the result of the modulo.
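The arithmetic can be checked on a single value. The following sketch uses a made-up amount and the divisor 10000 from the steps above; the variable names are illustrative.

```sql
-- Hypothetical amount, split with a divisor of 10000
DECLARE @Amount  decimal(19, 4) = 123456.7800;
DECLARE @Divisor int            = 10000;

SELECT CAST(@Amount / @Divisor AS int) AS [Division]   -- 12
      ,@Amount % @Divisor              AS [Modulo]     -- 3456.7800
      -- Reverse the split: division * divisor + modulo
      ,CAST(@Amount / @Divisor AS int) * @Divisor
         + @Amount % @Divisor          AS [Restored];  -- 123456.7800
```

The conversion to int truncates the quotient, and the modulo keeps exactly the part that was cut off, so the two columns together carry the full original value.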
As I don’t have such a dataset, I used my example dataset, where this approach reduced the number of distinct values from almost 3’000 to 412 for the division and 693 for the modulo. As Power BI stores and compresses its data per column, this is a good outcome.
Here is the SQL code to calculate the values to store:
SELECT CAST([SalesAmount] / 3.0 AS int) AS [Division]
      ,[SalesAmount] % 3 AS [Modulo]
FROM [dbo].[V_FactOnlineSales];
The result looks like this:
And with this SQL code, you can retrieve the original value:
SELECT CAST(([Division] * 3.0) + [Modulo] AS decimal(10, 4)) AS [Result]
FROM [Store];
The result of this query consists of the exact same numbers as the original data:
Before you use this approach, test it carefully with your data: an incorrect data type conversion will produce wrong values when you try to retrieve the originals.
Modulo provides excellent solutions to certain problems. In each case, however, the results need to be tested and validated against expectations. The third example in particular can lead to great improvements in performance but also to wrong results if not done correctly.
Apart from that, using modulo requires a certain openness to approaching problems differently and to exploring what the operator can do.
In all three cases I have described here, modulo has definitely helped me to find simple and effective solutions to challenges that would normally require much more complex calculations.