Rolling Sums & Averages Grouped By Key
Informatica provides two rolling functions, MOVINGSUM and MOVINGAVG, which aggregate the last N rows’ values automatically. These functions have three crucial limitations:
- The first N-1 rows do not return a value
- There is no built-in way to reset them when the grouping key changes
- The parameter controlling the number of rows cannot be changed during execution
All of these limitations can be overcome. This blog post explains how.
In a recent Smartbridge project, this issue came to a head as two different ETL requirements (in completely separate sub-projects) had a rolling average component.
First, on the Finance sub-project, we were required to provide, alongside the current quarter’s financial metrics, the average of those same results over the previous year. The straightforward approach we initially favored was an unconnected lookup that returned the results by quarter, invoked three times per row. This was an adequate solution (leveraging the unconnected lookup cache, each row was read only once but used three times, once in each of the three succeeding quarters), but it still meant one more read of the source table than we liked, since all those rows were already in the pipeline anyway.
A week later, though, on the Operations sub-project, we started reading a new data source (Big Data, in fact) representing tens of thousands of data points per operational key. The requirement was to provide not just a plot of these data points, but also a basic data-fitting curve: a rolling average of the last 100 values. Our solution to the Finance issue could theoretically be scaled that far, but even a cursory examination of the effort involved made it clear that it should not be. While you could technically invoke an unconnected lookup 100 times per row with no additional reads of the source, neither the code nor, we suspected, the ETL performance would be so accommodating. We had to use rolling averages, we needed them for all rows, and the average had to restart whenever the operational key (of which there are thousands) changed.
The PowerCenter Approach
I will spare you all the promising yet ultimately flawed approaches I tried and rejected. The best solution is to combine the rolling functions with the expression transformation’s ability to retain values from the previous row. By keeping a count of the rows since the last key change, you can determine when the rolling function’s output becomes usable, and until then fall back on a running sum (or the sum divided by the row count).
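For readers outside PowerCenter, the target behavior can be sketched in plain Python (an illustrative sketch only; the function and variable names are mine): a rolling average over the last N values that restarts whenever the key changes, and that falls back to the average of the rows seen so far while the window is still filling.

```python
from collections import deque

def keyed_rolling_avg(rows, n):
    """Rolling average of the last n values, reset on each key change.

    rows: iterable of (key, value) pairs, pre-sorted by key.
    While fewer than n values of the current key have arrived, the
    average of all values seen so far for that key is returned instead.
    """
    window = deque(maxlen=n)  # automatically drops the oldest value
    prev_key = object()       # sentinel: never equal to a real key
    for key, value in rows:
        if key != prev_key:
            window.clear()    # the "reset" that MOVINGAVG cannot do
            prev_key = key
        window.append(value)
        yield key, sum(window) / len(window)
```

For example, `list(keyed_rolling_avg([("A", 1), ("A", 3), ("A", 5), ("B", 10)], 2))` yields `[("A", 1.0), ("A", 2.0), ("A", 4.0), ("B", 10.0)]`: the window restarts cleanly at the first "B" row instead of averaging across keys.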
1. Create a $$Target_Num_Rows parameter that will establish the number of rows to roll. While not strictly necessary, this will allow the value to be tweaked easily, if needed.
2. In the mapping, add a sorter that will order the rows by grouping key, and secondarily by the rolling order. In my experience, this will be the grouping key and a date. Make sure that the grouping key is above the rolling order field.
3. Following the sorter, add an expression transformation, with the following fields:
- Grouping Key input-output
- Value input (make it input-output if the raw value is also needed downstream)
- Current number of rows variable
- Current accumulated Value variable
- Rolling Value Output
- Previous Grouping Key variable (if the grouping key is complex, consider also setting up a “Key has changed” flag)
Note: This expression, unlike an aggregator, is passive, so any fields not required are best routed around it, to reduce its memory footprint.
4. The code in the expression should be:
Current number of rows:
IIF(Grouping_Key != Prev_Grouping_Key, 1, Current_Num_Rows + 1)
Current Accumulated Value:
IIF(Grouping_Key != Prev_Grouping_Key, Value, Current_Acc_Value + Value)
- If performing a Rolling Sum:
IIF(Current_Num_Rows <= $$Target_Num_Rows, Current_Acc_Value, MOVINGSUM(Value, $$Target_Num_Rows))
- If performing a Rolling Average:
IIF(Current_Num_Rows <= $$Target_Num_Rows, Current_Acc_Value / Current_Num_Rows, MOVINGAVG(Value, $$Target_Num_Rows))
Previous Grouping Key:
Grouping_Key
Note that variable ports are evaluated top to bottom, so the Previous Grouping Key port must sit below the ports that compare against it; that way, when those comparisons run, it still holds the prior row’s key.
The above set-up performs two parallel calculations. On one hand, the rolling sum (or average) produces the result you want once the minimum row threshold has been reached, with the usual PowerCenter ease of use. On the other, while waiting to reach that threshold, you are doing a regular value accumulation, resetting it whenever the key changes.
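The two-track logic can be emulated port by port in Python (again a sketch with my own names; the deque below stands in for MOVINGAVG, and, like the real function, it is never reset on a key change):

```python
from collections import deque

def expression_rolling_avg(rows, target_num_rows):
    """Port-by-port emulation of the expression transformation.

    rows: iterable of (key, value) pairs, pre-sorted by key.
    The window mimics MOVINGAVG: it always spans the last
    target_num_rows values regardless of key changes.
    """
    window = deque(maxlen=target_num_rows)
    prev_key = object()  # sentinel for the first row
    num_rows = 0
    acc_value = 0
    for key, value in rows:
        # Current number of rows (variable port)
        num_rows = 1 if key != prev_key else num_rows + 1
        # Current accumulated value (variable port)
        acc_value = value if key != prev_key else acc_value + value
        window.append(value)  # feeds the simulated MOVINGAVG
        # Rolling value (output port)
        if num_rows <= target_num_rows:
            out = acc_value / num_rows
        else:
            out = sum(window) / len(window)
        yield key, out
        # Previous grouping key (variable port, evaluated last)
        prev_key = key
```

Note that just after a key change the window still contains the previous key’s values and would give a wrong answer if read; that is exactly why the IIF routes those rows to the cumulative track, which has been freshly reset.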
With this code, there is no need to read the source more than once, significantly improving performance when averaging large sources.
Like so many algorithms, in hindsight it seems a perfectly obvious solution to the original problem. Its design hinges on realizing that the two processes have opposite weaknesses: the rolling functions are useless before reaching their threshold, and the accumulated sum cannot be used after that same threshold, since you cannot subtract values last seen 100 rows earlier. By using both in tandem, their problems cancel out and an elegant solution arises in their midst.