Today I wanted to look at an approach for producing aggregate data from multiple measurements over a source. I’m learning Kotlin at the moment so I’ll use that for the examples in this post, but we can apply the same idea to pretty much any language (I’ve used similar approaches in F#, and it would work with C# albeit with a bit more code noise). Any feedback on the approach in general and on my Kotlin-ing attempts is appreciated.
For this post we’ll consider the example of a list of
Sample values we want aggregate information for. A
Sample includes the month and year it was collected, and an integer representing the value sampled. For each set of samples we are required to show the following information:
- The total value sampled for each month and year
- The earliest sample date in this data set
- The largest individual sample collected
- A count of how many samples where within a specific range.
We could neatly get each individual bit of information by using multiple queries1, but requiring multiple iterations seems quite wasteful, especially for larger data sets. Instead we could use multiple variables, or an aggregate type containing those variables, and update each as we loop or fold over the data set:
This seems a reasonable approach to me, and we’ll take this and adapt it in an attempt to get a few additional benefits:
- Include more information about the type of calculation used for each field in the aggregate
- Enable reuse of specific calculations in other aggregates
- Enable independent testing of each calculation type
- Make it fairly simple to change existing aggregates, and to create new ones.
Representing aggregate calculations
First we’ll create a type to represent values that can be combined. We’ll use Kotlin’s
plus operator for this purpose.
We’ll steal the term “semigroup” from mathematics as its definition includes the constraints our
plus operation needs2, although we could also call it
Addable or something else if we prefer.
If you haven’t used Kotlin before, defining a
plus operator function lets us also use the
+ symbol, so
a + b will get translated to
a.plus(b). Whenever you see two semigroups being added using
+ for the remainder of this post, keep in mind it will be calling the
plus function defined by that semigroup instance. (If you don’t like co-opting
+ in this way feel free to change the interface to declare
fun combine(other: T): T) or similar.)
Next, we’ll define instances that represent sum, max, and min aggregation:
Looking at our
CandidateAggregate from earlier, we also need to handle nullable values (
earliestSampleDate: MonthYear?), as well as combining
Map<MonthYear, Int> values. Rather than building these specifically for this case, we can express these concepts more generally in terms of other semigroups, so they can be reused for different cases:
Each of these operations is implemented quite similarly to the code we used for each field in
CandidateAggregate, but now we can reuse them for different aggregates, as well as test each in isolation. The cost is we have now spread this code across more types.
We can also write some general functions,
concatMap, to combine any list of
Semigroup<T> values into a single
Semigroup<T> value, effectively combining aggregates3. Here is an example of how to define and use these functions (as well as an example of testing
Max in isolation):
Using our aggregation types
Now we can rewrite
CandidateAggregate using our aggregation types:
The type of aggregation used appears explicitly for each field in
Aggregate. For example
largestSample: Max<Int> conveys both the type of the result (
Int), as well as the process being used to calculated it (
CandidateAggregate only the former was expressed. We also build some field types by composing semigroups, such as
Mapped<MonthYear, Sum>, which specifies we will be adding values using
Sum rather than some other approach. This also makes it very simple to update the method of aggregation (as illustrated below).
We have made
Aggregate itself a semigroup to define how we combine these composite aggregates. We’ve also added an
empty property to make it easier to call
The last piece we need is to translate a single
Sample into an
Aggregate, then we can do the entire aggregation using
concatMap as shown in the
aggregateSamples() test. Each
Sample gets transformed into an
Aggregate representing that individual sample (an aggregate of 1), then each
Aggregate in turn gets combined to calculate the required information across all the samples.
What have we gained for the price?
This definitely has more pieces that the
CandidateAggregate version (although the code for each piece has not changed much, it is now spread over multiple types). More pieces suggest a performance impact, but I have not measured this.
We do get a few benefits for this price. Firstly, we now have some small, simple, genuinely reusable aggregation types (
Mapped etc.). These can be combined into other aggregates, and they can be tested in isolation. Secondly, we explicitly define aggregate types in terms of the aggregates of which they are composed. We don’t have an aggregate that contains an
Int, we have a
Sum or a
Max<Int> which conveys more information as to the aggregation process, as well as preventing errors (summing two
Int values that should have been combined using
maxOf for example).
We also make it simpler to change our aggregation. For example, if we wanted to change from reporting the total value to the maximum value for each month, we can change
Mapped<MonthYear, Sum> to
Mapped<MonthYear, Max<Int>> and the aggregation process will adjust accordingly.
We introduced a
Semigroup<T> interface which represents values that can be combined with an associative, binary operation. We also introduced
concatMap operations that work for any instance of this interface. We created
Mapped instances of this interface to represent common methods of aggregation, then built a custom
Aggregate semigroup composed of some of these instances.
This is a bit more complex compared than manually aggregating a set of values over a loop or fold, but in return gives us reusable and testable aggregate types, more communicative types for our aggregate model, less opportunities for bugs in the aggregation process, as well as making the creation of new aggregates and modifications to existing aggregates simpler.
- Understanding monoids, a three part series on monoids (a special case of semigroup) by Scott Wlaschin at the excellent F# for fun and profit site.
Example of multiple queries:↩
A semigroup for a type
Tconsists of a closed binary operation
T -> T -> Tthat is also associative (i.e.
a + (b + c) == (a + b) + c). This associativity constraint means we can combine and compose these values fairly flexibly. For example, we can do
a + b + c, without having to worry about wether
bis itself a composite of
x + y, as associativity guarantees
a + (x + y) + cis the same as
((a + x) + y) + c. We can’t do the same thing with non-associative operations like subtraction:
100 - (30 - 10) - 5 /= ((100 - 30) - 10) - 5 75 /= 55
The end result is we can use associativity to combine values without having to also take evaluation order into account.↩
empty: Tvalue for cases where the
itemslists are empty. We could use a
Monoidconstraint instead of
Semigroup, which adds the concept of an empty identity element, but I found this messy to implement in Kotlin.↩