Today I wanted to look at an approach for producing aggregate data from multiple measurements over a source. I’m learning Kotlin at the moment so I’ll use that for the examples in this post, but we can apply the same idea to pretty much any language (I’ve used similar approaches in F#, and it would work with C# albeit with a bit more code noise). Any feedback on the approach in general and on my Kotlin-ing attempts is appreciated.
Motivating example
For this post we’ll consider the example of a list of Sample
values we want aggregate information for. A Sample
includes the month and year it was collected, and an integer representing the value sampled. For each set of samples we are required to show the following information:
- The total value sampled for each month and year
- The earliest sample date in this data set
- The largest individual sample collected
- A count of how many samples where within a specific range.
Initial attempts
We could neatly get each individual bit of information by using multiple queries1, but requiring multiple iterations seems quite wasteful, especially for larger data sets. Instead we could use multiple variables, or an aggregate type containing those variables, and update each as we loop or fold over the data set:
This seems a reasonable approach to me, and we’ll take this and adapt it in an attempt to get a few additional benefits:
- Include more information about the type of calculation used for each field in the aggregate
- Enable reuse of specific calculations in other aggregates
- Enable independent testing of each calculation type
- Make it fairly simple to change existing aggregates, and to create new ones.
Representing aggregate calculations
First we’ll create a type to represent values that can be combined. We’ll use Kotlin’s plus
operator for this purpose.
We’ll steal the term “semigroup” from mathematics as its definition includes the constraints our plus
operation needs2, although we could also call it Combinable
or Addable
or something else if we prefer.
If you haven’t used Kotlin before, defining a plus
operator function lets us also use the +
symbol, so a + b
will get translated to a.plus(b)
. Whenever you see two semigroups being added using +
for the remainder of this post, keep in mind it will be calling the plus
function defined by that semigroup instance. (If you don’t like co-opting +
in this way feel free to change the interface to declare fun combine(other: T): T)
or similar.)
Next, we’ll define instances that represent sum, max, and min aggregation:
Looking at our CandidateAggregate
from earlier, we also need to handle nullable values (earliestSampleDate: MonthYear?
), as well as combining Map<MonthYear, Int>
values. Rather than building these specifically for this case, we can express these concepts more generally in terms of other semigroups, so they can be reused for different cases:
Each of these operations is implemented quite similarly to the code we used for each field in CandidateAggregate
, but now we can reuse them for different aggregates, as well as test each in isolation. The cost is we have now spread this code across more types.
We can also write some general functions, concat
and concatMap
, to combine any list of Semigroup<T>
values into a single Semigroup<T>
value, effectively combining aggregates3. Here is an example of how to define and use these functions (as well as an example of testing Sum
and Max
in isolation):
Using our aggregation types
Now we can rewrite CandidateAggregate
using our aggregation types:
The type of aggregation used appears explicitly for each field in Aggregate
. For example largestSample: Max<Int>
conveys both the type of the result (Int
), as well as the process being used to calculated it (Max
). In CandidateAggregate
only the former was expressed. We also build some field types by composing semigroups, such as Mapped<MonthYear, Sum>
, which specifies we will be adding values using Sum
rather than some other approach. This also makes it very simple to update the method of aggregation (as illustrated below).
We have made Aggregate
itself a semigroup to define how we combine these composite aggregates. We’ve also added an empty
property to make it easier to call concat
and concatMap
.
The last piece we need is to translate a single Sample
into an Aggregate
, then we can do the entire aggregation using concatMap
as shown in the aggregateSamples()
test. Each Sample
gets transformed into an Aggregate
representing that individual sample (an aggregate of 1), then each Aggregate
in turn gets combined to calculate the required information across all the samples.
What have we gained for the price?
This definitely has more pieces that the CandidateAggregate
version (although the code for each piece has not changed much, it is now spread over multiple types). More pieces suggest a performance impact, but I have not measured this.
We do get a few benefits for this price. Firstly, we now have some small, simple, genuinely reusable aggregation types (Sum
, Max
, Min
, Mapped
etc.). These can be combined into other aggregates, and they can be tested in isolation. Secondly, we explicitly define aggregate types in terms of the aggregates of which they are composed. We don’t have an aggregate that contains an Int
, we have a Sum
or a Max<Int>
which conveys more information as to the aggregation process, as well as preventing errors (summing two Int
values that should have been combined using maxOf
for example).
We also make it simpler to change our aggregation. For example, if we wanted to change from reporting the total value to the maximum value for each month, we can change Mapped<MonthYear, Sum>
to Mapped<MonthYear, Max<Int>>
and the aggregation process will adjust accordingly.
Conclusion
We introduced a Semigroup<T>
interface which represents values that can be combined with an associative, binary operation. We also introduced concat
and concatMap
operations that work for any instance of this interface. We created Sum
, Max
, Min
, Nullable
and Mapped
instances of this interface to represent common methods of aggregation, then built a custom Aggregate
semigroup composed of some of these instances.
This is a bit more complex compared than manually aggregating a set of values over a loop or fold, but in return gives us reusable and testable aggregate types, more communicative types for our aggregate model, less opportunities for bugs in the aggregation process, as well as making the creation of new aggregates and modifications to existing aggregates simpler.
Suggested reading
- Understanding monoids, a three part series on monoids (a special case of semigroup) by Scott Wlaschin at the excellent F# for fun and profit site.
Example of multiple queries:
↩val minDate = samples.map { it.date }.min() val maxSample = samples.map { it.value }.max() val inRangeCount = samples.count { (100..200).contains(it.value) }
A semigroup for a type
T
consists of a closed binary operationT -> T -> T
that is also associative (i.e.a + (b + c) == (a + b) + c
). This associativity constraint means we can combine and compose these values fairly flexibly. For example, we can doa + b + c
, without having to worry about wetherb
is itself a composite ofx + y
, as associativity guaranteesa + (x + y) + c
is the same as((a + x) + y) + c
. We can’t do the same thing with non-associative operations like subtraction:100 - (30 - 10) - 5 /= ((100 - 30) - 10) - 5 75 /= 55
The end result is we can use associativity to combine values without having to also take evaluation order into account.↩
Both
concat
andconcatMap
take anempty: T
value for cases where theitems
lists are empty. We could use aMonoid
constraint instead ofSemigroup
, which adds the concept of an empty identity element, but I found this messy to implement in Kotlin.↩