Group By

I recently came across a situation in javascript where I needed to do a “pivot” on some data, which amounted to needing the GROUP BY functionality in SQL, or groupBy which is provided as standard in multiple languages like Kotlin. Javascript doesn’t have that function out of the box and writing it ended up being an interesting little project to understand the nuances of that transformation. I’m going to step through it as simply as possible here to outline what I mean.

In the simplest terms, the shape of the problem is to “pivot” a data structure around a key. It’s a really common pattern with data…so common, in fact, that at first I was a little taken aback when I was building it that it wasn’t simpler. The basic problem I had was: take a list of categories that each have an array of month, value pairs and show me how much value was in each month regardless of category.

This was the simple test I started with, so you can see the starting and expected objects:

it('group by month is correct', () => {
    const categories = [
        {
            name: "Category One",
            values: [
                {month: "Jan", value: "5"},
                {month: "Feb", value: "10"},]
        },
        {
            name: "Category Two",
            values: [
                {month: "Jan", value: "10"},
                {month: "Feb", value: "12"},]
        },
    ];
    const actual = categoryTotals(categories);
    const expected = [{month: 'Jan', value: 15}, {month: 'Feb', value: 22}];
    expect(actual).toStrictEqual(expected);
})

In even simpler languages, I started with a list of categories with month/value data and I needed to see a list of months with the sum of all it’s relevant values. Pretty normal use case, right?

Below is the code that I ended up with, which is really three separate steps.

  1. First, aggregate all the values into one structure, in this case from the test [{"Jan",5},{"Feb",10},{"Jan",10},{"Feb",12}]. The idea here is to make it all one thing, flattened with just the data we want so it’s easy to work with.

  2. Second, a groupBy generic function that takes all the categories and groups them by month as one structure.

  3. Third, a reduce over the entries in the map returned from groupBy to sum the values in each month into one sum.

const groupBy = (items, extractKey) => {
    return items.reduce((a, v) => {
        const k = extractKey(v);
        return {...a, [k]: (a[k] || []).concat(v)};
    }, {});
}

export const categoryTotals = (cat) => {
    const aggregate_values = cat.reduce((a,sc)=> a.concat(sc.values), []);

    return Object.entries(groupBy(aggregate_values, (v) => v.month))
        .reduce((result, [k, v]) => result.concat({
            month: k,
            value: v.reduce((a, n) => a + Number(n.value), 0)
        }), []);
}

One thing you might notice is that groupBy here as a function takes two parameters, the array of items to work with and a extractKey function. This format was suggested to me by a more experienced engineer and I immediately saw how it was superior to the way I was trying to initially write this function. By passing in extractKey to the groupBy function, then groupBy doesn’t need to know anything about the internal structure of the array it’s iterating over. We simply give it the methodology for getting the key we want to use.

So in this case, you can see I use an anonymous function (v) => v.month to extract the “month” value. If my data structure were more complex, I could have a more complex extractKey without having to refactor anything.

For someone like me who started on this engineering path writing a whole bunch of raw sql into my code (remember writing code like this??) it was sort of shocking how much complexity was wrapped up in the SQL version of this same transformation idea. In SQL this same idea might look like, if you had a (fictional) table called Subcategories with month and value columns.

SELECT SUM(value), Month
FROM Subcategories
GROUP BY Month;

Pretty elegant language, right? Since the last half-decade of working with mostly JSON formatted data coming through a REST API (or, lately, graphQL), getting a straightforward rowset back from a database feels magically intuitive.

If javascript did have a native groupBy, it might look something like the one in Kotlin. Although here the data structures look a bit different, it’s the same concept applied. Here there are a list of four Pair instances that are a tuple of month, value pairs. The Pair has no semantic in this case and is just a way to group related data, so in the “real” world this is much more likely to be a data class Category with a name and an array of months…but for the sake of example this is clearer.

Notice how you would still need the sum step after grouping. Here we end up with a List of Lists of Pairs after we group, so it would still be necessary to iterate over the entries in byMonth and reduce (and then again) to get each month to have a sum in it.

val categories = listOf(
    Pair("Jan", 5), 
    Pair("Jan",10), 
    Pair("Feb", 10), 
    Pair("Feb",12)
)
val byMonth = categories
    .groupBy { it.first } // groups by pair first value, e.g. month

println(byMonth.keys) // ["Jan", "Feb"]
println(byMonth.values) // [[(Jan, 5), (Jan, 10)], [(Feb, 10), (Feb, 12)]]

One other thing to note about all of this in my javascript implementation (which could use some better error handling cough) is that I have chosen here to use immutable structures and functional programming. In simpler terms, I concat onto the array instead of pushing. For me, the safety of not mutating the structures feels like the right choice and I didn’t need to worry about performance in the context I was in. It was more important to not leak out a returned value that wasn’t guaranteed to be immutable or accidentally mutate the passed-in array.

However, making all these copies is definitely more memory-intensive and slower. If I were building a groupBy that was going to be used in unknown contexts as a library method, I would reimplement to optimize for performance and keep the making-copies to a minimum. It would be a rather large bummer to pass in a massive array to groupBy and have the whole system slow to a crawl or crash as the system ran out of memory.