Sunday, December 16, 2012

DAX Groupers: SUMMARIZE and AGGX(VALUES())

Groupers are your friends
Sometimes you need to perform what I'd call a multi-pass aggregation: rolling data up to a coarser grain and then performing additional calculations on the result.  Basically, you need the equivalent of SQL's GROUP BY in DAX, followed by some additional calcs.

Luckily DAX has a few Groupers that can handle this quite nicely.


The other day I helped someone in the forums with a grouping question like this, and he asked for some resources to learn more about this useful technique.  So here's my spin on it from a PowerPivot expression perspective.

Ok, so let's work through this with a demo scenario.  We'll use AdventureWorks since everybody knows that dataset.  Let's take Internet Sales (the FactInternetSales table in AdventureWorksDW).  We've got sales data at the line item level (i.e., the individual items purchased within a given shopping basket or ticket).  And we want to analyze average sales over time or sliced by segment.  But we want to analyze at the ticket level, not the item level.  So a regular Average won't work here.  Our challenge then is to roll up sales to the ticket level (sum it up by ticket), and then average across tickets.
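
To make the gap concrete, here is a minimal sketch of the "regular Average" that falls short.  The numbers are made up for illustration and are not from AdventureWorks: say ticket SO1 has two lines of 10 and 30, and ticket SO2 has one line of 20.  The line-item average is (10 + 30 + 20) / 3 = 20, while the ticket-level average we are after is (40 + 20) / 2 = 30.

```dax
=AVERAGE( FactInternetSales[SalesAmount] )  // averages individual line items, not tickets
```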

This is a very common requirement, and one that is easily satisfied in DAX.  There are actually several ways to accomplish this, so let's walk through them one at a time.

Approach #1: Total Sum of Sales / Distinct Tickets
This is probably the simplest approach and actually doesn't do any grouping.  I'm just including it to be thorough in solving the scenario at hand.  Anyway, it works in this scenario, but you won't always be able to calculate in 1 pass like this.  Regardless, here's the formula for reference.

=SUM( FactInternetSales[SalesAmount] )
   / DISTINCTCOUNT( FactInternetSales[SalesOrderNumber] )


Approach #2: AggX(VALUES)
With the next two approaches, instead of rolling all the way up like in #1, we want to take a two-step approach.  We group by the "ticket" and sum up the underlying item sales amounts, and then calculate an average across all of the tickets.

Here's the actual formula for our scenario, and then we'll break it into its components.

=AVERAGEX( VALUES( FactInternetSales[SalesOrderNumber] )
          ,CALCULATE( SUM( FactInternetSales[SalesAmount] ) )
         )

There are three key parts to this formula.
1. The VALUES function behaves like a subquery and returns a list of the distinct SalesOrderNumbers that we want to group by (the list of tickets).
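2. The CALCULATE(SUM()) defines how we will roll SalesAmount up to the ticket level (we're grouping by SalesOrderNumber here).  The CALCULATE wrapper matters: it turns the SalesOrderNumber currently being iterated into a filter, so the SUM only sees that ticket's rows.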
3. The AVERAGEX is the outer function that iterates over each of the items in the list (each SalesOrderNumber or "ticket").  For each "ticket", it executes the CALCULATE(SUM) to roll SalesAmount up to the ticket level.  And then it calculates the average of those values for the final result.
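
To see why the CALCULATE is there, it may help to compare a version without it.  This is only a sketch, and the measure names (No Transition, Total Sales, Avg Ticket) are hypothetical, not from the original workbook.  Without CALCULATE, the SUM ignores the row being iterated and returns the same total for every ticket, so the "average" is just total sales for whatever filters are in play.  Referencing a measure has the same effect as wrapping the expression in CALCULATE, so the last measure should behave like the formula above.

```dax
// Pitfall: no CALCULATE, so no context transition.
// Every iteration returns the same total, and the average equals that total.
No Transition := AVERAGEX( VALUES( FactInternetSales[SalesOrderNumber] )
                          ,SUM( FactInternetSales[SalesAmount] )
                         )

// A measure reference is implicitly wrapped in CALCULATE,
// so [Total Sales] is evaluated once per ticket here.
Total Sales := SUM( FactInternetSales[SalesAmount] )
Avg Ticket  := AVERAGEX( VALUES( FactInternetSales[SalesOrderNumber] )
                        ,[Total Sales]
                       )
```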

Approach #3: AggX(SUMMARIZE)
Essentially, this approach accomplishes the same thing as approach #2 above.  But instead of using VALUES to get the list of SalesOrderNumbers, we'll use SUMMARIZE to do the grouping. 

The SUMMARIZE function works like this:

SUMMARIZE(<table>, <groupBy_columnName>[, <groupBy_columnName>]…[, <name>, <expression>]…)

The first parameter is the table you want to roll up (FactInternetSales)
The second parameter is the column to group by (FactInternetSales[SalesOrderNumber])
Optionally, you can group by additional columns.  And you can add aggregation expressions, similar to a GROUP BY in SQL.  

Those last two optional parts give SUMMARIZE an extra bit of flexibility that can come in handy when you need more than just a single-column list back.  Anyway, probably easier with an example, so here's our formula:

=AVERAGEX(
    SUMMARIZE( FactInternetSales
              ,FactInternetSales[SalesOrderNumber]
             )
   ,CALCULATE( SUM( FactInternetSales[SalesAmount] ) )
)


In this formula, the inner SUMMARIZE is working over the FactInternetSales table and grouping by SalesOrderNumber.  The result is a list of the distinct SalesOrderNumbers we need (just like VALUES produced).  With that list, the outer AVERAGEX iterates through each one, calculating the sum of the underlying SalesAmounts, and then calculating an average across all of the ticket-level sums.

Another approach that leverages SUMMARIZE, and that can be useful in a variety of situations, is to perform aggregations while grouping and return both the grouping columns and the aggregations to the outer function.  So in our scenario, we can actually move the CALCULATE(SUM()) that rolls item-level sales up to the ticket level inside the SUMMARIZE.  Like this:

=AVERAGEX(
    SUMMARIZE( FactInternetSales
              ,FactInternetSales[SalesOrderNumber]
              ,"SumTicket"
              ,CALCULATE( SUM( FactInternetSales[SalesAmount] ) )
             )
   ,[SumTicket]
)


With that, the inner SUMMARIZE behaves like a SQL subquery that returns, not only the list of tickets (SalesOrderNumbers), but also the rolled-up SUM of item-level SalesAmounts for each ticket.  Then with that two-column table, the outer AVERAGEX iterates over each row, picking up the values for SumTicket and then calculates an average over the list for the final result.
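
If you want to look at that two-column table directly, one option is to run the inner SUMMARIZE as a DAX query.  This is a sketch that assumes you have a tool that can run DAX queries against the model (DAX Studio, for example); it is a query, not a measure, so it won't work in the measure dialog.

```dax
// DAX query: returns one row per ticket with its rolled-up sales amount.
EVALUATE
SUMMARIZE( FactInternetSales
          ,FactInternetSales[SalesOrderNumber]
          ,"SumTicket"
          ,CALCULATE( SUM( FactInternetSales[SalesAmount] ) )
         )
ORDER BY FactInternetSales[SalesOrderNumber]
```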

Pretty straightforward.  The only twist is that, within the SUMMARIZE, you give the aggregate expression an "alias" (in our case "SumTicket").  So the table that SUMMARIZE returns includes that column and you can reference it by name in the outer AggX function.  Not really tricky, but IntelliSense doesn't pick it up, so you have to make sure you type the name correctly.
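
A variation you may also come across, shown here only as a sketch and not as part of the approaches above, is to let SUMMARIZE do just the grouping and add the aggregate with ADDCOLUMNS.  The alias works the same way, and for this scenario the result should match the formula above.

```dax
=AVERAGEX(
    ADDCOLUMNS(
        SUMMARIZE( FactInternetSales
                  ,FactInternetSales[SalesOrderNumber]
                 )                                          // grouping only
       ,"SumTicket"
       ,CALCULATE( SUM( FactInternetSales[SalesAmount] ) )  // evaluated once per ticket
    )
   ,[SumTicket]
)
```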

These grouping techniques (#2 and #3) can be absolutely critical in situations where you have data at different granularities that you need to aggregate in a multi-pass fashion like above.  Or in situations where you need to grab a date or value from each segment in your dataset.

Just to give you some more ideas for applying this approach, here is another example that I recently came up against in the forums. 

The requirement was for a fitness training club, and they needed to calculate the number of sessions attended and the number of sessions purchased summarized for each customer and across all customers.  The shape of the data was like this:

CustomerID  CoachingSession  SessionsPurchased
100         1                3
100         2                3
100         3                3
100         4                3
101         1                1
101         2                1

And the goal output was:
 
Clients  SessionsPurchased
2        4

Essentially, we have a fact table with session attendance transactions.  The real challenge was with the sessions purchased, since they are pivoted out and duplicated on each session attendance row.

Given time and resources, I would recommend breaking the sessions purchased data out into a separate fact table.  This would be the proper way to model it dimensionally, since the two events are independent and occur at different grains.

But for this exercise we'll deal with the data in the shape it's in.  

In this scenario, we can use the SUMMARIZE function to group by the customer and get the MAX sessions purchased for each like this:

=SUMX(
    SUMMARIZE( Coaching
              ,Coaching[CustomerID]
              ,"MaxPurchased"
              ,MAX( Coaching[SessionsPurchased] )
             )
   ,[MaxPurchased]
)
 
Or if you don't want to use SUMMARIZE (whether it's preference or perhaps you are stuck using PowerPivot v1), you can get the distinct list of customers to group by using VALUES like this:

=SUMX(
    VALUES( Coaching[CustomerID] )
   ,CALCULATE( MAX( Coaching[SessionsPurchased] ) )
)
 
In both cases, notice that the measure uses an outer SUMX to iterate through each of the rows from the inner grouping function and then sums the results of those iterations.  Ultimately, we've removed the duplicate values and simply summed the sessions purchased at the customer level.
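
For the sample rows above, that works out to a max of 3 for customer 100 plus a max of 1 for customer 101, for a total of 4.  The other figures in the requirement could be sketched along these lines.  The measure names are hypothetical, and the attendance count assumes each row in Coaching represents exactly one attended session.

```dax
// Number of distinct clients in the current filter context.
Clients := DISTINCTCOUNT( Coaching[CustomerID] )

// Assumption: one row per attended session, so a row count is the attendance count.
Sessions Attended := COUNTROWS( Coaching )
```

If you are on PowerPivot v1, where DISTINCTCOUNT isn't available, COUNTROWS( VALUES( Coaching[CustomerID] ) ) should give the same client count.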

That's it for now.  Happy holidays!
 

3 comments:

  1. I can't tell you how much this helped me out. Thanks for the post!

    1. Thanks for the feedback Elaine. Glad it was useful.

  2. nice article, Brent -- very helpful -- keep up the great work!!!
