Avoiding double-counting in aggregates with more than one join?

Paul Jungwirth <pj@xxxxxxxxxxxxxxxxxxxxxxxx> · Fri, 18 Nov 2016 09:16:23 -0800

Hi All,

I've noticed in the past that doing aggregates while joining to more 
than one table can sometimes give you unintended results. For example, 
suppose I have three tables: products, sales, and resupplies. In sales I 
track what I sell, and in resupplies I track my own purchases to 
increase inventory. Both have a foreign key to products. Now I want to 
run a report showing the total dollars sold for each product versus the 
total dollars spent for each product. I could try this:

    SELECT  p.id,
            SUM(s.price * s.qty) AS total_sold,
            SUM(r.price * r.qty) AS total_spent
    FROM    products p
    LEFT OUTER JOIN sales s
    ON      s.product_id = p.id
    LEFT OUTER JOIN resupplies r
    ON      r.product_id = p.id
    GROUP BY p.id
    ;

That seems pretty safe, but actually I get bad answers,
for example if some product has this data:

    sales
    -----
    sold 1 @ $2/ea

    resupplies
    ----------
    bought 1 @ $1/eq
    bought 2 @ $1/ea

Then pre-grouping I have this:

    p.id | s.qty | s.price | r.qty | r.price
    -----+-------+---------+-------+--------
       1 |     1 |      $2 |     1 |      $1
       1 |     1 |      $2 |     2 |      $1

You can see the problem is that I'm going to double-count my sales.
What I really want is this:

    p.id | s.qty | s.price | r.qty | r.price
    -----+-------+---------+-------+--------
       1 |     1 |      $2 |     1 |      $1
       1 |       |         |     2 |      $1

In the past I've always handled these situations by aggregating each 
table separately
and only then joining things together:

    WITH
    s AS (
      SELECT  product_id,
              SUM(price * qty) AS total_sold
      FROM    sales
      GROUP BY product_id) s
    ),
    r AS (
      SELECT  product_id,
              SUM(price * qty) AS total_spent
      FROM    resupplies
      GROUP BY product_id) r
    )
    SELECT  p.id,
            COALESCE(s.total_sold, 0),
            COALESCE(r.total_spent, 0)
    FROM    products p
    LEFT OUTER JOIN s
    ON      s.product_id = p.id
    LEFT OUTER JOIN r
    ON      r.product_id = p.id
    ;

Since I've guaranteed that each CTE includes at most one row per product,
this is safe from double-counting errors.

But is there a better way? My approach feels verbose
and harder to read. Also you have to type COALESCE a lot. :-)
Is there some different way of doing things I haven't thought of yet?

Also I wonder about the performance merging all these subqueries together.
Would the final merging be any faster if I had an ORDER BY in each CTE?

It seems like this pattern comes up a lot;
what have others done about it?

Thanks,
Paul

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general