I fell on a very interesting jOOQ question on Stack Overflow which required the calculation of a weighted average. Why is that.
description of the problem
Assuming you have this database (using PostgreSQL syntax):
create table transactions ( id bigint not null primary key, lines bigint not null, price numeric(18, 2) not null, profit numeric(18, 2) not null ); create table lines ( id bigint not null primary key, transaction_id bigint not null references transactions, total bigint not null, quantity bigint not null, profit numeric(18, 2) not null );
As we can see, this diagram is slightly denormalized because the number of rows per transaction is precomputed in the transactions.lines
column. This will prove to be very useful for this calculation, but it is not strictly necessary.
Now, in the previously linked Stack Overflow question, a report was desired which would calculate:
- An aggregation of the sums provided by the posts
- An aggregation of the averages provided by the transactions
It would be simple with two separate requests:
Amounts provided by posts
SELECT sum(profit) AS total_profit, sum(total) AS total_sales_amount, sum(quantity) AS total_items_sold FROM lines
Averages provided by transactions
SELECT avg(lines) AS avg_items_p_trx, avg(price) AS avg_price_p_trx, avg(profit) AS avg_profit_p_trx FROM transactions
So far, so good.
Do it in one request
Now these queries are simplified compared to the original, which had to join the two tables in order to add additional predicates. Suppose also that these tables are quite large. Therefore, running two queries may cause the report to be sluggish. A single request would be much better.
We could try to simply combine the two:
-- Wrong query SELECT sum(l.profit) AS total_profit, sum(l.total) AS total_sales_amount, sum(l.quantity) AS total_items_sold, avg(t.lines) AS avg_items_p_trx, avg(t.price) AS avg_price_p_trx, avg(t.profit) AS avg_profit_p_trx FROM lines AS l JOIN transactions AS t ON t.id = l.transaction_id
But this request is false. While the sums are always correct, the averages are not, simply because the join produces duplicate transaction rows per row. Imagine a transaction with 3 or 5 lines:
SELECT l.id AS line_id, t.id AS transaction_id, t.lines, t.price FROM lines AS l JOIN transactions AS t ON t.id = l.transaction_id
The output would be:
LINE_ID TRANSACTION_ID LINES PRICE ------------------------------------------- 1 1 3 20.00 2 1 3 20.00 3 1 3 20.00 4 2 5 100.00 4 2 5 100.00 4 2 5 100.00 4 2 5 100.00 4 2 5 100.00
- The average number of “avg_items_p_trx” lines should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate
avg(t.lines)
on the set of data, we obtain 4.25 (3 × 3 lines + 5 × 5 lines) / 8 items. - The average price “avg_price_p_trx” must be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate
avg(t.price)
on the data set, we get 80.00 (3 × 20.00 + 5 × 100.00) / 8 elements.
How can this be resolved?
Since every transaction is duplicated due to the join with rows, we need to calculate a weighted average, not an ordinary average. The idea is that instead of using the AVG()
aggregate function, now we need to divide the value that we want to average by the number of elements (i.e. the number of times the value is repeated because of the join) and then divide the sum of this division by the number of transactions.
Prose never describes logic well, so let’s code. The correct query is:
SELECT sum(l.profit) AS total_profit, sum(l.total) AS total_sales_amount, sum(l.quantity) AS total_items_sold, sum(t.lines / t.lines) / count(DISTINCT t.id) avg_items_p_trx, sum(t.price / t.lines) / count(DISTINCT t.id) avg_price_p_trx, sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx FROM lines AS l JOIN transactions AS t ON t.id = l.transaction_id
With the above data set:
LINE_ID TRANSACTION_ID LINES LINES/LINES PRICE PRICE/LINES ---------------------------------------------------------------- 1 1 3 1 20.00 6.66 2 1 3 1 20.00 6.66 3 1 3 1 20.00 6.66 4 2 5 1 100.00 20.00 4 2 5 1 100.00 20.00 4 2 5 1 100.00 20.00 4 2 5 1 100.00 20.00 4 2 5 1 100.00 20.00
We now get the correct weighted averages:
- The average number of “avg_items_p_trx” lines is now 4 =
(3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / separate operations - The average price “avg_price_p_trx” is now 60.00 =
(20.00 / 3 + 20.00 / 3 + 20.00 / 3 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5) / 2 separate operations
Note that “avg_items_p_trx” can be simplified:
SELECT sum(l.profit) AS total_profit, sum(l.total) AS total_sales_amount, sum(l.quantity) AS total_items_sold, count(*) / count(DISTINCT t.id) avg_items_p_trx, sum(t.price / t.lines) / count(DISTINCT t.id) avg_price_p_trx, sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx FROM lines AS l JOIN transactions AS t ON t.id = l.transaction_id
Made!
Standardized version
Note that this solution took advantage of the fact that the number of rows per transaction was pre-calculated. We can of course also calculate it on the fly, for example using window functions. If it wasn’t available, we could do it like this:
SELECT sum(l.profit) AS total_profit, sum(l.total) AS total_sales_amount, sum(l.quantity) AS total_items_sold, count(*) / count(DISTINCT t.id) avg_items_p_trx, sum(t.price / l.lines) / count(DISTINCT t.id) avg_price_p_trx, sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx FROM ( SELECT l.*, count(*) OVER (PARTITION BY l.transaction_id) lines FROM lines AS l ) AS l JOIN transactions AS t ON t.id = l.transaction_id
Or, we turn the entire join into a 1: 1 relationship by pre-aggregating all row data into one row per transaction. This works because we are only calculating the sums from the rows table:
SELECT sum(l.profit_per_transaction) AS total_profit, sum(l.total_per_transaction) AS total_sales_amount, sum(l.quantity_per_transaction) AS total_items_sold, avg(l.lines_per_transaction) AS avg_items_p_trx, avg(t.price) AS avg_price_p_trx, avg(t.profit) AS avg_profit_p_trx FROM ( SELECT l.transaction_id sum(l.profit) AS profit_per_transaction, sum(l.total) AS total_per_transaction, sum(l.quantity) AS quantity_per_transaction, count(*) AS lines_per_transaction FROM lines AS l GROUP BY l.transaction_id ) AS l JOIN transactions AS t ON t.id = l.transaction_id