Calculation of weighted averages when joining tables in SQL – Java, SQL and jOOQ.

I show You how To Make Huge Profits In A Short Time With Cryptos!

I fell on a very interesting jOOQ question on Stack Overflow which required the calculation of a weighted average. Why is that.

description of the problem

Assuming you have this database (using PostgreSQL syntax):


create table transactions (
  id     bigint         not null primary key,
  lines  bigint         not null,
  price  numeric(18, 2) not null,
  profit numeric(18, 2) not null
);

create table lines (
  id             bigint         not null primary key,
  transaction_id bigint         not null references transactions,
  total          bigint         not null,
  quantity       bigint         not null,
  profit         numeric(18, 2) not null
);

As we can see, this diagram is slightly denormalized because the number of rows per transaction is precomputed in the transactions.lines column. This will prove to be very useful for this calculation, but it is not strictly necessary.

Now, in the previously linked Stack Overflow question, a report was desired which would calculate:

  • An aggregation of the sums provided by the posts
  • An aggregation of the averages provided by the transactions

It would be simple with two separate requests:

Amounts provided by posts


SELECT
  sum(profit)   AS total_profit,
  sum(total)    AS total_sales_amount,
  sum(quantity) AS total_items_sold
FROM lines

Averages provided by transactions


SELECT
  avg(lines)  AS avg_items_p_trx,
  avg(price)  AS avg_price_p_trx,
  avg(profit) AS avg_profit_p_trx
FROM transactions

So far, so good.

Do it in one request

Now these queries are simplified compared to the original, which had to join the two tables in order to add additional predicates. Suppose also that these tables are quite large. Therefore, running two queries may cause the report to be sluggish. A single request would be much better.

We could try to simply combine the two:


-- Wrong query
SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  avg(t.lines)    AS avg_items_p_trx,
  avg(t.price)    AS avg_price_p_trx,
  avg(t.profit)   AS avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

But this request is false. While the sums are always correct, the averages are not, simply because the join produces duplicate transaction rows per row. Imagine a transaction with 3 or 5 lines:


SELECT
  l.id    AS line_id,
  t.id    AS transaction_id,
  t.lines,
  t.price
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

The output would be:


LINE_ID    TRANSACTION_ID    LINES    PRICE
-------------------------------------------
1          1                 3        20.00
2          1                 3        20.00
3          1                 3        20.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
  • The average number of “avg_items_p_trx” lines should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate avg(t.lines) on the set of data, we obtain 4.25 (3 × 3 lines + 5 × 5 lines) / 8 items.
  • The average price “avg_price_p_trx” must be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate avg(t.price) on the data set, we get 80.00 (3 × 20.00 + 5 × 100.00) / 8 elements.

How can this be resolved?

Since every transaction is duplicated due to the join with rows, we need to calculate a weighted average, not an ordinary average. The idea is that instead of using the AVG() aggregate function, now we need to divide the value that we want to average by the number of elements (i.e. the number of times the value is repeated because of the join) and then divide the sum of this division by the number of transactions.

Prose never describes logic well, so let’s code. The correct query is:


SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  sum(t.lines  / t.lines) / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

With the above data set:


LINE_ID  TRANSACTION_ID  LINES  LINES/LINES   PRICE  PRICE/LINES
----------------------------------------------------------------
1        1               3      1             20.00         6.66
2        1               3      1             20.00         6.66
3        1               3      1             20.00         6.66
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00

We now get the correct weighted averages:

  • The average number of “avg_items_p_trx” lines is now 4 =
    (3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / separate operations
  • The average price “avg_price_p_trx” is now 60.00 =
    (20.00 / 3 + 20.00 / 3 + 20.00 / 3 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5) / 2 separate operations

Note that “avg_items_p_trx” can be simplified:


SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  count(*)                / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

Made!

Standardized version

Note that this solution took advantage of the fact that the number of rows per transaction was pre-calculated. We can of course also calculate it on the fly, for example using window functions. If it wasn’t available, we could do it like this:


SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  count(*)                / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / l.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM (
  SELECT 
    l.*,
    count(*) OVER (PARTITION BY l.transaction_id) lines
  FROM lines AS l
) AS l
JOIN transactions AS t ON t.id = l.transaction_id

Or, we turn the entire join into a 1: 1 relationship by pre-aggregating all row data into one row per transaction. This works because we are only calculating the sums from the rows table:


SELECT
  sum(l.profit_per_transaction)   AS total_profit,
  sum(l.total_per_transaction)    AS total_sales_amount,
  sum(l.quantity_per_transaction) AS total_items_sold,
  avg(l.lines_per_transaction)    AS avg_items_p_trx,
  avg(t.price)                    AS avg_price_p_trx,
  avg(t.profit)                   AS avg_profit_p_trx
FROM (
  SELECT 
    l.transaction_id
    sum(l.profit)   AS profit_per_transaction,
    sum(l.total)    AS total_per_transaction,
    sum(l.quantity) AS quantity_per_transaction,
    count(*)        AS lines_per_transaction
  FROM lines AS l
  GROUP BY l.transaction_id
) AS l
JOIN transactions AS t ON t.id = l.transaction_id



Source link

Leave a Reply

Your email address will not be published.