# Calculation of weighted averages when joining tables in SQL – Java, SQL and jOOQ.

I fell on a very interesting jOOQ question on Stack Overflow which required the calculation of a weighted average. Why is that.

### description of the problem

Assuming you have this database (using PostgreSQL syntax):

```
create table transactions (
id     bigint         not null primary key,
lines  bigint         not null,
price  numeric(18, 2) not null,
profit numeric(18, 2) not null
);

create table lines (
id             bigint         not null primary key,
transaction_id bigint         not null references transactions,
total          bigint         not null,
quantity       bigint         not null,
profit         numeric(18, 2) not null
);
```

As we can see, this diagram is slightly denormalized because the number of rows per transaction is precomputed in the `transactions.lines` column. This will prove to be very useful for this calculation, but it is not strictly necessary.

Now, in the previously linked Stack Overflow question, a report was desired which would calculate:

• An aggregation of the sums provided by the posts
• An aggregation of the averages provided by the transactions

It would be simple with two separate requests:

Amounts provided by posts

```
SELECT
sum(profit)   AS total_profit,
sum(total)    AS total_sales_amount,
sum(quantity) AS total_items_sold
FROM lines
```

Averages provided by transactions

```
SELECT
avg(lines)  AS avg_items_p_trx,
avg(price)  AS avg_price_p_trx,
avg(profit) AS avg_profit_p_trx
FROM transactions
```

So far, so good.

### Do it in one request

Now these queries are simplified compared to the original, which had to join the two tables in order to add additional predicates. Suppose also that these tables are quite large. Therefore, running two queries may cause the report to be sluggish. A single request would be much better.

We could try to simply combine the two:

```
-- Wrong query
SELECT
sum(l.profit)   AS total_profit,
sum(l.total)    AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
avg(t.lines)    AS avg_items_p_trx,
avg(t.price)    AS avg_price_p_trx,
avg(t.profit)   AS avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
```

But this request is false. While the sums are always correct, the averages are not, simply because the join produces duplicate transaction rows per row. Imagine a transaction with 3 or 5 lines:

```
SELECT
l.id    AS line_id,
t.id    AS transaction_id,
t.lines,
t.price
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
```

The output would be:

```
LINE_ID    TRANSACTION_ID    LINES    PRICE
-------------------------------------------
1          1                 3        20.00
2          1                 3        20.00
3          1                 3        20.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
```
• The average number of “avg_items_p_trx” lines should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate `avg(t.lines)` on the set of data, we obtain 4.25 (3 × 3 lines + 5 × 5 lines) / 8 items.
• The average price “avg_price_p_trx” must be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate `avg(t.price)` on the data set, we get 80.00 (3 × 20.00 + 5 × 100.00) / 8 elements.

### How can this be resolved?

Since every transaction is duplicated due to the join with rows, we need to calculate a weighted average, not an ordinary average. The idea is that instead of using the `AVG()` aggregate function, now we need to divide the value that we want to average by the number of elements (i.e. the number of times the value is repeated because of the join) and then divide the sum of this division by the number of transactions.

Prose never describes logic well, so let’s code. The correct query is:

```
SELECT
sum(l.profit)   AS total_profit,
sum(l.total)    AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
sum(t.lines  / t.lines) / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
```

With the above data set:

```
LINE_ID  TRANSACTION_ID  LINES  LINES/LINES   PRICE  PRICE/LINES
----------------------------------------------------------------
1        1               3      1             20.00         6.66
2        1               3      1             20.00         6.66
3        1               3      1             20.00         6.66
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
```

We now get the correct weighted averages:

• The average number of “avg_items_p_trx” lines is now 4 =
(3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / separate operations
• The average price “avg_price_p_trx” is now 60.00 =
(20.00 / 3 + 20.00 / 3 + 20.00 / 3 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5 + 100.00 / 5) / 2 separate operations

Note that “avg_items_p_trx” can be simplified:

```
SELECT
sum(l.profit)   AS total_profit,
sum(l.total)    AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
count(*)                / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
```

### Standardized version

Note that this solution took advantage of the fact that the number of rows per transaction was pre-calculated. We can of course also calculate it on the fly, for example using window functions. If it wasn’t available, we could do it like this:

```
SELECT
sum(l.profit)   AS total_profit,
sum(l.total)    AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
count(*)                / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price  / l.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM (
SELECT
l.*,
count(*) OVER (PARTITION BY l.transaction_id) lines
FROM lines AS l
) AS l
JOIN transactions AS t ON t.id = l.transaction_id
```

Or, we turn the entire join into a 1: 1 relationship by pre-aggregating all row data into one row per transaction. This works because we are only calculating the sums from the rows table:

```
SELECT
sum(l.profit_per_transaction)   AS total_profit,
sum(l.total_per_transaction)    AS total_sales_amount,
sum(l.quantity_per_transaction) AS total_items_sold,
avg(l.lines_per_transaction)    AS avg_items_p_trx,
avg(t.price)                    AS avg_price_p_trx,
avg(t.profit)                   AS avg_profit_p_trx
FROM (
SELECT
l.transaction_id
sum(l.profit)   AS profit_per_transaction,
sum(l.total)    AS total_per_transaction,
sum(l.quantity) AS quantity_per_transaction,
count(*)        AS lines_per_transaction
FROM lines AS l
GROUP BY l.transaction_id
) AS l
JOIN transactions AS t ON t.id = l.transaction_id
```