SQL DISTINCT is not a function – Java, SQL and jOOQ.


A very common misconception I often encounter among SQL users is the idea that DISTINCT is something like a function that can take arguments in parentheses. Just recently, I saw this Stack Overflow question where the OP was looking for a way to express this in jOOQ:


SELECT DISTINCT (emp.id), emp.fname, emp.name
FROM employee emp;

Notice the parentheses around (emp.id), which make this look like a special kind of DISTINCT usage, akin to a DISTINCT function. The idea is often that:

  • The behavior is somehow different from omitting the parentheses
  • Performance is better, because only the ID needs to be taken into account for the distinction

This is a mistake

These claims are incorrect, of course. There is no semantic or performance difference between the two. Parentheses are simply parentheses around a column expression, in the same way you would use parentheses to influence the precedence of operators. Think of it this way:


SELECT DISTINCT (emp.id + 1) * 2, emp.fname, emp.name
FROM employee emp;

In the example above, we do not apply a “DISTINCT function” to the expression emp.id + 1. We just put parentheses around the column expression emp.id + 1 to make sure the addition happens before the multiplication. The DISTINCT operator is applied after the projection, always. If SQL had used a more logical syntax, rather than following English grammar (it was originally called Structured English QUEry Language, or SEQUEL), then we would write the OP's statement like this:


FROM employee
SELECT id, fname, name
DISTINCT

Again, the DISTINCT operation still happens after the projection (the content of the SELECT clause) and applies to the entirety of the projection. There is no way in standard SQL to apply the distinction only to parts of the projection (there is in PostgreSQL, see below).
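The claim that the parentheses change nothing is easy to check. The following is a minimal sketch, assuming Python's standard sqlite3 module and an in-memory SQLite database rather than the OP's MySQL; the table contents, including the deliberate duplicate row, are made up for illustration:

```python
import sqlite3

# In-memory SQLite database; the employee rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, fname TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "A", "A"), (1, "B", "B"), (1, "B", "B")],  # last row is a true duplicate
)

# The OP's query, with parentheses around emp.id.
with_parens = conn.execute(
    "SELECT DISTINCT (emp.id), emp.fname, emp.name "
    "FROM employee emp ORDER BY 1, 2, 3"
).fetchall()

# The same query without the parentheses.
without_parens = conn.execute(
    "SELECT DISTINCT emp.id, emp.fname, emp.name "
    "FROM employee emp ORDER BY 1, 2, 3"
).fetchall()

print(with_parens == without_parens)  # True
print(with_parens)                    # [(1, 'A', 'A'), (1, 'B', 'B')]
```

Both queries remove only the row that is duplicated across the whole projection. The two rows sharing id 1 but differing in fname and name both survive, with or without parentheses.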

To clarify this a bit more, I recommend you read our previous blog post on logical order of operations in SQL, and how DISTINCT and ORDER BY relate.

What would that mean anyway?

We can turn the question back to the OP and ask what it would even mean for a DISTINCT operation to be applied to a single column. Suppose we have this data set:


|id |fname|name|
|---|-----|----|
|1  |A    |A   |
|1  |B    |B   |

If we applied DISTINCT only to the ID column (and didn’t project anything else), obviously we would only get a single row as a result:


SELECT DISTINCT id FROM employee

|id |
|---|
|1  |

But if we also wanted to project FNAME and NAME, which row would “win”? Would we show the first row, the second row, or any random row? The behavior would be undefined, and SQL doesn’t like undefined behavior, so this is not possible. The only reasonable application of DISTINCT is always to the entire projection.
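The effect of the projection width on DISTINCT can be reproduced with the two-row data set from above. Again, this is only a sketch that assumes Python's sqlite3 module with an in-memory SQLite database; any SQL engine should behave the same way for these two standard queries:

```python
import sqlite3

# The two-row data set from the article, loaded into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, fname TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "A", "A"), (1, "B", "B")],
)

# Projecting only the ID: both rows collapse into one.
only_id = conn.execute("SELECT DISTINCT id FROM employee").fetchall()

# Projecting all three columns: the rows differ, so both are kept.
full_projection = conn.execute(
    "SELECT DISTINCT id, fname, name FROM employee ORDER BY fname"
).fetchall()

print(only_id)          # [(1,)]
print(full_projection)  # [(1, 'A', 'A'), (1, 'B', 'B')]
```

As soon as FNAME and NAME are part of the projection, the two rows are no longer duplicates of each other, so there is nothing for DISTINCT to remove.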

Exception: PostgreSQL

Fortunately (or to add more to the syntactic confusion: unfortunately), PostgreSQL has implemented an extension to the SQL standard. With DISTINCT ON, it is indeed possible to apply the distinction only to parts of the projection:


WITH emp (id, fname, name) AS (
  VALUES (1, 'A', 'A'),
         (1, 'B', 'B')
)
SELECT DISTINCT ON (id) id, fname, name
FROM emp
ORDER BY id, fname, name

The output is now what the OP wanted (but couldn’t use, since they were using MySQL):


|id |fname|name|
|---|-----|----|
|1  |A    |A   |
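For readers without a PostgreSQL instance at hand, the row-picking rule of DISTINCT ON can be mimicked in plain Python: sort the rows as the ORDER BY clause would, then keep the first row of each group of equal ON values. This is only an emulation of the semantics, not how PostgreSQL implements the feature, and the variable names are made up for illustration:

```python
from itertools import groupby
from operator import itemgetter

# The article's two rows, deliberately listed in "wrong" order
# to show that the sort, not the input order, decides the winner.
rows = [(1, "B", "B"), (1, "A", "A")]

# ORDER BY id, fname, name: tuples sort lexicographically, column by column.
ordered = sorted(rows)

# DISTINCT ON (id): keep the first row of each run of equal ids.
distinct_on_id = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]

print(distinct_on_id)  # [(1, 'A', 'A')]
```

This answers the "which row wins" question for DISTINCT ON: the ORDER BY clause decides, which is why the row with fname A is returned here.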

Personally, I don’t like DISTINCT ON. While it is very useful, without a doubt, it makes something that is already very difficult to explain to SQL newbies even more complicated. With a “more reasonable” syntax, the query would be written like this:


FROM emp
SELECT id, fname, name
ORDER BY id, fname, name
DISTINCT ON (id) 

With this syntactic order of operation, there would be no doubt about the semantics of DISTINCT or DISTINCT ON.


