On 9/9/24 06:02, Philip Hazelden wrote:
The MERGE docs[1] give this warning:
Only columns from the target table that attempt to match
`data_source` rows should appear in `join_condition`.
`join_condition` subexpressions that only reference the target
table's columns can affect which action is taken, often in
surprising ways.
(The docs for upcoming v17 have the same line.)
So why should I avoid doing this? It's not clear to me whether the
warning is saying "this likely won't work like you expect because it's
difficult to reason about" or "because the behavior is unspecified" or
"because there's a bug" or what.
I found a thread[2] on the psql-hackers list which has this snippet of
conversation:
* It might make sense to point out in the docs that join_condition
should not filter the target table too much. Like SQL server docs say,
don't put things in the join that filter the target that actually
belong in the WHEN .. AND quals. In a way, this should be obvious,
because it's an outer join. But I don't think it is, and ISTM that the
sensible thing to do is to warn against it.
FYI the SQL Server note does not shed any more light on this:
https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver16
Caution
It's important to specify only the columns from the target table to use
for matching purposes. That is, specify columns from the target table
that are compared to the corresponding column of the source table. Don't
attempt to improve query performance by filtering out rows in the target
table in the ON clause; for example, such as specifying AND NOT
target_table.column_x = value. Doing so can return unexpected and
incorrect results.
This makes me think the warning is trying to say something like: "if
you can move a subexpression from `join_condition` to `WHEN ... AND`,
you should probably do so". Is that right?
(I still don't know *why* I should do that. It sounds like maybe it's
more efficient that way because this is an outer join? But I don't
know why that matters. If I'd had to guess which would be more
efficient, I'd have weakly guessed "prefer to do it in a join, indexes
will get used better that way".)
I would say because you could end with WHEN clauses that are at odds
with the JOIN clause. In other words you throw away rows in the JOIN
that you need later in the WHEN(s). Basically keep rows around until you
are sure they are not needed.
And I think it's not always possible to move a subexpression without
changing `data_source`. In the example I posted above, I don't think
it would work, since we'd no longer have a `NOT MATCHED` on `t2`'s
`(3, '1.3 v2')` row.
[1]: https://www.postgresql.org/docs/current/sql-merge.html
[2]: https://www.postgresql.org/message-id/flat/CANP8%2BjKitBSrB7oTgT9CY2i1ObfOt36z0XMraQc%2BXrz8QB0nXA%40mail.gmail.com
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx