On Thu, Jul 2, 2020 at 5:43 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Anders Steinlein <anders@xxxxxx> writes:
>> Even that perhaps isn't conclusive, so you could
>> also try comparing the pg_rewrite.ev_action fields for the views'
>> ON SELECT rules. (That might be a bit frustrating because of likely
>> inconsistencies in node "location" fields; but any other difference
>> is cause for suspicion.)
> You're right, ev_action is indeed different:
> ...
> Is there somehow I can format them to make it easier to compare? My basic
> attempts didn't help me much. I put them up in all their glories in
> pastebins, since they are rather large. Please let me know if there is
> somehow I can make this easier to look into.
Yeah, _expression_ trees are pretty unreadable :-(. I downloaded these,
changed all the "location" fields to -1 to make them more comparable,
and behold there are still a bunch of diffs. Here's one:
[...]
This is the internal form of a "JOIN ... USING (email)" construct.
I didn't try to trace this back to exactly where it was in the source
queries. The important thing here is that we have a couple of Vars
of type 106893, which I gather must be citext or a domain over it.
Yes, we have a domain called `email` over citext. The reason for the multiple casts to citext, IIRC, was that i.e. array_agg() didn't accept the email domain directly.
In the first tree, those are coerced via a no-op RelabelType operation
into plain text (type OID 25) and then compared with the built-in texteq
operator. In the second tree, they are coerced to some other non-built-in
type (maybe plain citext?) and then compared with operator 106108.
I am betting that 106084 is citext, 106108 is citext's equality operator,
and the net implication of all this is that the original matview is doing
the JOIN using case-sensitive equality whereas the new one is using
case-insensitive equality.
This makes sense. We've narrowed the different results down to this exact case; that is, for the rows that are missing in the old matview, they have "email" entries in different case between some of the tables. So case-sensitive equality checks on these will naturally be lost somewhere. How this came to be is another question...
A plausible explanation for how things got that way is that citext's
equality operator wasn't in your search_path when you created the original
matview, but it is in view when you make the new one, allowing that
equality operator to capture the interpretation of USING.
Possibly, since this view has existed for many years. However in general, our multi-tenant migration system does SET search_path TO <tenant>, public for all DDL we do, so unless we had an issue whey back when it should've been present.
Unfortunately,
since the reverse-listing of this join is just going to say "USING
(email)", there's no way to detect from human-readable output that the
interpretation of the USING clauses is different. (We've contemplated
introducing not-SQL-standard syntax to allow flagging such cases, but
haven't pulled the trigger on that.)
If I'm reading this correctly, would this be a "reason" to be more explicit when doing joins involving non-standard data types? I.e. would it be "safer" to do ON x1.email::citext == x2.email::citext instead of USING (if there is any difference at all...)?
I count five places in the query with similar operator substitutions.
There are some other diffs in the trees that are a bit odd, but might be
explained if the new view was made by dump/reload rather than from the
identical SQL text the original view was made from; they all look like
they are references to JOIN output columns rather than the underlying
table columns or vice versa. That's probably harmless, but the different
join operators certainly are not.
I looked over the procedure we used for the upgrade, and it was this: Postgres 9.4 server backup using WAL-E, restored into 9.4 on a new box, where we did some data cleanup and upgraded extensions including citext. Then used pg_dump from 12, directory format, and restored into 12. Since that time no DDL or extension changes have been made on the tables/view involved here. Could the citext upgrade have had any effect here, messing with dependencies somehow?
Anyway, I think I'll just recreate these views for all tenants to be on the safe side. Still curious about how this came to be, but since it's about 10 years of history in this database I guess it will be hard to figure anything out for sure.
Thanks a lot for looking into this, let me know if there's any reason to dig further.
Best,
-- a.