Michal Charemza <michal@xxxxxxxxxxxxx> writes:
> Tom Lane <tgl@xxxxxxxxxxxxx> writes:
>> It's not very clear what you mean by "sometimes".  Is the slowness
>> reproducible for a particular user and role configuration, or does
>> it seem to come and go by itself?

> Ah it's more come and go by itself - as in one connection takes 30 seconds,
> then the next say 0.06s. It's happened for every user we've tried. Even
> more anecdotally, I would say it happens more when the database is busy in
> terms of tables being dropped/created and permissions changing.

OK, that pretty much eliminates the idea that it's a new manifestation
of the catcache-inefficiency problem.  Vijaykumar may well have the
right idea, that it's a form of catalog bloat.  Do you do bulk
permissions updates that might affect thousands of role memberships
at once?

> Also: realise we did have one user that was directly a member of
> several thousand roles, but indirectly several million. It would sometimes
> take 10 minutes for that user to connect. We've since changed that to one
> role, and that user connects fine now.

Interesting --- but even for that user, it was sometimes fast to connect?

> I am wondering - what happens on connection? What catalogue tables does
> PostgreSQL check and how? What's allowed to happen concurrently and what
> isn't? If I knew, maybe I could come up with a reproduction script that
> does reproduce the issue?

Well, it's going to be looking to see that the user has CONNECT
privileges on the target database.  If that db doesn't have public
connect privileges, but only grants CONNECT to certain roles, then
we'll have to test whether the connecting user is a member of those
roles --- which involves looking into pg_auth_members and possibly
even doing recursive searches there.  For the sort of setup you're
describing, with thousands of role grants (pg_auth_members entries),
it's not hard to imagine that search being rather expensive.
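Purely for illustration (the server does this walk in C during
authentication, not via SQL), a recursive query over pg_auth_members
shows the shape of the work involved; 'some_user' here is a
placeholder:

```sql
-- Sketch of the recursive membership search the server must perform.
-- Each level joins back into pg_auth_members; with thousands of
-- direct grants and millions of indirect ones, this fans out fast.
WITH RECURSIVE memberships AS (
    SELECT roleid
    FROM pg_auth_members
    WHERE member = 'some_user'::regrole
  UNION
    SELECT m.roleid
    FROM pg_auth_members m
    JOIN memberships ms ON m.member = ms.roleid
)
SELECT roleid::regrole FROM memberships;
```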
What remains to be explained is how come it's only expensive sometimes.

The catalog-bloat idea comes from thinking about how Postgres handles
row updates.  There will be multiple physical copies (row versions) of
any recently-updated row, and this is much more expensive to scan than
a static situation with only one live row version.  First, just because
we have to look at more than one copy; second, because testing whether
that copy is the live version is noticeably more expensive if it's
recent than once it's older than the xmin horizon; and third, because
if we are the first process to scan it since it became
dead-to-everybody, then it's our responsibility to mark it as
dead-to-everybody, so that we have to incur additional I/O to do that.

A plausible idea for particular connection attempts being slow is that
they came in just as a whole lot of pg_auth_members entries became
dead-to-everybody, and hence they were unlucky enough to get saddled
with a whole lot of that hint-bit-updating work.  (This also nicely
explains why the next attempt isn't slow: the work's been done.)  But
this is only plausible if you regularly do actions that cause a lot of
pg_auth_members entries to be updated at the same time.  So we still
don't have good insight into that, and your test script isn't shedding
any light.

A couple of other thoughts:

* I don't think your test script would show a connection-slowness
problem even if there was one to be shown, because you forgot to
revoke the PUBLIC connect privilege on the postgres database.  I'm
fairly sure that if that exists it's always noticed first, bypassing
the need for any role membership tests.  So please confirm whether
your production database has revoked PUBLIC connect privilege.

* It could be that the problem is not associated with the database's
connect privilege, but with role membership lookups triggered by
pg_hba.conf entries.  Do you have any entries there that require
testing membership (i.e. the role column is not "all")?
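If it helps while investigating, something along these lines would let
you check both points (take it as a sketch, not a diagnostic recipe):

```sql
-- Dead-tuple accumulation on pg_auth_members would support the
-- bloat/hint-bit theory:
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_sys_tables
WHERE relname = 'pg_auth_members';

-- A NULL datacl, or an entry like "=Tc/postgres", means PUBLIC can
-- connect, so role membership wouldn't even be consulted:
SELECT datname, datacl FROM pg_database WHERE datname = 'postgres';

-- To make the test script exercise the membership path at all:
REVOKE CONNECT ON DATABASE postgres FROM PUBLIC;
```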
regards, tom lane