Michal Charemza <michal@xxxxxxxxxxxxx> writes:
> Tom Lane <tgl@xxxxxxxxxxxxx> writes:
>> It's not very clear what you mean by "sometimes".  Is the slowness
>> reproducible for a particular user and role configuration, or does
>> it seem to come and go by itself?

> Ah it's more come and go by itself - as in one connection takes 30 seconds,
> then the next say 0.06s. It's happened for every user we've tried. Even
> more anecdotally, I would say it happens more when the database is busy in
> terms of tables being dropped/created and permissions changing.

OK, that pretty much eliminates the idea that it's a new manifestation
of the catcache-inefficiency problem.  Vijaykumar may well have the
right idea, that it's a form of catalog bloat.  Do you do bulk
permissions updates that might affect thousands of role memberships
at once?

> Also: realise we did have one user that was directly a member of
> several thousand roles, but indirectly several million. It would sometimes
> take 10 minutes for that user to connect. We've since changed that to one
> role, and that user connects fine now.

Interesting --- but even for that user, it was sometimes fast to connect?

> I am wondering - what happens on connection? What catalogue tables does
> PostgreSQL check and how? What's allowed to happen concurrently and what
> isn't? If I knew, maybe I could come up with a reproduction script that
> does reproduce the issue?

Well, it's going to be looking to see that the user has CONNECT
privileges on the target database.  If that db doesn't have public
connect privileges, but only grants CONNECT to certain roles, then
we'll have to test whether the connecting user is a member of those
roles --- which involves looking into pg_auth_members and possibly
even doing recursive searches there.  For the sort of setup you're
describing, with thousands of role grants (pg_auth_members entries),
it's not hard to imagine that search being rather expensive.
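Purely for illustration (the server does this walk in C during
authentication, not via SQL), a recursive query over pg_auth_members
shows the shape of the work involved; 'some_user' here is a
placeholder:

```sql
-- Sketch of the recursive membership search the server must perform.
-- Each level joins back into pg_auth_members; with thousands of
-- direct grants and millions of indirect ones, this fans out fast.
WITH RECURSIVE memberships AS (
    SELECT roleid
    FROM pg_auth_members
    WHERE member = 'some_user'::regrole
  UNION
    SELECT m.roleid
    FROM pg_auth_members m
    JOIN memberships ms ON m.member = ms.roleid
)
SELECT roleid::regrole FROM memberships;
```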
What remains to be explained is how come it's only expensive sometimes.

The catalog-bloat idea comes from thinking about how Postgres handles
row updates.  There will be multiple physical copies (row versions) of
any recently-updated row, and this is much more expensive to scan than
a static situation with only one live row version.  First, just because
we have to look at more than one copy; second, because testing whether
that copy is the live version is noticeably more expensive if it's
recent than once it's older than the xmin horizon; and third, because
if we are the first process to scan it since it became
dead-to-everybody, then it's our responsibility to mark it as
dead-to-everybody, so that we have to incur additional I/O to do that.

A plausible idea for particular connection attempts being slow is that
they came in just as a whole lot of pg_auth_members entries became
dead-to-everybody, and hence they were unlucky enough to get saddled
with a whole lot of that hint-bit-updating work.  (This also nicely
explains why the next attempt isn't slow: the work's been done.)  But
this is only plausible if you regularly do actions that cause a lot of
pg_auth_members entries to be updated at the same time.  So we still
don't have good insight into that, and your test script isn't shedding
any light.

A couple of other thoughts:

* I don't think your test script would show a connection-slowness
problem even if there was one to be shown, because you forgot to
revoke the PUBLIC connect privilege on the postgres database.  I'm
fairly sure that if that exists it's always noticed first, bypassing
the need for any role membership tests.  So please confirm whether
your production database has revoked PUBLIC connect privilege.

* It could be that the problem is not associated with the database's
connect privilege, but with role membership lookups triggered by
pg_hba.conf entries.  Do you have any entries there that require
testing membership (i.e. the role column is not "all")?
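If it helps while investigating, something along these lines would let
you check both points (take it as a sketch, not a diagnostic recipe):

```sql
-- Dead-tuple accumulation on pg_auth_members would support the
-- bloat/hint-bit theory:
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_sys_tables
WHERE relname = 'pg_auth_members';

-- A NULL datacl, or an entry like "=Tc/postgres", means PUBLIC can
-- connect, so role membership wouldn't even be consulted:
SELECT datname, datacl FROM pg_database WHERE datname = 'postgres';

-- To make the test script exercise the membership path at all:
REVOKE CONNECT ON DATABASE postgres FROM PUBLIC;
```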
regards, tom lane