Ned Wolpert <ned.wolpert@xxxxxxxxxxxxx> wrote:

> I'm doing a postmortem on a corruption event we had. I have an
> idea on what happened, but not sure. I figure I'd share what
> happened and see if I'm close to right here.
>
> Running 9.1.6 with hot-standby, archiving 4 months of wal files,
> and even a nightly pg_dumpall.

Those WAL files aren't going to be of much use without an occasional
base backup to apply them to.

> 50G database. Trying to update or delete a row in a small (21 row,
> but heavily used) table would lock up completely. Never finish.

How long is "never", in hours or days?

> Removed all clients, restarted the db instance, no joy. Checked
> pg_stat_activity, and nothing that wasn't idle.... ran the delete,
> locked up.

Did you have any rows in pg_prepared_xacts that had lingered for
longer than you were waiting for the delete?

> Ran (SELECT * FROM pg_class JOIN pg_locks ON pg_locks.relation =
> pg_class.oid;) with no clients touching this row

Of course you need to be connected to the right database when you run
this, and you need to look at relation locks -- row locks don't show
up in that view.

> fresh restart of the db, and saw virtualtransactions against this
> same table where the values would be -1/nnnnn where nnnnn was a
> huge number. Turns out we had about 159 entries from different
> tables in the database.

Sounds like prepared transactions. Again, what was in
pg_prepared_xacts?

> Checked hot-standby and, of course, no locks or anything. Switched
> to hot-standby.

Meaning you promoted it to master?

> Hot-standby instantly gained these locks. Also noticed that 2
> prepared transactions migrated to the hot-standby. Binary upgraded
> to 9.1.8, locks still existed. Ended up executing the one thing we
> knew would work. Take the site down, pg_dumpall to fresh instance.
> Everything is fine.

That all sounds consistent with a flaky transaction manager.

> A little more background: We were running 9.1.4 back when 9.1.6
> came out. We saw there were possible corruption issues and did a
> binary upgrade and reindexing. Everything seemed to be fine, but I
> wonder if we really had problems back then. We rebuilt the
> hot-standby after the binary upgrade via normal restore and
> wal-file replays. I should also note that the row that had the lock
> on it that would not go away was created by an app server that was
> killed (via -9) since it was non-responsive, and the row's 'create
> date' (in db and logs) is the exact time the app server was killed.

It sounds as though you abruptly terminated the process running your
transaction manager, which left it unaware of one or more dangling
prepared transactions. Further, it sounds like your transaction
manager doesn't go looking for such detritus. If it's not going to
watch for such problems, you had better do so. Any prepared
transaction which has been sitting in pg_prepared_xacts for more than
a few seconds I would consider suspect. After a few hours I would
consider it a problem. After a day I would consider the transaction
manager to have fallen on its face, and would go clean things up by
either committing or rolling back the prepared transaction(s).
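If you wind up doing that cleanup by hand, queries along these lines
are a reasonable starting point. (A rough sketch, not tested against
your setup; the '5 minutes' threshold and 'example_gid' are only
placeholders -- substitute whatever gid the first query reports.)

  -- prepared transactions that have been pending suspiciously long
  SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts
   WHERE prepared < now() - interval '5 minutes';

  -- relation locks still held by prepared transactions in the
  -- database you are connected to
  SELECT l.virtualtransaction, c.relname, l.mode, l.granted
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
   WHERE l.virtualtransaction LIKE '-1/%';

  -- resolve each one by gid, once you know whether it should
  -- commit or roll back
  COMMIT PREPARED 'example_gid';
  -- or
  ROLLBACK PREPARED 'example_gid';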
> I was wondering if a) these virtualtransactions that start with
> '-1/' indicate a problem, b) if this could have happened from pre
> 9.1.6 corruption that was fixed in 9.1.6. Or, could this have
> occurred when we killed that app server? Or.... am I looking in
> the wrong place.

A -1 in place of the process number doesn't indicate a problem per
se. It indicates the transaction has been "prepared" and is no longer
associated with a backend process or connection. Something which was
assuming the role of a transaction manager told a transaction (or
many of them) to prepare for commit as part of a distributed
transaction. A transaction which says it successfully completed such
a PREPARE TRANSACTION statement must hold all its locks and keep all
changes pending until it is told to commit or roll back, even across
database restarts. It sounds like things were left in this state for
a very long time, which can lead to all kinds of problems, notably
bloat and blocking.

> I do still have the old data directories so I can start them up
> and check out the dataset. Any advice?

I would start it up and see what's in pg_prepared_xacts.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company