Ned Wolpert <ned.wolpert@xxxxxxxxxxxxx> wrote:

> I'm doing a postmortem on a corruption event we had. I have an
> idea on what happened, but not sure. I figure I'd share what
> happened and see if I'm close to right here.
>
> Running 9.1.6 with hot-standby, archiving 4 months of wal files,
> and even a nightly pg_dumpall.

Those WAL files aren't going to be of much use without an occasional
base backup to apply them to.

> 50G database. Trying to update or delete a row in a small (21 row,
> but heavily used) table would lock up completely. Never finish.

How long is "never", in hours or days?

> Removed all clients, restarted the db instance, no joy. Checked
> pg_stat_activity, and nothing that wasn't idle.... ran the delete,
> locked up.

Did you have any rows in pg_prepared_xacts that had lingered for
longer than you were waiting for the delete?

> Ran (SELECT * FROM pg_class JOIN pg_locks ON pg_locks.relation =
> pg_class.oid;) with no clients touching this row

Of course you need to be connected to the right database when you run
this, and you need to look at relation locks -- row locks don't show
up in that view.

> fresh restart of the db, and saw virtualtransactions against this
> same table where the values would be -1/nnnnn where nnnnn was a
> huge number. Turns out we had about 159 entries from different
> tables in the database.

Sounds like prepared transactions. Again, what was in
pg_prepared_xacts?

> Checked hot-standby and, of course, no locks or anything. Switched
> to hot-standby.

Meaning you promoted it to master?

> Hot-standby instantly gained these locks. Also noticed that 2
> prepared transactions migrated to the hot-standby. Binary upgraded
> to 9.1.8, locks still existed. Ended up executing the one thing we
> knew would work. Take the site down, pg_dumpall to fresh instance.
> Everything is fine.

That all sounds consistent with a flaky transaction manager.

> A little more background: We were running 9.1.4 back when 9.1.6
> came out. We saw there were possible corruption issues and did a
> binary upgrade and reindexing. Everything seemed to be fine, but I
> wonder if we really had problems back then. We rebuilt the
> hot-standby after the binary upgrade via normal restore and
> wal-file replays. I should also note that the row that had the lock
> on it that would not go away was created by an app server that was
> killed (via -9) since it was non-responsive, and the row's 'create
> date' (in db and logs) is the exact time the app server was killed.

It sounds as though you abruptly terminated the process running your
transaction manager, which left it unaware of one or more dangling
prepared transactions. Further, it sounds like your transaction
manager doesn't go looking for such detritus. If it's not going to
watch for such problems, you had better do so. Any prepared
transaction which has been sitting in pg_prepared_xacts for more than
a few seconds I would consider suspect. After a few hours I would
consider it a problem. After a day I would consider the transaction
manager to have fallen on its face, and would go clean things up by
either committing or rolling back the prepared transaction(s).
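If you wind up doing that cleanup by hand, queries along these lines
are a reasonable starting point. (A rough sketch, not tested against
your setup; the '5 minutes' threshold and 'example_gid' are only
placeholders -- substitute whatever gid the first query reports.)

  -- prepared transactions that have been pending suspiciously long
  SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts
   WHERE prepared < now() - interval '5 minutes';

  -- relation locks still held by prepared transactions in the
  -- database you are connected to
  SELECT l.virtualtransaction, c.relname, l.mode, l.granted
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
   WHERE l.virtualtransaction LIKE '-1/%';

  -- resolve each one by gid, once you know whether it should
  -- commit or roll back
  COMMIT PREPARED 'example_gid';
  -- or
  ROLLBACK PREPARED 'example_gid';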
> I was wondering if a) these virtualtransactions that start with
> '-1/' indicate a problem, b) if this could have happened from pre
> 9.1.6 corruption that was fixed in 9.1.6. Or, could this have
> occurred when we killed that app server? Or.... am I looking in
> the wrong place.

A -1 in place of the process number doesn't indicate a problem per
se. It indicates the transaction has been "prepared" and is no longer
associated with a backend process or connection. Something which was
assuming the role of a transaction manager told a transaction (or
many of them) to prepare for commit as part of a distributed
transaction. A transaction which says it successfully completed such
a PREPARE TRANSACTION statement must hold all its locks and keep all
changes pending until it is told to commit or roll back, even across
database restarts. It sounds like things were left in this state for
a very long time, which can lead to all kinds of problems, notably
bloat and blocking.

> I do still have the old data directories so I can start them up
> and check out the dataset. Any advice?

I would start it up and see what's in pg_prepared_xacts.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company