Re: Dumping a database that is not accepting commands?

Natalie Wenz <nataliewenz@xxxxxxxxxxx> · Tue, 17 Sep 2013 11:48:10 -0500

On Sep 17, 2013, at 7:43 AM, Kevin Grittner <kgrittn@xxxxxxxxx> wrote:

Natalie Wenz <nataliewenz@xxxxxxxxxxx> wrote:

I have a large database from our test environment that got into trouble with
some high volume and some long-running queries about…six weeks ago? We have a
buffer mechanism that has been storing the new data since the database stopped
accepting connections, so we haven't really lost any data, which is good.
But the single-user backend vacuum freeze is still grinding away, using 100% cpu
most of the time, except when it's doing a lot of writes. We did cancel the
vacuum once so we could stop the database and take a snapshot of the filesystem
(we are using ZFS on FreeBSD) and copy the snapshot to another machine. This
allowed us to get a fresh database started (with 9.3!) where we could unspool
the last six weeks of data into a fresh database, and be able to access at least
that much of our data.

I'm not sure how you could have done that without dealing with the
wraparound before the upgrade.

Sorry; my description of what is going on was a little unclear. We didn't upgrade the existing database. We moved it to different hardware, and just created a brand new database to accept the data that had been backing up in sqlite files while our original database was offline. I'm still dealing with the wraparound on the original, just on a different machine.

Now:
I have a copy of the database (with data from all time up until the database
shut itself down six weeks ago) that I just need the data from. I am becoming
impatient with the vacuum, as it appears to have not even started working on the
files for one of the largest relations in the database (that table was about
14TB last I saw). I'm trying to find alternatives to waiting another who
knows how many weeks for the vacuum to finish just to have the database in a
state where I can dump the data out, since this is no longer the
"live" version. This copy running on hardware with plenty of space to
work with. The database has about a million transactions before it wraps.

The copy running on 9.3, or the original?
The brand new database on 9.3 is running great. The original, running on 9.1, is the one I'd like to extract data from (it is currently vacuuming). After I get the data out of the original, whether I wait for the vacuum to finish or not, I'll have to manually import all of the old data into the new database, but it was decided that that was a reasonable tradeoff if it meant we could get our current data out of the buffer files and get our regular tools and reports back online today. We want all of the older data eventually, but it's not critical for our day-to-day operation.  

Is it possible to somehow just dump the data using pg_dump or pg_dumpall? I
haven't tried to see if those utilities will work when the database is
protecting itself from data loss. If it were possible, would it be wise (do
those utilities use tons of transactions to dump, or would it be safely within
the million or so that I have)? I suppose I could use copy?

pg_dump uses COPY (by default, anyway), and does all its work in a
single transaction.  But it needs to start that transaction.

I'll give that a try. It sounds like I technically have enough transactions left to copy the data out, so that is good. Hopefully the database will let me use that utility. Otherwise I'll just do a copy from the backend. 

Would there be any risks if I were to do that? Maybe none of this
is risky at this point because we can always clone the original
snapshot, and try again.

I'm not sure that pg_dump followed by restore would be expected to
be faster than finishing the VACUUM, unless that is configured to
pace itself way too slowly.

That is actually a great point. I guess I'm anxious to *do* something, but maybe it will be so slow I might not gain much. I have been wondering about our vacuum configuration, actually. I would absolutely welcome feedback in this area, either in general, how a database like this should be tuned, and also if there is any tuning to change when there is an emergency backend vacuum that has to be run, in the interest of the vacuum running as aggressively as possible.

 version                         | PostgreSQL 9.1.9 on x86_64-unknown-freebsd9.1, compiled by gcc (GCC) 4.2.1 20070831 patched [FreeBSD], 64-bit
 autovacuum                      | on
 autovacuum_analyze_scale_factor | 0.1
 autovacuum_freeze_max_age       | 800000000
 autovacuum_max_workers          | 3
 autovacuum_vacuum_cost_delay    | 0
 autovacuum_vacuum_scale_factor  | 0.1
 checkpoint_segments             | 128
 effective_cache_size            | 12GB
 listen_addresses                | *
 log_autovacuum_min_duration     | 10s
 log_destination                 | stderr
 log_filename                    | logfile-%A.log
 log_line_prefix                 | %t:%u:%r:[%p]: 
 log_rotation_age                | 1d
 log_rotation_size               | 1GB
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 10GB
 max_connections                 | 500
 max_stack_depth                 | 2MB
 random_page_cost                | 1
 seq_page_cost                   | 1
 shared_buffers                  | 128MB
 synchronous_commit              | off
 temp_buffers                    | 128MB
 TimeZone                        | US/Central
 vacuum_cost_limit               | 500
 wal_buffers                     | 32MB
 work_mem                        | 256MB

This is the tuning of the original database, anything changed from the default settings. The machine it was running on had 48GB of memory. The database was 36TB, with 2 tables taking up the bulk of that (about 14TB each), and about 10 other tables and a few large indexes making up the rest. Our typical usage pattern is mostly inserts, with a some hourly summaries (which take maybe 5 minutes), some daily summaries (which take about 20-40 minutes), and a couple of end of month queries that take several hours. We have the same setup and tuning in production, which is about the same size, with an additional end of month query that runs off one of the 14TB tables, which can take 4-7 days. 

Any ideas or suggestions?

After getting past this crisis, I would take a close look at your
vacuuming regimen -- it sounds like it is not aggressive enough to
keep you out of trouble.

Normally the autovacuum seem to keep up sufficiently. We got into trouble with a bad combination of not-typical long-running queries, a disk failure and subsequent zpool repair, and the only person who checks the log files regularly (me) was out of the office. This has been so painful and slow to recover from, I don't think we'll ever get into this mess again. (At least not quite like this. I seem to have a knack for finding *new* ways to break things.) I will also take this opportunity to mention again that if anyone is considering a making a patch for 64-bit xids, you would make at least one small group of people very, very happy.  :)

Side question: some of the longer queries that were running when the database yakked were deletes of old data that had been manually migrated to a new table with a more appropriate format. We were running out of disk space, so we were trying to clear up some space by removing data we now had in two places. Does a delete of previously-frozen rows unfreeze them, or anything like that? Because in a series of maybe a dozen queries or so, we deleted billions of rows. Does that generate a significant amount of extra work for the autovacuumer? 

I'm sorry that I don't have a better suggestion for resolving the
crisis than running VACUUM at maximum speed.

Thanks for your feedback! It's kind of a dismal situation, but I'm anxious to learn whatever I can from it. 

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin