Re: Speeding up an in-progress wraparound-preventing vacuum

Vincent de Phily <vincent.dephily@xxxxxxxxxxxxxxxxx> · Tue, 09 Dec 2014 01:58:11 +0100

On Monday 08 December 2014 10:17:37 Jeff Janes wrote:
> On Mon, Dec 8, 2014 at 4:54 AM, Vincent de Phily <
> 
> > bloat, which I'd like to get back asap). Currently about 80% of the IO is
> > devoted to the vacuum process (on average throughout the day, as
> > extrapolated
> > from atop output).
> 
> Is that 80% of the actually occurring IO, or 80% of the maximum possible IO?

80% of max possible IO, although I'm not sure how atop determines what the max 
is. It's a fairly reliable metric of "the drive is a busy as can be" in my 
experience.

> Increasing maintenance_work_mem even further, at least temporarily and
> locally for this operation, might be a good idea.

Ok, I thought 512M was already overgenerous, but I'll try increasing that too.

> > My first question is: is there a way to speedup the currently-running
> > vacuum
> > process ? I guess killing it to let it be replaced by a more agressively-
> > configured one would work, but I'd hate to lose 2 weeks of processing (is
> > there a way to estimate how much more vacuuming work remains to be done
> > ?),
> > and I'm being a bit more cautious with the wraparound-preventing kind.
> 
> I don't know of a way to speed it up gracefully.  That has frustrated me a
> few times, and a solution would really be nice.
> 
> If you kill it, the new process will have to re-read the entire table, but
> it will have much less work to do since the killed process already 'plowed
> a path' for it.  In a sense, killing the process will throw away all of the
> sequential read work on the table that has already been done, but the index
> reads and the writing workload is not all lost, it will save the new
> process time on those.

Ah good point, even if it has to re-read all the pages, it'll only have to 
write fozenxid for the pages that weren't processed before, that's good to 
hear.

Isn'there also something about vacuum marking a page as "all empty", so that 
it can be skiped by the next run ?

I don't get what index read is saved between vacuum runs ? I have 64G of RAM 
on this box, so there's no hope of the index staying in memory.

> You can `strace` for the lseek command to see which file handles it is
> currently working on, and
> use lsof to turn those into names.  You want to look at where it is in the
> table files, not the index files.

Neat trick, thanks. I'll try it in the morning when I'm fresh. Are the files 
read in C-sorting order ?

> > Second question is: how come we reached the wraparound threshold on this
> > table
> > at all ? We've only been keeping 28 days of data in this table, doing
> > incremental deletes every day (there are no updates at all). I find it
> > very
> > unlikely that we'd go through 2M transactions in that timespan (that'd
> > need
> > 890 transactions per second, we're well below 100). The
> > pg_class.relfozenxid
> > on that table is at 680M, while most other tables are around 860M. Could
> > it be
> > that routine vacuums haven't been able to update the relfrozenxid in a
> > long
> > long time, or am I missing something else ?
> 
> PostgreSQL doesn't know what your delete routine is like.  It has to verify
> with its own eyes that there are no rows over a certain age.

My assumption was that routine vacuuming was able to update the table's 
relfroxenxid, but I realize now that PG needs to do a full scan before 
updating that.

> I don't think that routine vacuums even attempts to update relfrozenxid, or
> at least doesn't try very hard.

AFAIU it does (the 6th paragraph in 23.1.4 seems to imply it), and 
vacuum_freeze_min_age controls when. Perhaps lowering that value would help 
prepare a lot of the anti-wraparound work.

Pity there's no "frozen pages map" (or is there ?) to keep track of pages will 
all-frozen tuples, it could speed up that anti-wraparound vacuum.

> Are you sure that routine vacuums have been running to completion on this
> table, as opposed to getting interrupted by something before finishing each
> time?

I checked pg_stat_user_tables.last_{auto_,}vacuum for this table and they are 
both null. This is seriously worrying. I've seen autovacuum take a few days on 
this table but always assumed that it finished properly. And I'm pretty sure 
vacuuming does some work, otherwise my disk usage woul grow much faster. I 
have no idea what could cause vacuuming to systematically stop before the end. 
Maybe I'll find something in the logs.

Thanks for your answers.

-- 
Vincent de Phily

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general