Re: pg_xlog becomes extremely large during CREATE INDEX

"Jeffrey W. Baker" <jwbaker@acm.org> · Wed, 12 May 2004 21:31:43 -0700

On Wed, 2004-05-12 at 20:18, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > I have configured 24 checkpoint segments,
> > which I expect gives me a worst-case usage in pg_xlog of 384MB. 
> > Unfortunately, during the CREATE INDEX, pg_xlog becomes full!
> 
> Postgres version?  AFAIR 7.1.2 or thereabouts would act that way,
> because it wouldn't discard log segments containing work of open
> transactions.

This is 7.4.1.

> The only reason I can think of for such behavior in a current version
> is if the checkpoint isn't completing for some reason.  Have you looked
> in the postmaster log to see if checkpoint is bleating about anything?

Oh sure, it's bleating.  Apparently my computer is too fast:

syslog:May 12 21:20:32 mistral postgres[17458]: [214-1] LOG:  checkpoints are occurring too frequently (12 seconds apart)
syslog:May 12 21:20:32 mistral postgres[17458]: [214-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".

That's also what preceeded the most recent crash:

May 12 16:36:53 mistral postgres[506]: [173-1] LOG:  checkpoints are occurring too frequently (16 seconds apart)
May 12 16:36:53 mistral postgres[506]: [173-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
May 12 16:37:08 mistral postgres[506]: [174-1] LOG:  server process (PID 16403) was terminated by signal 6
May 12 16:37:08 mistral postgres[506]: [175-1] LOG:  terminating any other active server processes
May 12 16:37:08 mistral postgres[26212]: [3-1] WARNING:  terminating connection because of crash of another server process
May 12 16:37:08 mistral postgres[26212]: [3-2] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server
May 12 16:37:08 mistral postgres[26212]: [3-3]  process exited abnormally and possibly corrupted shared memory.
May 12 16:37:08 mistral postgres[26212]: [3-4] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
May 12 16:37:08 mistral postgres[14396]: [1-1] WARNING:  terminating connection because of crash of another server process
May 12 16:37:08 mistral postgres[14396]: [1-2] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server
May 12 16:37:08 mistral postgres[14396]: [1-3]  process exited abnormally and possibly corrupted shared memory.
May 12 16:37:08 mistral postgres[14396]: [1-4] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
May 12 16:37:08 mistral postgres[14396]: [1-5] CONTEXT:  writing block 56889 of relation 17142/289600495
May 12 16:37:08 mistral postgres[506]: [176-1] LOG:  all server processes terminated; reinitializing
May 12 16:37:08 mistral postgres[17086]: [177-1] LOG:  database system was interrupted at 2004-05-12 16:36:53 PDT
May 12 16:37:08 mistral postgres[17086]: [178-1] LOG:  checkpoint record is at 6/92B32D48
May 12 16:37:08 mistral postgres[17086]: [179-1] LOG:  redo record is at 6/92B32D48; undo record is at 0/0; shutdown FALSE
May 12 16:37:08 mistral postgres[17086]: [180-1] LOG:  next transaction ID: 10232; next OID: 289600632
May 12 16:37:08 mistral postgres[17086]: [181-1] LOG:  database system was not properly shut down; automatic recovery in progress
May 12 16:37:08 mistral postgres[17086]: [182-1] LOG:  redo starts at 6/92B32D88
May 12 16:45:29 mistral postgres[17086]: [183-1] LOG:  could not open file "/var/lib/postgres/data/pg_xlog/0000000700000071" (log file 7, segment 113): No such file or directory
May 12 16:45:29 mistral postgres[17086]: [184-1] LOG:  redo done at 7/70FFC400
May 12 16:45:50 mistral postgres[506]: [177-1] LOG:  startup process (PID 17086) was terminated by signal 6
May 12 16:45:50 mistral postgres[506]: [178-1] LOG:  aborting startup due to startup process failure

I also have many "recycled" messages:

May 12 13:10:01 mistral postgres[16373]: [20-1] LOG:  recycled transaction log file "0000000600000078"
May 12 13:10:01 mistral postgres[16373]: [21-1] LOG:  recycled transaction log file "0000000600000079"
May 12 13:10:01 mistral postgres[16373]: [22-1] LOG:  recycled transaction log file "000000060000007A"
May 12 13:10:01 mistral postgres[16373]: [23-1] LOG:  recycled transaction log file "000000060000007B"
May 12 13:10:01 mistral postgres[16373]: [24-1] LOG:  recycled transaction log file "000000060000007C"
May 12 13:20:03 mistral postgres[16408]: [1-1] LOG:  removing transaction log file "0000000600000090"
May 12 13:20:03 mistral postgres[16408]: [2-1] LOG:  removing transaction log file "0000000600000091"

> Note that your calculation is off anyway: per documentation, recent
> versions will allow up to 2*checkpoint_segments+1 WAL files to exist.
> This is because we save the last two checkpoints worth of data.

That's why I gave it an order of magnitude leeway.  Such calculations
are always off by a factor or two.

It's doing it again at this very moment, I get the feeling that pg_xlog
is converging on the same size as $PGDATA/base/$OID/pgsql_tmp:

mistral:/backup_temp# du -hs pg_xlog /var/lib/postgres/data/base/17142/pgsql_tmp/
4.8G    pg_xlog
5.4G    /var/lib/postgres/data/base/17142/pgsql_tmp

-jwb

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)