On Wed, Jun 17, 2015 at 6:58 AM, Alvaro Herrera <alvherre@xxxxxxxxxxxxxxx> wrote:
> Thomas Munro wrote:
>
>> Thanks. As mentioned elsewhere in the thread, I discovered that the
>> same problem exists for page boundaries, with a different error
>> message. I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
>> and master with the same results:
>>
>> FATAL: could not access status of transaction 2048
>> DETAIL: Could not read from file "pg_multixact/offsets/0000" at
>> offset 8192: Undefined error: 0.
>>
>> FATAL: could not access status of transaction 131072
>> DETAIL: Could not open file "pg_multixact/offsets/0002": No such file
>> or directory.
>
> So I checked this bug against current master, because it's claimed to be
> closed. The first script doesn't emit a message at all; the second
> script does emit a message:
>
> LOG: could not truncate directory "pg_multixact/offsets": apparent wraparound
>
> If you start and stop again, there's no more noise in the logs. That's
> pretty innocuous -- great.

Right, I included a fix for this in
https://commitfest.postgresql.org/5/265/ which handles both
pg_subtrans and pg_multixact, since it was lost in the noise in this
thread... Hopefully someone can review that.

> But then I modified your script to do two segments instead of one. Then
> after the second cycle is done, start the server and stop it again. The
> end result is a bit surprising: you end up with no files in
> pg_multixact/offsets at all!

Ouch. I see why: latest_page_number gets initialised to a different
value when you restart (computed from oldest multixact ID, whereas
during normal running it remembers the last created page number), so
in this case (next == oldest, next % 2048 == 0), restarting the server
moves latest_page_number forwards by one, so SimpleLruTruncate no
longer bails out with the above error message and it happily deletes
all files. That is conceptually OK (there are no multixacts, so no
files should be OK), but see below...
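
To make the boundary arithmetic concrete, here is a minimal sketch of how a
multixact ID maps to a page, a segment file and a byte offset. It assumes the
default build settings (8 kB blocks, 4-byte offsets, 32 pages per SLRU
segment); the struct and function names are made up for illustration and are
not the ones in multixact.c or slru.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed default build values, not read from a real installation. */
#define BLCKSZ                      8192u
#define MULTIXACT_OFFSETS_PER_PAGE  (BLCKSZ / 4u)   /* 2048 offsets per page */
#define SLRU_PAGES_PER_SEGMENT      32u

/* Hypothetical illustration type: where one multixact's offset entry lives. */
typedef struct MxactLocation
{
    uint32_t page;          /* SLRU page number */
    uint32_t segment;       /* segment file number */
    uint32_t file_offset;   /* byte offset of the page within the segment */
} MxactLocation;

/* Hypothetical helper: map a multixact ID to its page, segment and offset. */
static MxactLocation
locate_multixact(uint32_t multi)
{
    MxactLocation loc;

    loc.page = multi / MULTIXACT_OFFSETS_PER_PAGE;
    loc.segment = loc.page / SLRU_PAGES_PER_SEGMENT;
    loc.file_offset = (loc.page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
    return loc;
}

/* Hypothetical helper: format the segment number as a four-digit hex name. */
static void
format_segment_name(uint32_t segment, char *buf, size_t buflen)
{
    assert(buflen >= 5);
    snprintf(buf, buflen, "%04X", (unsigned int) segment);
}

/* True when the ID is the first entry on a page (multi % 2048 == 0). */
static int
is_page_boundary(uint32_t multi)
{
    return multi % MULTIXACT_OFFSETS_PER_PAGE == 0;
}
```

Under those assumptions, multixact 2048 is the first entry of page 1 in
segment 0000 at byte offset 8192, and multixact 131072 is the first entry of
page 64, which is the first page of segment 0002. Those are the file names
and offsets in the two FATAL messages quoted above.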

Applying the patch linked above prevents this problem (it always keeps
at least one multixact and therefore at least one page and therefore
at least one segment, because it steps back one multixact to avoid
boundary problems when oldest == next).

As for whether it's actually OK to have no files in
pg_multixact/offsets, it seems that if you restart *twice* after
running checkpoint-segment-boundary.sh, you finish up with earliest =
4294965248 in TruncateMultiXact, because this code assumes that at
least one file was found and then proceeds to assign (-1 * 2048) to
earliest (which is unsigned):

    trunc.earliestExistingPage = -1;
    SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
    earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
    if (earliest < FirstMultiXactId)
        earliest = FirstMultiXactId;

I think this should bail out if earliestExistingPage is still -1 after
the call to SlruScanDirectory.

-- 
Thomas Munro
http://www.enterprisedb.com
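
P.S. For anyone who wants to see the wraparound in isolation, here is a
standalone sketch of the arithmetic and of the kind of guard suggested above.
It is not the actual patch: the function names are hypothetical, and the
constants are the assumed default values (2048 offsets per page,
FirstMultiXactId of 1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed default values, redefined here so the sketch is standalone. */
#define MULTIXACT_OFFSETS_PER_PAGE  2048
#define FirstMultiXactId            ((uint32_t) 1)

/*
 * Hypothetical stand-in for the unguarded computation: the page number is a
 * signed int, and the product is converted to an unsigned 32-bit value.
 */
static uint32_t
unguarded_earliest(int earliest_existing_page)
{
    return (uint32_t) (earliest_existing_page * MULTIXACT_OFFSETS_PER_PAGE);
}

/*
 * Hypothetical guarded variant: report failure when no page was found
 * (page still -1) instead of computing a wrapped-around multixact ID.
 */
static bool
guarded_earliest(int earliest_existing_page, uint32_t *earliest)
{
    assert(earliest != NULL);

    if (earliest_existing_page < 0)
        return false;           /* no segment files found: bail out */

    *earliest = (uint32_t) earliest_existing_page * MULTIXACT_OFFSETS_PER_PAGE;
    if (*earliest < FirstMultiXactId)
        *earliest = FirstMultiXactId;
    return true;
}
```

With earliest_existing_page left at -1, unguarded_earliest returns 2^32 - 2048
= 4294965248, the value seen in TruncateMultiXact, while guarded_earliest
returns false and leaves the caller to skip truncation.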