Thanks Laurenz & Imran for your comments.
My responses inline below.
Thanks
Tom
> On Mon, 2024-07-15 at 14:47 -0400, Thomas Simpson wrote:
> > I have a large database (multi-TB) which had a VACUUM FULL running,
> > but the database ran out of space during the rebuild of one of the
> > large tables. Cleaning down the WAL files got the database restarted
> > (an archiving problem led to the initial disk-full condition).
> > However, disk usage is still at 99%, as the files from the
> > large-table rebuild appear to still be hanging around using space
> > and have not been deleted.
> >
> > My problem now is how to get this space back. I tried some scripts
> > to map the data files to relations, but this didn't work: removing
> > some files led to startup failure, despite them appearing to be
> > unrelated to anything in the database. I had to put them back, and
> > then startup worked. Any suggestions here?
>
> That reads like the sad old story: "cleaning down" WAL files - you
> mean deleting the very files that would have enabled PostgreSQL to
> recover from the crash that was caused by the full file system. Did
> you run "pg_resetwal"? If yes, that probably led to data corruption.
No, I just removed the excess, already-archived WALs to get space
and restarted. The VACUUM FULL that was running had created files
for the large table it was processing, and these are still hanging
around eating space without doing anything useful. The shutdown
prevented the rollback from cleanly removing them, which seems to
be the core problem.
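For context, the supported way to map relations to their data files is from inside the database, not from the filesystem side. A sketch (database and table names are placeholders):

```shell
# Show the on-disk path of one relation, relative to the data directory:
psql -d mydb -c "SELECT pg_relation_filepath('some_big_table');"

# List the relfilenodes in the current database. Files under
# base/<database-oid>/ with no matching relfilenode are candidates for
# orphans left behind by the interrupted VACUUM FULL - but be careful:
# mapped system catalogs report 0 here, so absence from this list is
# not proof that a file is unused.
psql -d mydb -c "SELECT relname, relfilenode FROM pg_class WHERE relfilenode <> 0;"
```

Even with that mapping, removing files by hand is risky - as seen above, deleting an apparently unrelated file can stop the server from starting.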
> The above are just guesses. Anyway, there is no good way to get rid
> of the files that were left behind after the crash. The reliable way
> of doing so is also the way to get rid of potential data corruption
> caused by "cleaning down" the database: pg_dump the whole thing and
> restore the dump to a new, clean cluster. Yes, that will be a
> painfully long down time. An alternative is to restore a backup
> taken before the crash.
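A minimal version of that dump-and-restore, streaming over the network so no intermediate dump file is needed (host and database names are placeholders):

```shell
# Create the target database on the new, clean cluster first, then
# stream the dump straight into it over the link between the servers:
pg_dump -h oldserver -d mydb | psql -h newserver -d mydb
```

The trade-off is that a single-stream pipe cannot use parallel jobs; for a multi-TB database, a directory-format dump restored with parallel workers is usually faster.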
My issue now is that the dump & reload is taking a huge amount of
time. I know the hardware is capable of multi-GB/s throughput, but
the reload is projected to take about 10 days at the current rate
(about 30 Mb/sec). The old and new servers have a 10G link between
them and the storage is SSD-backed, so the hardware is capable of
much, much more than it is doing now. Is there a way to improve the
reload performance? Tuning of any type is fine - even if I need to
undo it later once the reload is done.
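For what it's worth, the usual levers for a one-off bulk reload are parallelism and temporarily relaxing durability. A sketch, assuming a directory-format dump and that the settings are reverted once the reload is done (paths, job counts, and database name are illustrative):

```shell
# Dump in directory format so both the dump and the restore can run
# with parallel workers:
pg_dump --format=directory --jobs=8 --file=/backup/dump mydb

# Temporary settings on the *target* cluster for the duration of the
# load (unsafe for normal operation; revert and restart afterwards):
#   fsync = off
#   synchronous_commit = off
#   full_page_writes = off
#   autovacuum = off
#   max_wal_size = '64GB'
#   maintenance_work_mem = '2GB'

# Parallel restore into the new cluster:
pg_restore --jobs=8 --dbname=mydb /backup/dump
```

With settings like fsync=off, a crash during the load means starting the restore over, which is usually an acceptable trade for a one-time migration.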
My backups were in progress when all the issues happened, so
they're not such a good starting point, and I'd actually prefer the
clean reload: this database has been through multiple upgrades
(without reloads) until now, so I know it's not especially clean.
The size has always made a full reload impractical before, but the
database is relatively low-traffic now, so I can afford some time
to reload - ideally not 10 days, though.
> Yours,
> Laurenz Albe