Thanks Laurenz & Imran for your comments.
My responses inline below.
Thanks
Tom
> On Mon, 2024-07-15 at 14:47 -0400, Thomas Simpson wrote:
> > I have a large database (multi-TB) which had a VACUUM FULL running,
> > but the database ran out of space during the rebuild of one of the
> > large tables. Cleaning down the WAL files got the database restarted
> > (an archiving problem led to the initial disk-full condition).
> > However, disk usage is still at 99%, as the files from the
> > large-table rebuild appear to still be hanging around using space
> > and have not been deleted.
> >
> > My problem now is how to get this space back. I tried some scripts
> > to map the data files to relations, but this didn't work: removing
> > some files led to startup failure, despite them appearing to be
> > unrelated to anything in the database. I had to put them back, and
> > then startup worked. Any suggestions here?
>
> That reads like the sad old story: "cleaning down" WAL files - you
> mean deleting the very files that would have enabled PostgreSQL to
> recover from the crash that was caused by the full file system. Did
> you run "pg_resetwal"? If yes, that probably led to data corruption.
No, I just removed the excess, already-archived WALs to get space
and restarted. The VACUUM FULL that was running had created files
for the large table it was processing, and these are still hanging
around eating space without doing anything useful. The shutdown
prevented the rollback from cleanly removing them, which seems to
be the core problem.
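For context, the supported way to map relations to their data files is from inside the database, not from the filesystem side. A sketch (database and table names are placeholders):

```shell
# Show the on-disk path of one relation, relative to the data directory:
psql -d mydb -c "SELECT pg_relation_filepath('some_big_table');"

# List the relfilenodes in the current database. Files under
# base/<database-oid>/ with no matching relfilenode are candidates for
# orphans left behind by the interrupted VACUUM FULL - but be careful:
# mapped system catalogs report 0 here, so absence from this list is
# not proof that a file is unused.
psql -d mydb -c "SELECT relname, relfilenode FROM pg_class WHERE relfilenode <> 0;"
```

Even with that mapping, removing files by hand is risky - as seen above, deleting an apparently unrelated file can stop the server from starting.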
> The above are just guesses. Anyway, there is no good way to get rid
> of the files that were left behind after the crash. The reliable way
> of doing so is also the way to get rid of potential data corruption
> caused by "cleaning down" the database: pg_dump the whole thing and
> restore the dump to a new, clean cluster. Yes, that will be a
> painfully long down time. An alternative is to restore a backup
> taken before the crash.
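A minimal version of that dump-and-restore, streaming over the network so no intermediate dump file is needed (host and database names are placeholders):

```shell
# Create the target database on the new, clean cluster first, then
# stream the dump straight into it over the link between the servers:
pg_dump -h oldserver -d mydb | psql -h newserver -d mydb
```

The trade-off is that a single-stream pipe cannot use parallel jobs; for a multi-TB database, a directory-format dump restored with parallel workers is usually faster.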
My issue now is that the dump & reload is taking a huge amount of
time. I know the hardware is capable of multi-GB/s throughput, but
the reload is projected to take about 10 days at the current rate
(about 30 Mb/sec). The old and new servers have a 10G link between
them and the storage is SSD-backed, so the hardware is capable of
much, much more than it is doing now. Is there a way to improve the
reload performance? Tuning of any type is fine - even if I need to
undo it later once the reload is done.
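For what it's worth, the usual levers for a one-off bulk reload are parallelism and temporarily relaxing durability. A sketch, assuming a directory-format dump and that the settings are reverted once the reload is done (paths, job counts, and database name are illustrative):

```shell
# Dump in directory format so both the dump and the restore can run
# with parallel workers:
pg_dump --format=directory --jobs=8 --file=/backup/dump mydb

# Temporary settings on the *target* cluster for the duration of the
# load (unsafe for normal operation; revert and restart afterwards):
#   fsync = off
#   synchronous_commit = off
#   full_page_writes = off
#   autovacuum = off
#   max_wal_size = '64GB'
#   maintenance_work_mem = '2GB'

# Parallel restore into the new cluster:
pg_restore --jobs=8 --dbname=mydb /backup/dump
```

With settings like fsync=off, a crash during the load means starting the restore over, which is usually an acceptable trade for a one-time migration.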
My backups were in progress when all the issues happened, so
they're not such a good starting point, and I'd actually prefer the
clean reload: this database has been through multiple upgrades
(without reloads) until now, so I know it's not especially clean.
The size has always made a full reload impractical before, but the
database is relatively low-traffic now, so I can afford some time
to reload - ideally not 10 days, though.
> Yours,
> Laurenz Albe