On Thu, Sep 12, 2024 at 8:54 PM Pecsök Ján <jan.pecsok@xxxxxxxxxxx> wrote:
> In the link you provided there is a mention that in PostgreSQL 16 data is
> not being compressed for the PostgreSQL 16 server. Does that mean that
> PostgreSQL 16 uses much more space while computing queries?
> If that is the case, it could be our problem, because our queries sometimes
> use several TB of disk space for computation, and if there is a
> considerable increase in disk usage during the queries, it can happen that
> 27TB is sometimes not enough.

The kind of compression discussed there is a btrfs feature.  Xfs doesn't
have compression.

> I have 2 questions:
>
> Is there any workaround so that Postgres won't use FileFallocate? Maybe
> set something in Linux not to allow Postgres to use it?

Not currently.  I was thinking of proposing to introduce a setting and
back-patching it into 16, because it's a sort of regression for btrfs
users (and a hard one to foresee).  It is not at all clear what exactly is
happening on this xfs system, but it seems to be something else...

> The change was introduced in Postgres 16. Does that mean that Postgres
> 15.8 should have the old behaviour?

Yes.

> We don't use COPY in our queries.

OK, so it's presumably due to having lots of concurrent DML operations
(most likely INSERT, possibly also UPDATE) that need to extend the
relation.  I'm not sure of the exact behaviour of the heuristics off the
top of my head (but basically it's driven by waitcount[1])... perhaps if
you had only 7 concurrent DML operations and not 8+, it would be less
likely to take the fallocate path, something like that...  That "8" is the
threshold I was thinking of turning into a GUC, perhaps in the November
minor release, but it's not exactly clear that posix_fallocate() is really
the problem.  (I see that there have been bugs in xfs's posix_fallocate()
space accounting, but the one I found was about redundant
posix_fallocate() over a region that is already allocated, which
PostgreSQL doesn't do.)

However, it is far from clear what is actually going wrong here.  Although
it seems to imply a pretty weird/bogus use of ENOSPC by the kernel, the
link I posted seems to be hinting that something a bit different is going
on.  It may be clutching at straws, but you might try increasing those
ulimits.  I'm not sure how to try to reproduce it in lab conditions, since
it's apparently pretty hard to hit, based on your 1-2 week MTBF on what
sounds like a massive and busy system.  Hmm...

[1] https://github.com/postgres/postgres/commit/00d1e02be24987180115e371abaeb84738257ae2
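
In case it helps to visualise the heuristic, here is a very rough sketch,
from memory and heavily simplified.  It is not the actual PostgreSQL code
(see hio.c and md.c in the commit at [1] for the real logic); the function
name, the 8kB block size and the exact arithmetic are only illustrative:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    static void
    extend_relation(int fd, off_t old_size, int needed_pages, int waitcount)
    {
        /* More waiters on the extension lock -> extend by a bigger batch. */
        int extend_by = needed_pages * (waitcount + 1);

        if (extend_by > 64)
            extend_by = 64;         /* cap the batch size */

        if (extend_by > 8)
        {
            /* Large batches: ask the filesystem to allocate space directly. */
            (void) posix_fallocate(fd, old_size,
                                   (off_t) extend_by * BLOCK_SIZE);
        }
        else
        {
            /* Small batches: write zero-filled blocks, as older releases did. */
            char zeroes[BLOCK_SIZE];

            memset(zeroes, 0, sizeof(zeroes));
            for (int i = 0; i < extend_by; i++)
                (void) pwrite(fd, zeroes, sizeof(zeroes),
                              old_size + (off_t) i * BLOCK_SIZE);
        }
    }

The point is that with 8 or more concurrent extenders the batch grows past
8 blocks and the posix_fallocate() path is taken, whereas smaller batches
are still zero-filled with plain writes, roughly as older releases did.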