On Sun, Apr 24, 2022 at 8:00 AM Peter <pmc@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
> More than a million files in a single directory, this is
> inacceptable.

You didn't show EXPLAIN (ANALYZE), but if [Parallel] Hash is making insane numbers of temporary files then something is not working as intended... and I can take a guess at what it is.

I tried to recreate this with randomly distributed file.pathid and same size tables as you, and I got 32 batches and a nice manageable number of temporary files. Adding millions of extra file rows with duplicate pathid=42 gets me something like "Batches: 524288 (originally 32)" in EXPLAIN (ANALYZE) output. I guess that's the sort of distribution you have here? Extensive discussion of the cause of that and potential (hard) solutions over here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

To summarise, when the hash table doesn't fit in work_mem, we "partition" (spill part of the data to disk) to make twice as many (hopefully) smaller hash tables that do fit. Sometimes partitioning produces one or more hash tables that are still too big because of uneven key distribution, so we go around again, doubling the number of partitions (and thus temporary files) every time. I would say that once we get past hundreds of partitions, things are really turning to custard (file descriptors, directory entries, memory overheads, ... it just doesn't work well any more), but currently we don't give up until it becomes very clear that repartitioning is not helping.

This algorithmic problem existed before parallelism was added, but it's possible that the parallel version of the meltdown hurts a lot more (it has extra per-process files, and in multi-attempt scenarios the useless earlier attempts hang around until the end of the query instead of being cleaned up sooner, which doubles the number of files).

Hopefully that gives some clues about how one might rewrite the query to avoid a massive unsplittable set of duplicate keys in hash tables, assuming I'm somewhere close to the explanation here (maybe some subselect with DISTINCT or GROUP BY in it, or some way to rewrite the query to avoid having the problematic duplicates on the "inner" side, or a completely different plan..., or just crank up work_mem massively for this query so that you don't need a partition step at all); see the sketch at the end of this mail. Obviously it would be nice for PostgreSQL to be hardened against this risk eventually, though.

As for merge join planning, unfortunately merge joins aren't fully parallelisable, and in the plan you showed, a separate copy of the big sort runs in every process, which isn't nice (we don't have a Parallel Sort that could fix that yet).
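
PS: To make those last suggestions a little more concrete, here is a rough sketch. I don't have your actual query, so the join shape and the "path" table name are assumptions on my part; only file.pathid appears in what you posted. The two ideas are: give just this query enough memory that no partition step is needed at all, or collapse the duplicate pathid values in a subselect before they reach the hash table.

  -- Option 1 (assumption: the inner hash table fits in a few GB):
  -- raise work_mem for this session only, so the hash join never
  -- has to partition, then put it back.
  SET work_mem = '2GB';
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT ...;                      -- your original query here
  RESET work_mem;

  -- Option 2 (assumption: you only need to know which pathid values
  -- exist in file, not every duplicate row): deduplicate the skewed
  -- side so the hash table sees each key once.
  SELECT p.*
  FROM path AS p                               -- hypothetical table name
  JOIN (SELECT DISTINCT pathid FROM file) AS f -- one row per pathid
    USING (pathid);

Whether option 2 is legal for your query depends on whether the extra file columns and duplicate rows actually matter to the result, which only you can judge from the query itself.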