On Sun, Apr 24, 2022 at 8:00 AM Peter <pmc@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
> More than a million files in a single directory, this is
> inacceptable.

You didn't show EXPLAIN (ANALYZE), but if [Parallel] Hash is making insane numbers of temporary files then something is not working as intended... and I can take a guess at what it is.

I tried to recreate this with randomly distributed file.pathid and same size tables as you, and I got 32 batches and a nice manageable number of temporary files. Adding millions of extra file rows with duplicate pathid=42 gets me something like "Batches: 524288 (originally 32)" in EXPLAIN (ANALYZE) output. I guess that's the sort of distribution you have here? Extensive discussion of the cause of that and potential (hard) solutions over here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

To summarise, when the hash table doesn't fit in work_mem, we "partition" (spill part of the data to disk) to make twice as many (hopefully) smaller hash tables that do fit. Sometimes partitioning produces one or more hash tables that are still too big because of uneven key distribution, so we go around again, doubling the number of partitions (and thus temporary files) every time. I would say that once we get past hundreds of partitions, things are really turning to custard (file descriptors, directory entries, memory overheads, ... it just doesn't work well any more), but currently we don't give up until it becomes very clear that repartitioning is not helping.

This algorithmic problem existed before parallelism was added, but it's possible that the parallel version of the meltdown hurts a lot more (it has extra per-process files, and in multi-attempt scenarios the useless earlier attempts hang around until the end of the query instead of being cleaned up sooner, which doubles the number of files).

Hopefully that gives some clues about how one might rewrite the query to avoid a massive unsplittable set of duplicate keys in hash tables, assuming I'm somewhere close to the explanation here (maybe some subselect with DISTINCT or GROUP BY in it, or some way to rewrite the query to avoid having the problematic duplicates on the "inner" side, or a completely different plan..., or just crank up work_mem massively for this query so that you don't need a partition step at all); see the sketch at the end of this mail. Obviously it would be nice for PostgreSQL to be hardened against this risk eventually, though.

As for merge join planning, unfortunately merge joins aren't fully parallelisable, and in the plan you showed, a separate copy of the big sort runs in every process, which isn't nice (we don't have a Parallel Sort that could fix that yet).
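
PS: To make those last suggestions a little more concrete, here is a rough sketch. I don't have your actual query, so the join shape and the "path" table name are assumptions on my part; only file.pathid appears in what you posted. The two ideas are: give just this query enough memory that no partition step is needed at all, or collapse the duplicate pathid values in a subselect before they reach the hash table.

  -- Option 1 (assumption: the inner hash table fits in a few GB):
  -- raise work_mem for this session only, so the hash join never
  -- has to partition, then put it back.
  SET work_mem = '2GB';
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT ...;                      -- your original query here
  RESET work_mem;

  -- Option 2 (assumption: you only need to know which pathid values
  -- exist in file, not every duplicate row): deduplicate the skewed
  -- side so the hash table sees each key once.
  SELECT p.*
  FROM path AS p                               -- hypothetical table name
  JOIN (SELECT DISTINCT pathid FROM file) AS f -- one row per pathid
    USING (pathid);

Whether option 2 is legal for your query depends on whether the extra file columns and duplicate rows actually matter to the result, which only you can judge from the query itself.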