Cory Tucker <cory.tucker@xxxxxxxxx> writes:
> On Mon, Nov 20, 2017 at 9:36 AM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> The only thing I can think of offhand that could create temp files far in
>> excess of the actual data volume is if a hash join repeatedly decides that
>> it needs to increase the number of hash batches.  We have seen that happen
>> in cases where an inner relation contains a vast number of identical hash
>> key values: they all go into the same bucket, and the executor keeps
>> splitting the batch trying to reduce its size, but being all identical
>> values it will never be the case that some rows go into a different batch
>> than others.  There is logic in there to give up splitting when this
>> happens, but maybe it's not firing in your case, or maybe somebody broke
>> it recently :-(.

> What was happening is that there is a fairly non-unique "parent_id" value
> (66k times) that is creating a very large hashkey that it cannot break
> into smaller chunks -- so, essentially what your guess was, Tom.  Perhaps
> worth investigating whether that code is still functioning as intended.

I tried to duplicate this result, using a test case constructed like this:

create table hj(f1 int, filler text);
alter table hj alter column filler set storage plain;  -- prevent compression
insert into hj select x, repeat('xyzzy', 4000/5) from generate_series(1,1000000) x;
insert into hj select 42, repeat('xyzzy', 4000/5) from generate_series(1,10000) x;
vacuum analyze hj;
set enable_mergejoin TO 0;
set max_parallel_workers_per_gather TO 0;
explain analyze select count(h1.filler), count(h2.filler)
  from hj h1, hj h2 where h1.f1 = h2.f1;

This causes a hash join between two datasets a bit under 4GB each, with
the largest bucket being circa 40MB.  work_mem is at the default 4MB, so
that bucket will be large enough to force split attempts.

What I see is that the temp disk space rises slowly to ~7.5GB and then
declines smoothly to ~40MB, where it stays for a while before the query
finishes.  The EXPLAIN ANALYZE report says

 Aggregate  (cost=10020068069.47..10020068069.48 rows=1 width=16) (actual time=81812.299..81812.299 rows=1 loops=1)
   ->  Hash Join  (cost=10001024835.00..10019362963.81 rows=141021131 width=8008) (actual time=10227.585..67092.480 rows=101020000 loops=1)
         Hash Cond: (h1.f1 = h2.f1)
         ->  Seq Scan on hj h1  (cost=0.00..515100.00 rows=1010000 width=4008) (actual time=0.023..2531.739 rows=1010000 loops=1)
         ->  Hash  (cost=515100.00..515100.00 rows=1010000 width=4008) (actual time=10197.922..10197.922 rows=1010000 loops=1)
               Buckets: 1024 (originally 1024)  Batches: 2097152 (originally 1024)  Memory Usage: 39462kB
               ->  Seq Scan on hj h2  (cost=0.00..515100.00 rows=1010000 width=4008) (actual time=0.003..2582.198 rows=1010000 loops=1)
 Planning time: 0.062 ms
 Execution time: 81832.867 ms

Now, there's definitely something busted here; it should not have gone as
far as 2 million batches before giving up on splitting.  But that breakage
isn't translating into disk space bloat.  I haven't traced through the code
yet, but I believe what's likely happening is that it's creating new batch
files but freeing the old ones, so that the disk space usage remains
constant during the fruitless split attempts.

So there's some aspect of what you're seeing that this simple case doesn't
capture.  I wonder if you can build a publishable test case that shows the
space bloat you're experiencing.
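In case it helps with measuring that: here's a rough, untested sketch of a
query for watching the live temp-file footprint from a second session while
the join runs.  It assumes superuser access and that the temp files land
under the default tablespace's base/pgsql_tmp directory.

-- Untested sketch: sums the sizes of the files currently sitting in the
-- default tablespace's temp directory.  Requires superuser, and will error
-- out if base/pgsql_tmp hasn't been created yet.  Run it repeatedly in
-- another session to watch the temp space grow and shrink during the join.
select count(*) as temp_files,
       pg_size_pretty(sum((pg_stat_file('base/pgsql_tmp/' || f)).size)) as temp_space
from pg_ls_dir('base/pgsql_tmp') as f;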
The behavior should only depend on the hash join key values --- since
those are apparently account numbers, maybe they're nonsensitive, or you
could obscure them without changing the behavior?
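Just as an illustration --- an untested sketch, where "your_table" and the
filler width are placeholders I'm inventing --- one way to keep the key's
duplicate distribution while hiding the real values might be:

-- Untested sketch: "your_table" is a placeholder for the real table, and
-- repeat('x', 4000) stands in for whatever wide columns are actually there.
-- hashtext() maps equal keys to equal values, so the 66k-row duplicate
-- group survives, but the original account numbers aren't exposed.  (A rare
-- hash collision could merge two distinct keys, which shouldn't matter for
-- reproducing the batch-splitting behavior.)
create table hj_repro as
  select hashtext(parent_id::text) as parent_id,
         repeat('x', 4000) as filler
  from your_table;
analyze hj_repro;

			regards, tom lane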