Cory Tucker <cory.tucker@xxxxxxxxx> writes:
> On Mon, Nov 20, 2017 at 9:36 AM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> The only thing I can think of offhand that could create temp files far in
>> excess of the actual data volume is if a hash join repeatedly decides that
>> it needs to increase the number of hash batches.  We have seen that happen
>> in cases where an inner relation contains a vast number of identical hash
>> key values: they all go into the same bucket, and the executor keeps
>> splitting the batch trying to reduce its size, but being all identical
>> values it will never be the case that some rows go into a different batch
>> than others.  There is logic in there to give up splitting when this
>> happens, but maybe it's not firing in your case, or maybe somebody broke
>> it recently :-(.

> What was happening is that there is a fairly non-unique "parent_id" value
> (66k times) that is creating a very large hashkey that it cannot break
> into smaller chunks -- so, essentially what your guess was, Tom.  Perhaps
> worth investigating whether that code is still functioning as intended.

I tried to duplicate this result, using a test case constructed like this:

create table hj(f1 int, filler text);
alter table hj alter column filler set storage plain;  -- prevent compression
insert into hj select x, repeat('xyzzy', 4000/5) from generate_series(1,1000000) x;
insert into hj select 42, repeat('xyzzy', 4000/5) from generate_series(1,10000) x;
vacuum analyze hj;
set enable_mergejoin TO 0;
set max_parallel_workers_per_gather TO 0;
explain analyze select count(h1.filler), count(h2.filler)
  from hj h1, hj h2 where h1.f1 = h2.f1;

This causes a hash join between two datasets a bit under 4GB each, with
the largest bucket being circa 40MB.  work_mem is at the default 4MB, so
that bucket will be large enough to force split attempts.

What I see is that the temp disk space rises slowly to ~7.5GB and then
declines smoothly to ~40MB, where it stays for a while before the query
finishes.  The EXPLAIN ANALYZE report says

 Aggregate  (cost=10020068069.47..10020068069.48 rows=1 width=16) (actual time=81812.299..81812.299 rows=1 loops=1)
   ->  Hash Join  (cost=10001024835.00..10019362963.81 rows=141021131 width=8008) (actual time=10227.585..67092.480 rows=101020000 loops=1)
         Hash Cond: (h1.f1 = h2.f1)
         ->  Seq Scan on hj h1  (cost=0.00..515100.00 rows=1010000 width=4008) (actual time=0.023..2531.739 rows=1010000 loops=1)
         ->  Hash  (cost=515100.00..515100.00 rows=1010000 width=4008) (actual time=10197.922..10197.922 rows=1010000 loops=1)
               Buckets: 1024 (originally 1024)  Batches: 2097152 (originally 1024)  Memory Usage: 39462kB
               ->  Seq Scan on hj h2  (cost=0.00..515100.00 rows=1010000 width=4008) (actual time=0.003..2582.198 rows=1010000 loops=1)
 Planning time: 0.062 ms
 Execution time: 81832.867 ms

Now, there's definitely something busted here; it should not have gone as
far as 2 million batches before giving up on splitting.  But that breakage
isn't translating into disk space bloat.  I haven't traced through the code
yet, but I believe what's likely happening is that it's creating new batch
files but freeing the old ones, so that the disk space usage remains
constant during the fruitless split attempts.

So there's some aspect of what you're seeing that this simple case doesn't
capture.  I wonder if you can build a publishable test case that shows the
space bloat you're experiencing.
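In case it helps with measuring that: here's a rough, untested sketch of a
query for watching the live temp-file footprint from a second session while
the join runs.  It assumes superuser access and that the temp files land
under the default tablespace's base/pgsql_tmp directory.

-- Untested sketch: sums the sizes of the files currently sitting in the
-- default tablespace's temp directory.  Requires superuser, and will error
-- out if base/pgsql_tmp hasn't been created yet.  Run it repeatedly in
-- another session to watch the temp space grow and shrink during the join.
select count(*) as temp_files,
       pg_size_pretty(sum((pg_stat_file('base/pgsql_tmp/' || f)).size)) as temp_space
from pg_ls_dir('base/pgsql_tmp') as f;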
The behavior should only depend on the hash join key values --- since
those are apparently account numbers, maybe they're nonsensitive, or you
could obscure them without changing the behavior?
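Just as an illustration --- an untested sketch, where "your_table" and the
filler width are placeholders I'm inventing --- one way to keep the key's
duplicate distribution while hiding the real values might be:

-- Untested sketch: "your_table" is a placeholder for the real table, and
-- repeat('x', 4000) stands in for whatever wide columns are actually there.
-- hashtext() maps equal keys to equal values, so the 66k-row duplicate
-- group survives, but the original account numbers aren't exposed.  (A rare
-- hash collision could merge two distinct keys, which shouldn't matter for
-- reproducing the batch-splitting behavior.)
create table hj_repro as
  select hashtext(parent_id::text) as parent_id,
         repeat('x', 4000) as filler
  from your_table;
analyze hj_repro;

			regards, tom lane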