
Re: Query on partitioned table needs memory n_partitions * work_mem


 



On Thu, 18 Jul 2024, David Rowley wrote:

On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <jimis@xxxxxxx> wrote:

* The memory is unnecessarily allocated early on, before any partitions
   are actually aggregated. I know this because I/O is slow on this device
   and the table sizes are huge; it's simply not possible that postgres
   went through all partitions and blew up the memory. That would take
   hours, but the OOM happens seconds after I start the query.

That's interesting. Certainly, there is some memory allocated during
executor startup, but that amount should be fairly small.  Are you
able to provide a self-contained test case that shows the memory
blowing up before execution begins?

I'm trying hard to create a self-contained way to reproduce the issue.
It's not easy; the behaviour is a bit unstable. So far I see high memory
usage (8-10GB) but not enough to reliably trigger the OOM killer (it did
OOM once, but so unpredictably that I didn't manage to measure it or
reproduce it later).
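
For what it's worth, this is roughly how I watch the backend's memory while
the query runs (a rough sketch: the /proc part assumes Linux, and
pg_log_backend_memory_contexts() needs PostgreSQL 14 or later plus superuser):

-- In the session that will run the query:
SELECT pg_backend_pid();

-- Then, from a shell, watch the resident set size of that PID:
--   watch -n1 'grep VmRSS /proc/<pid>/status'

-- Or, from a second session, dump the backend's memory contexts to the
-- server log while the query is running:
SELECT pg_log_backend_memory_contexts(<pid>);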



-- I create a table with 2k partitions.

CREATE TABLE   partitioned_table1(
    run_n       bigint   GENERATED ALWAYS AS IDENTITY,
    workitem_n  integer  NOT NULL,
    label_n     smallint,
    result      smallint NOT NULL,
    PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

DO $$
for i in range(0, 2000):
    stmt = f'''
        CREATE TABLE part_max{i+1}M
            PARTITION OF partitioned_table1
            FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000})
    '''
    plpy.execute(stmt)
$$ LANGUAGE plpython3u;
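
-- (Optional sanity check, not part of the reproduction: verify that all
-- 2000 partitions were actually created.)

SELECT count(*)
FROM pg_inherits
WHERE inhparent = 'partitioned_table1'::regclass;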


-- I insert random data. First I insert to all partitions, 1M rows each:

DO $$
for i in range(0, 2000):
    stmt = f'''
        INSERT INTO partitioned_table1(workitem_n, label_n, result)
            SELECT
                    j-j%4,
                    CAST(random()*1000 AS INTEGER),
                    CAST(random()*3    AS INTEGER)
                FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
    '''
    plpy.info(stmt)
    plpy.execute(stmt)
    plpy.commit()
$$ LANGUAGE plpython3u;
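
-- (Another optional check: total on-disk size of all partitions, just to get
-- a feeling for how much data the query below has to scan; exact numbers
-- will vary.)

SELECT pg_size_pretty(sum(pg_total_relation_size(inhrelid::regclass)))
FROM pg_inherits
WHERE inhparent = 'partitioned_table1'::regclass;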


-- Disable parallel execution and incremental sort, so that the planner
-- prefers a HashAggregate over a sorted GroupAggregate:

SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;

SET SESSION work_mem TO '8MB';


-- Now the following query should do a HashAggregate:

SELECT
     workitem_n, label_n, bool_or(result IN (2,3))
FROM
     partitioned_table1
GROUP BY
     workitem_n, label_n
LIMIT 10;
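
-- (To double-check that the plan actually uses a HashAggregate, one can look
-- at it first; the exact shape may differ between setups and statistics.)

EXPLAIN (COSTS OFF)
SELECT workitem_n, label_n, bool_or(result IN (2,3))
FROM partitioned_table1
GROUP BY workitem_n, label_n
LIMIT 10;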


-- How high was the RSS of the backend while the previous query was
-- running? Not that high. But if we insert a few million more rows into
-- the 1st partition, it becomes much higher.

DO $$
for i in range(0,2000):
    stmt = f'''
        INSERT INTO partitioned_table1(workitem_n, label_n, result)
            SELECT
                    j%1000000,
                    CAST(random()*20000 AS INTEGER),
                    CAST(random()*4    AS INTEGER)
                FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
    '''
    plpy.info(stmt)
    plpy.execute(stmt)
    plpy.commit()
$$ LANGUAGE plpython3u;
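
-- (Optional: how big has the 1st partition become after this? Given the
-- naming scheme above, the partition for values 0..1M should be part_max1m.)

SELECT pg_size_pretty(pg_total_relation_size('part_max1m'));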


-- Now the same query as before consumes between 8GB and 10GB RSS. The
-- more data I insert (into all partitions?), the more memory the query
-- takes.


Overall:

* I don't see the RSS memory usage (8GB) growing proportionally to work_mem
  as I expected. If I increase work_mem from 4MB to 8MB then the RSS roughly
  doubles (from ~4GB to ~8GB), but if I increase it further the difference is
  minuscule and no OOM happens.

* Instead I notice the RSS growing slowly as I insert more and more data
  into the table (especially into the 1st partition, I think).

* Finally I don't see the memory being freed by the backend after the
  SELECT finishes. The system is relieved only when I disconnect psql and
  the backend dies. Not sure if that's by design or not. (A query for
  inspecting what the backend still holds is sketched right after this
  list.)
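
Something along these lines (needs PostgreSQL 14 or later for the
pg_backend_memory_contexts view) should show what the backend is still
holding after the SELECT has finished; I haven't analysed its output in
depth yet:

SELECT name, parent, pg_size_pretty(total_bytes) AS total
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;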


Having wasted a long time on this, the least I can do is submit a
documentation patch. Under enable_partitionwise_aggregate, something like
"WARNING: it can increase the memory usage by at least
n_partitions * work_mem". How do I proceed with such a patch? Pointers
would be appreciated. :-)

I think mentioning something about this in enable_partitionwise_join
and enable_partitionwise_aggregate is probably wise. I'll propose a
patch on pgsql-hackers.

David and Ashutosh, thank you both for your interest in improving the
documentation. Unfortunately I'm no longer sure what exactly is going on
here; I don't understand how the memory is growing. One thing I can verify
is that it's definitely caused by partitioning: I have another similarly
huge table that is unpartitioned, and no such issues show up there.
Maybe someone with knowledge of the HashAggregate algorithm and
partitioning can throw some ideas in.
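
For a rough sense of scale, the naive bound from the subject line would be
n_partitions * work_mem = 2000 * 8MB ~= 16GB for the setup above, which is
the same order of magnitude as the 8-10GB RSS I actually observe.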


Regards,
Dimitris






