How to craft a query that uses memory?

<Holger.Friedrich-Fa-Trivadis@xxxxxxxxx> · Fri, 19 Jun 2015 12:42:20 +0200

Hello list,

What would be the best way of setting up a query that uses lots of memory, in order to study the effects of resource usage parameters on the performance of the query?

I thought that a query on a large table involving JOINs would be a way, but while the backend initially used 1.7 gigs of  memory, it only uses a couple dozen megs of memory anymore after I restarted PostgreSQL because the checkpointer process had 2.4 gigs
of memory in use.  (The table consumes nearly 1.3 gigs of disk space, and the query returns one billion as the join used results in one billion rows.)

The query I was using is this one:

SELECT count(*) FROM pgbench_accounts a JOIN pgbench_accounts b ON (a.abalance = b.abalance);

which I ran after initializing the table using pgbench –I –s 100, and populating the abalance field with

UPDATE pgbench_accounts SET abalance = aid % 100000 + 1;

The query plan reads,

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM pgbench_accounts a JOIN pgbench_accounts b ON (a.abalance = b.abalance);
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=22479359.09..22479359.10 rows=1 width=0) (actual time=787769.456..787769.456 rows=1 loops=1)
   ->  Merge Join  (cost=3126660.67..19725402.17 rows=1101582767 width=0) (actual time=62254.349..621119.691 rows=1000000000 loops=1)
         Merge Cond: (a.abalance = b.abalance)
         ->  Sort  (cost=1563330.33..1588330.33 rows=10000000 width=4) (actual time=27706.051..42305.488 rows=10000000 loops=1)
               Sort Key: a.abalance
               Sort Method: external merge  Disk: 136624kB
               ->  Seq Scan on pgbench_accounts a  (cost=0.00..263935.00 rows=10000000 width=4) (actual time=0.260..11592.979 rows=10000000 loops=1)
         ->  Materialize  (cost=1563330.33..1613330.33 rows=10000000 width=4) (actual time=34548.224..209123.713 rows=999999901 loops=1)
               ->  Sort  (cost=1563330.33..1588330.33 rows=10000000 width=4) (actual time=34548.202..43988.283 rows=10000000 loops=1)
                     Sort Key: b.abalance
                     Sort Method: external merge  Disk: 136624kB
                     ->  Seq Scan on pgbench_accounts b  (cost=0.00..263935.00 rows=10000000 width=4) (actual time=0.017..16807.894 rows=10000000 loops=1)
 Total runtime: 787836.841 ms
(13 Zeilen)

The point of the SELECT count(*) (as opposed to something like SELECT a.*) is, of course, to make sure we are not measuring the impact of the memory consumption of pgbench.

There are sorts involved, which I heard might require working memory but could also “spill to disk” as someone put it.  So how do I craft a query that actually does use lots of memory?

Cheers,
Holger Friedrich