hi,

Apologies for replying to myself, but I made this ~7.5x faster (8565 ms down to 1134 ms) by compiling from source with parallel_workers hard-coded from 7 up to 96. If this weren't a partitioned table, I could just do `alter table test_3pd_cstore_partitioned set (parallel_workers = 96)` - but that isn't currently supported on partitioned tables.

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index cd3fdd259c..0057a69d4e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1544,6 +1544,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 								   max_parallel_workers_per_gather);
 		Assert(parallel_workers > 0);
 
+		/* force a crazy parallelism */
+		parallel_workers = 96;
+
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
 										NIL, NULL, parallel_workers, true,

BEFORE:

postgres=# explain analyze select count(*) from test_3pd_cstore_partitioned where age > 50;

 Finalize Aggregate  (cost=558860.96..558860.97 rows=1 width=8) (actual time=8528.509..8538.627 rows=1 loops=1)
   ->  Gather  (cost=56918.65..558860.94 rows=7 width=8) (actual time=1863.462..8538.603 rows=64 loops=1)
         Workers Planned: 7
         Workers Launched: 7
         ->  Parallel Append  (cost=55918.65..557860.24 rows=1 width=8) (actual time=1877.875..8417.486 rows=8 loops=8)
 [...]
 Execution Time: 8565.734 ms

AFTER:

postgres=# explain analyze select count(*) from test_3pd_cstore_partitioned where age > 50;

 Finalize Aggregate  (cost=57030.20..57030.21 rows=1 width=8) (actual time=1125.828..1129.958 rows=1 loops=1)
   ->  Gather  (cost=56611.92..57029.96 rows=96 width=8) (actual time=994.708..1129.940 rows=64 loops=1)
         Workers Planned: 96
         Workers Launched: 96
         ->  Parallel Append  (cost=55611.92..56020.36 rows=1 width=8) (actual time=656.027..656.051 rows=1 loops=97)
 [...]
 Execution Time: 1133.810 ms

Should I try to submit a patch that adds support for `alter table test_3pd_cstore_partitioned set (parallel_workers = 96)` on partitioned tables?

Best,
Seamus

PS. The awesome thing about this is that I'm now table scanning a 270 million row, 600 column table in 1.2 seconds, which I never thought was possible.

PPS. I have no idea why 96 workers beat 64 (the number of cores I have), but it did - roughly 7.5x vs. 4x.
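PPPS. In case anyone wants the effect without patching the server: the reloption *is* accepted on the leaf partitions, so something like the untested sketch below might work. Caveat: per the comment in allpaths.c, non-partial child paths don't contribute to the planned worker count, so I'd only expect this to help where the children have partial (parallel-aware) paths - which my cstore_fdw partitions may not.

-- Untested sketch: set the reloption on every leaf partition, since the
-- planner takes the max over the partial child paths when sizing the
-- Parallel Append. Assumes the leaves are plain heap tables; ALTER TABLE
-- won't accept this on foreign-table partitions.
DO $$
DECLARE
    part regclass;
BEGIN
    FOR part IN
        SELECT inhrelid::regclass
        FROM pg_inherits
        WHERE inhparent = 'test_3pd_cstore_partitioned'::regclass
    LOOP
        EXECUTE format('ALTER TABLE %s SET (parallel_workers = 96)', part);
    END LOOP;
END;
$$;

Unlike the source hack above, this also raises the worker target for queries that touch only one partition, so it's a blunt instrument.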
On Sat, Feb 13, 2021, at 7:09 PM, Seamus Abshere wrote:
> hi,
>
> I've traced this back to the formula for Parallel Append workers:
> log2(partitions).
>
> The comment from Robert says (src/backend/optimizer/path/allpaths.c):
>
>     /*
>      * If the use of parallel append is permitted, always request at least
>      * log2(# of children) workers.
>
> In my case, every partition takes 1 second to scan, I have 64 cores, I
> have 64 partitions, and the wall time is 8 seconds with 8 workers.
>
> I assume that if it planned significantly more workers (16? 32? even
> 64?), it would get significantly faster (even accounting for
> transaction cost). So why doesn't it ask for more? Note that I've set
> max_parallel_workers = 512, etc. (see the postgresql.conf in my first
> message).
>
> Here are the full plans:
> https://gist.github.com/seamusabshere/1c1d869ee24145689f45e4124d8730d3
>
> Best,
> Seamus
>
> On Fri, Feb 12, 2021, at 4:50 PM, Seamus Abshere wrote:
> > hi,
> >
> > How can I convince Postgres to use more than 8 cores?
> >
> > I've got an r6gd.16xlarge with 64 vCPUs and 512 GB RAM running Postgres
> > 13.1 on Ubuntu 20.04.
> >
> > CREATE TABLE tbl (
> >     [...]
> > ) PARTITION BY HASH (address_key);
> >
> > It has 64 partitions - e.g. FOR VALUES WITH (modulus 64, remainder 0), etc.
> >
> > We're running `SELECT COUNT(*) FROM tbl`.
> >
> > I've watched top and I never see more than 8 cores going 100%.
> >
> > Here is my (admittedly ridiculous) postgresql.conf:
> >
> > checkpoint_completion_target = 0.9
> > data_directory = '/tank/postgresql/13/main'
> > default_statistics_target = 100
> > effective_cache_size = 381696MB
> > effective_io_concurrency = 200
> > enable_partition_pruning = on
> > enable_partitionwise_aggregate = on
> > enable_partitionwise_join = on
> > listen_addresses = '*'
> > maintenance_work_mem = 2GB
> > max_connections = 200
> > max_parallel_maintenance_workers = 4
> > max_parallel_workers = 512
> > max_parallel_workers_per_gather = 512
> > max_wal_size = 4GB
> > max_worker_processes = 512
> > min_wal_size = 1GB
> > random_page_cost = 1.1
> > shared_buffers = 127232MB
> > shared_preload_libraries = 'cstore_fdw'
> > synchronous_commit = off
> > wal_buffers = 16MB
> > work_mem = 1628560kB
> >
> > Best,
> > Seamus
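
To make the log2 formula quoted above concrete, here's a quick back-of-the-envelope in SQL - just an illustration, not the planner code (my reading is that it actually uses fls(n), i.e. floor(log2(n)) + 1, then caps the result at max_parallel_workers_per_gather):

-- Workers the heuristic requests for n children: floor(log2(n)) + 1,
-- capped by max_parallel_workers_per_gather (512 in my config).
SELECT n AS partitions,
       LEAST(floor(log(2, n))::int + 1, 512) AS requested_workers
FROM (VALUES (8), (16), (32), (64)) AS t(n);
-- 64 partitions => 7, matching the "Workers Planned: 7" above.

Which also shows why raising max_parallel_workers_per_gather alone doesn't help here: with only 64 children the binding term is the log2 formula, not the cap.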