hi,

Apologies for replying to myself, but I made this ~7.5x faster (8565 ms down to 1134 ms) by compiling from source with parallel_workers hard-coded from 7 up to 96. If this weren't a partitioned table, I could just do `alter table test_3pd_cstore_partitioned set (parallel_workers = 96)` - but that isn't currently supported on partitioned tables.

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index cd3fdd259c..0057a69d4e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1544,6 +1544,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 								   max_parallel_workers_per_gather);
 		Assert(parallel_workers > 0);
 
+		/* force a crazy parallelism */
+		parallel_workers = 96;
+
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
 										NIL, NULL, parallel_workers, true,

BEFORE:

postgres=# explain analyze select count(*) from test_3pd_cstore_partitioned where age > 50;

 Finalize Aggregate  (cost=558860.96..558860.97 rows=1 width=8) (actual time=8528.509..8538.627 rows=1 loops=1)
   ->  Gather  (cost=56918.65..558860.94 rows=7 width=8) (actual time=1863.462..8538.603 rows=64 loops=1)
         Workers Planned: 7
         Workers Launched: 7
         ->  Parallel Append  (cost=55918.65..557860.24 rows=1 width=8) (actual time=1877.875..8417.486 rows=8 loops=8)
 [...]
 Execution Time: 8565.734 ms

AFTER:

postgres=# explain analyze select count(*) from test_3pd_cstore_partitioned where age > 50;

 Finalize Aggregate  (cost=57030.20..57030.21 rows=1 width=8) (actual time=1125.828..1129.958 rows=1 loops=1)
   ->  Gather  (cost=56611.92..57029.96 rows=96 width=8) (actual time=994.708..1129.940 rows=64 loops=1)
         Workers Planned: 96
         Workers Launched: 96
         ->  Parallel Append  (cost=55611.92..56020.36 rows=1 width=8) (actual time=656.027..656.051 rows=1 loops=97)
 [...]
 Execution Time: 1133.810 ms

Should I try to submit a patch that adds support for `alter table test_3pd_cstore_partitioned set (parallel_workers = 96)` on partitioned tables?

Best,
Seamus

PS. The awesome thing about this is that I'm now table scanning a 270 million row, 600 column table in 1.2 seconds, which I never thought was possible.

PPS. I have no idea why 96 workers beat 64 (the number of cores I have), but it did - roughly 7.5x vs. 4x.
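PPPS. In case anyone wants the effect without patching the server: the reloption *is* accepted on the leaf partitions, so something like the untested sketch below might work. Caveat: per the comment in allpaths.c, non-partial child paths don't contribute to the planned worker count, so I'd only expect this to help where the children have partial (parallel-aware) paths - which my cstore_fdw partitions may not.

-- Untested sketch: set the reloption on every leaf partition, since the
-- planner takes the max over the partial child paths when sizing the
-- Parallel Append. Assumes the leaves are plain heap tables; ALTER TABLE
-- won't accept this on foreign-table partitions.
DO $$
DECLARE
    part regclass;
BEGIN
    FOR part IN
        SELECT inhrelid::regclass
        FROM pg_inherits
        WHERE inhparent = 'test_3pd_cstore_partitioned'::regclass
    LOOP
        EXECUTE format('ALTER TABLE %s SET (parallel_workers = 96)', part);
    END LOOP;
END;
$$;

Unlike the source hack above, this also raises the worker target for queries that touch only one partition, so it's a blunt instrument.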
On Sat, Feb 13, 2021, at 7:09 PM, Seamus Abshere wrote:
> hi,
>
> I've traced this back to the formula for Parallel Append workers:
> log2(partitions).
>
> The comment from Robert says (src/backend/optimizer/path/allpaths.c):
>
>     /*
>      * If the use of parallel append is permitted, always request at least
>      * log2(# of children) workers.
>
> In my case, every partition takes 1 second to scan, I have 64 cores, I
> have 64 partitions, and the wall time is 8 seconds with 8 workers.
>
> I assume that if it planned significantly more workers (16? 32? even
> 64?), it would get significantly faster (even accounting for
> transaction cost). So why doesn't it ask for more? Note that I've set
> max_parallel_workers = 512, etc. (see the postgresql.conf in my first
> message).
>
> Here are the full plans:
> https://gist.github.com/seamusabshere/1c1d869ee24145689f45e4124d8730d3
>
> Best,
> Seamus
>
> On Fri, Feb 12, 2021, at 4:50 PM, Seamus Abshere wrote:
> > hi,
> >
> > How can I convince Postgres to use more than 8 cores?
> >
> > I've got an r6gd.16xlarge with 64 vCPUs and 512 GB RAM running Postgres
> > 13.1 on Ubuntu 20.04.
> >
> > CREATE TABLE tbl (
> >     [...]
> > ) PARTITION BY HASH (address_key);
> >
> > It has 64 partitions - e.g. FOR VALUES WITH (modulus 64, remainder 0), etc.
> >
> > We're running `SELECT COUNT(*) FROM tbl`.
> >
> > I've watched top and I never see more than 8 cores going 100%.
> >
> > Here is my (admittedly ridiculous) postgresql.conf:
> >
> > checkpoint_completion_target = 0.9
> > data_directory = '/tank/postgresql/13/main'
> > default_statistics_target = 100
> > effective_cache_size = 381696MB
> > effective_io_concurrency = 200
> > enable_partition_pruning = on
> > enable_partitionwise_aggregate = on
> > enable_partitionwise_join = on
> > listen_addresses = '*'
> > maintenance_work_mem = 2GB
> > max_connections = 200
> > max_parallel_maintenance_workers = 4
> > max_parallel_workers = 512
> > max_parallel_workers_per_gather = 512
> > max_wal_size = 4GB
> > max_worker_processes = 512
> > min_wal_size = 1GB
> > random_page_cost = 1.1
> > shared_buffers = 127232MB
> > shared_preload_libraries = 'cstore_fdw'
> > synchronous_commit = off
> > wal_buffers = 16MB
> > work_mem = 1628560kB
> >
> > Best,
> > Seamus
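
To make the log2 formula quoted above concrete, here's a quick back-of-the-envelope in SQL - just an illustration, not the planner code (my reading is that it actually uses fls(n), i.e. floor(log2(n)) + 1, then caps the result at max_parallel_workers_per_gather):

-- Workers the heuristic requests for n children: floor(log2(n)) + 1,
-- capped by max_parallel_workers_per_gather (512 in my config).
SELECT n AS partitions,
       LEAST(floor(log(2, n))::int + 1, 512) AS requested_workers
FROM (VALUES (8), (16), (32), (64)) AS t(n);
-- 64 partitions => 7, matching the "Workers Planned: 7" above.

Which also shows why raising max_parallel_workers_per_gather alone doesn't help here: with only 64 children the binding term is the log2 formula, not the cap.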