Fwd: Helping planner to chose sequential scan when it improves performance

"Patrick O'Toole" <patrick.otoole@xxxxxxxxx> · Sun, 25 Jun 2023 00:29:00 -0600

Hello! I tried asking this over on the general listserv before realizing pgsql-performance is probably better suited.

Hi all,

I recently started at a new firm and have been trying to help to grok certain planner behavior. A strip-down example of the sort of join we do in the database looks like this, wherein we join two tables that have about 1 million rows:

-- VACUUM (FULL, VERBOSE, ANALYZE), run the query twice first, then...
EXPLAIN(ANALYZE, VERBOSE, COSTS, SETTINGS, BUFFERS, WAL, TIMING, SUMMARY)
SELECT
    ci.conversation_uuid,
    ci.item_uuid,
    ci.tenant_id,
    it.item_subject,
    it.item_body_start
FROM
    conversation_item AS ci
	INNER JOIN item_text AS it ON it.item_uuid = ci.item_uuid;

-- The necessary DDL that creates these tables and indexes is attached. I've commented out some extra stuff that isn't directly related to the above query.

Depending on config, we get different results in terms of performance (EXPLAIN output attached):

PLAN A (default config, effective cache size just shy of 15GB): 3.829 seconds. A nested loop is used to probe the hash index `conversation_item_item_hash_index` for each row of item_text. Although the cost of probing once is low, a fair amount of time passes because the operation is repeated ~1.3 million times.

PLAN B (enable_indexscan off, effective cache same as before): 3.254 seconds (~15% speedup, sometimes 30%). Both tables are scanned sequentially and conversation_item is hashed before results are combined with a hash join.

PLAN C: (random_page_cost = 8.0, instead of default 4, effective cache same as before): 2.959 (~23% speedup, sometimes 38%). Same overall plan as PLAN B, some differences in buffers and I/O. I'll note we had to get to 8.0 before we saw a change to planner behavior; 5.0, 6.0, and 7.0 were too low to make a difference.

Environment:

Postgres 15.2
Amazon RDS — db.m6g.2xlarge

Questions:
In Plan A, what factors (read: which GUC settings) are causing the planner to select a substantially slower plan despite having recent stats about number of rows?
Is there a substantial difference between the on-the-fly hash done in Plan B and Plan C compared to the hash-index used in Plan A? Can I assume they are essentially the same? Perhaps there are there differences in how they're applied?
Is it common to see values for random_page_cost set as high as 8.0? We would of course need to investigate whether we see a net positive or net negative impact on other queries, to adopt this as a general setting, but is it a proposal we should actually consider?
Maybe we are barking up the wrong tree with the previous questions. Are there other configuration parameters we should consider first to improve performance in situations like the one illustrated?
Are there other problems with our schema, query, or plans shown here? Other approaches (or tools/analyses) we should consider?
Attached also are the results of
SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');
- Patrick

Hash Join  (cost=53933.69..255802.87 rows=1395542 width=216) (actual time=422.466..2868.948 rows=1394633 loops=1)"
  Output: ci.conversation_uuid, ci.item_uuid, ci.tenant_id, it.item_subject, it.item_body_start"
  Inner Unique: true"
  Hash Cond: (it.item_uuid = ci.item_uuid)"
  Buffers: shared hit=12283 read=95974 dirtied=77, temp read=44113 written=44113"
  I/O Timings: shared/local read=462.195, temp read=86.413 write=423.363"
  WAL: records=1 bytes=58"
  ->  Seq Scan on item_text it  (cost=0.00..110641.18 rows=1401518 width=196) (actual time=0.012..854.842 rows=1394633 loops=1)"
        Output: it.item_subject, it.item_body_start, it.item_uuid"
        Buffers: shared hit=652 read=95974 dirtied=77"
        I/O Timings: shared/local read=462.195"
        WAL: records=1 bytes=58"
  ->  Hash  (cost=25586.42..25586.42 rows=1395542 width=36) (actual time=422.386..422.387 rows=1394633 loops=1)"
        Output: ci.conversation_uuid, ci.item_uuid, ci.tenant_id"
        Buckets: 131072  Batches: 16  Memory Usage: 6799kB"
        Buffers: shared hit=11631, temp written=8932"
        I/O Timings: temp write=78.785"
        ->  Seq Scan on conversation_item ci  (cost=0.00..25586.42 rows=1395542 width=36) (actual time=0.004..133.304 rows=1394633 loops=1)"
              Output: ci.conversation_uuid, ci.item_uuid, ci.tenant_id"
              Buffers: shared hit=11631"
Settings: effective_cache_size = '16053152kB', jit = 'off', random_page_cost = '8'"
Query Identifier: 5005100605804837348"
Planning:"
  Buffers: shared hit=398 read=7"
  I/O Timings: shared/local read=1.400"
Planning Time: 2.379 ms"
Execution Time: 2956.264 ms"

Generated with:
	EXPLAIN(ANALYZE, VERBOSE, COSTS, SETTINGS, BUFFERS, WAL, TIMING, SUMMARY)
	SELECT
		ci.conversation_uuid,
		ci.item_uuid,
		ci.tenant_id,
		it.item_subject,
		it.item_body_start
	FROM
		conversation_item AS ci
		INNER JOIN item_text AS it ON it.item_uuid = ci.item_uuid;
Hash Join  (cost=53126.96..251632.49 rows=1374665 width=216) (actual time=475.472..3162.922 rows=1374687 loops=1)"
  Output: ci.conversation_uuid, ci.item_uuid, ci.tenant_id, it.item_subject, it.item_body_start"
  Inner Unique: true"
  Hash Cond: (it.item_uuid = ci.item_uuid)"
  Buffers: shared hit=1376 read=105087 written=8220, temp read=43290 written=43290"
  I/O Timings: shared/local read=319.560 write=38.895, temp read=92.529 write=420.118"
  ->  Seq Scan on item_text it  (cost=0.00..108786.21 rows=1378021 width=196) (actual time=0.012..730.051 rows=1374687 loops=1)"
        Output: it.item_subject, it.item_body_start, it.item_uuid"
        Buffers: shared hit=1238 read=93768 written=5951"
        I/O Timings: shared/local read=289.139 write=29.014"
  ->  Hash  (cost=25203.65..25203.65 rows=1374665 width=36) (actual time=475.372..475.373 rows=1374687 loops=1)"
        Output: ci.conversation_uuid, ci.item_uuid, ci.tenant_id"
        Buckets: 131072  Batches: 16  Memory Usage: 6718kB"
        Buffers: shared hit=138 read=11319 written=2269, temp written=8804"
        I/O Timings: shared/local read=30.421 write=9.881, temp write=80.814"
        ->  Seq Scan on conversation_item ci  (cost=0.00..25203.65 rows=1374665 width=36) (actual time=0.008..184.562 rows=1374687 loops=1)"
              Output: ci.conversation_uuid, ci.item_uuid, ci.tenant_id"
              Buffers: shared hit=138 read=11319 written=2269"
              I/O Timings: shared/local read=30.421 write=9.881"
Settings: effective_cache_size = '16053152kB', enable_indexscan = 'off', jit = 'off'"
Query Identifier: 5005100605804837348"
Planning:"
  Buffers: shared hit=379 read=12"
  I/O Timings: shared/local read=0.810"
Planning Time: 1.724 ms"
Execution Time: 3254.486 ms"

Generated with:
	EXPLAIN(ANALYZE, VERBOSE, COSTS, SETTINGS, BUFFERS, WAL, TIMING, SUMMARY)
	SELECT
		ci.conversation_uuid,
		ci.item_uuid,
		ci.tenant_id,
		it.item_subject,
		it.item_body_start
	FROM
		conversation_item AS ci
		INNER JOIN item_text AS it ON it.item_uuid = ci.item_uuid;
Nested Loop  (cost=0.00..194705.88 rows=1340602 width=213) (actual time=0.018..3739.905 rows=1339642 loops=1)"
  Output: ci.conversation_uuid, ci.item_uuid, ci.tenant_id, it.item_subject, it.item_body_start"
  Inner Unique: true"
  Buffers: shared hit=2681522 read=90343 dirtied=203"
  I/O Timings: shared/local read=233.152"
  WAL: records=203 fpi=203 bytes=428058"
  ->  Seq Scan on item_text it  (cost=0.00..105892.07 rows=1350507 width=193) (actual time=0.011..628.038 rows=1339642 loops=1)"
        Output: it.item_subject, it.item_body_start, it.item_uuid"
        Buffers: shared hit=2044 read=90343 dirtied=193"
        I/O Timings: shared/local read=233.152"
        WAL: records=193 fpi=193 bytes=380966"
  ->  Index Scan using conversation_item_item_hash_index on conversation_item ci  (cost=0.00..0.07 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=1339642)"
        Output: ci.conversation_uuid, ci.item_uuid, ci.tenant_id"
        Index Cond: (ci.item_uuid = it.item_uuid)"
        Rows Removed by Index Recheck: 0"
        Buffers: shared hit=2679478 dirtied=10"
        WAL: records=10 fpi=10 bytes=47092"
Settings: effective_cache_size = '16053152kB', jit = 'off'"
Query Identifier: 5005100605804837348"
Planning:"
  Buffers: shared hit=12 read=4"
  I/O Timings: shared/local read=0.015"
Planning Time: 0.236 ms"
Execution Time: 3829.002 ms"

Generated with:
	EXPLAIN(ANALYZE, VERBOSE, COSTS, SETTINGS, BUFFERS, WAL, TIMING, SUMMARY)
	SELECT
		ci.conversation_uuid,
		ci.item_uuid,
		ci.tenant_id,
		it.item_subject,
		it.item_body_start
	FROM
		conversation_item AS ci
		INNER JOIN item_text AS it ON it.item_uuid = ci.item_uuid;

Attachment:
create_statements.postgre.sql

Description: application/sql
name,current_setting,source
archive_library,rds_archive,configuration file
archive_mode,on,configuration file
archive_timeout,5min,configuration file
autovacuum_analyze_scale_factor,0.05,configuration file
autovacuum_max_workers,3,configuration file
autovacuum_naptime,15s,configuration file
autovacuum_vacuum_cost_limit,600,configuration file
autovacuum_vacuum_scale_factor,0.1,configuration file
autovacuum_work_mem,1003322kB,configuration file
checkpoint_completion_target,0.9,configuration file
compute_query_id,auto,configuration file
default_toast_compression,lz4,configuration file
effective_cache_size,16053152kB,configuration file
fsync,on,configuration file
full_page_writes,on,configuration file
hot_standby,off,configuration file
hot_standby_feedback,on,configuration file
huge_pages,on,configuration file
idle_in_transaction_session_timeout,1d,configuration file
ignore_invalid_pages,off,configuration file
jit,off,configuration file
listen_addresses,*,command line
lo_compat_privileges,off,configuration file
log_autovacuum_min_duration,10s,configuration file
log_checkpoints,on,configuration file
log_destination,stderr,configuration file
log_file_mode,644,configuration file
log_line_prefix,%t:%r:%u@%d:[%p]:,configuration file
log_min_duration_statement,4s,configuration file
log_rotation_age,1h,configuration file
log_temp_files,0,configuration file
log_truncate_on_rotation,off,configuration file
logging_collector,on,configuration file
maintenance_work_mem,526332kB,configuration file
max_connections,3449,configuration file
max_locks_per_transaction,64,configuration file
max_parallel_workers,8,configuration file
max_prepared_transactions,0,configuration file
max_replication_slots,20,configuration file
max_stack_depth,6MB,configuration file
max_standby_archive_delay,5min,configuration file
max_standby_streaming_delay,5min,configuration file
max_wal_senders,20,configuration file
max_wal_size,2GB,configuration file
max_worker_processes,16,configuration file
min_wal_size,192MB,configuration file
pg_stat_statements.track,all,configuration file
rds.force_autovacuum_logging_level,warning,configuration file
rds.internal_databases,"rdsadmin,template0",configuration file
rds.logical_replication,off,configuration file
rds.rds_superuser_reserved_connections,2,configuration file
recovery_init_sync_method,syncfs,configuration file
recovery_prefetch,off,configuration file
remove_temp_files_after_crash,off,configuration file
shared_buffers,8026576kB,configuration file
ssl,on,configuration file
ssl_ca_file,/rdsdbdata/rds-metadata/ca-cert.pem,configuration file
ssl_cert_file,/rdsdbdata/rds-metadata/server-cert.pem,configuration file
ssl_key_file,/rdsdbdata/rds-metadata/server-key.pem,configuration file
superuser_reserved_connections,3,configuration file
synchronous_commit,on,configuration file
track_activity_query_size,4kB,configuration file
track_functions,pl,configuration file
track_io_timing,on,configuration file
track_wal_io_timing,on,configuration file
unix_socket_group,rdsdb,configuration file
update_process_title,on,configuration file
vacuum_cost_page_miss,5,configuration file
vacuum_failsafe_age,1200000000,configuration file
vacuum_multixact_failsafe_age,1200000000,configuration file
wal_compression,zstd,configuration file
wal_keep_size,2GB,configuration file
wal_level,replica,configuration file
wal_receiver_create_temp_slot,off,configuration file
wal_receiver_timeout,30s,configuration file
wal_sender_timeout,30s,configuration file