Re: Sort performance on large tables

Charlie Savage <cfis@xxxxxxxxxxxxx> · Tue, 08 Nov 2005 22:49:43 -0700

Very interesting technique.  It doesn't actually do quite what I want 
since it returns all rows that do not have duplicates and not a complete 
list of unique tlid values.  But I could massage it to do what I want.

Anyway, the timing:

"Seq Scan on completechain t1  (cost=0.00..218139733.60 rows=24099582 
width=4) (actual time=25.890..3404650.452 rows=47000655 loops=1)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Index Scan using idx_completechain_tlid on completechain t2 
(cost=0.00..4.48 rows=1 width=0) (actual time=0.059..0.059 rows=0 
loops=48199165)"
"          Index Cond: ($0 = tlid)"
"          Filter: ($1 <> ogc_fid)"
"Total runtime: 3551423.162 ms"
Marc Morin wrote:

So a 60% reduction in time.  Thanks again for the tip.

Charlie

I have run into this type of query problem as well.  I solved it in my
application by the following type of query.

Assumes of course that you have an index on tlid.

-----Original Message-----
From: pgsql-performance-owner@xxxxxxxxxxxxxx 
[mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of 
Charlie Savage
Sent: Tuesday, November 08, 2005 2:05 AM
To: pgsql-performance@xxxxxxxxxxxxxx
Subject:  Sort performance on large tables

Hi everyone,

I have a question about the performance of sort.

Setup: Dell Dimension 3000, Suse 10, 1GB ram, PostgreSQL 8.1 
RC 1 with PostGIS, 1 built-in 80 GB IDE drive, 1 SATA Seagate 
400GB drive.  The IDE drive has the OS and the WAL files, the 
SATA drive the database. 
 From hdparm the max IO for the IDE drive is about 50Mb/s and 
the SATA drive is about 65Mb/s.  Thus a very low-end machine 
- but it used just for development (i.e., it is not a 
production machine) and the only thing it does is run a 
PostgresSQL database.

I have a staging table called completechain that holds US 
tiger data (i.e., streets and addresses for the US).  The 
table is approximately 18GB.  Its big because there is a lot 
of data, but also because the table is not normalized (it 
comes that way).

I want to extract data out of the file, with the most 
important values being stored in a column called tlid.  The 
tlid field is an integer, and the values are 98% unique.  
There is a second column called ogc_fid which is unique (it 
is a serial field).  I need to extract out unique TLID's 
(doesn't matter which duplicate I get rid of).  To do this I 
am running this query:

SELECT tlid, min(ogc_fid)
FROM completechain
GROUP BY tlid;

The results from explain analyze are:

"GroupAggregate  (cost=10400373.80..11361807.88 rows=48071704 
width=8) (actual time=7311682.715..8315746.835 rows=47599910 loops=1)"
"  ->  Sort  (cost=10400373.80..10520553.06 rows=48071704 
width=8) (actual time=7311682.682..7972304.777 rows=48199165 loops=1)"
"        Sort Key: tlid"
"        ->  Seq Scan on completechain  (cost=0.00..2228584.04 
rows=48071704 width=8) (actual time=27.514..773245.046 
rows=48199165 loops=1)"
"Total runtime: 8486057.185 ms"

Doing a similar query produces the same results:

SELECT DISTINCT ON (tlid), tlid, ogc_fid FROM completechain;

Note it takes over 10 times longer to do the sort than the 
full sequential scan.

Should I expect results like this?  I realize that the 
computer is quite low-end and is very IO bound for this 
query, but I'm still surprised that the sort operation takes so long.

Out of curiosity, I setup an Oracle database on the same 
machine with the same data and ran the same query.  Oracle 
was over an order of magnitude faster. Looking at its query 
plan, it avoided the sort by using "HASH GROUP BY."  Does 
such a construct exist in PostgreSQL (I see only hash joins)?

Also as an experiment I forced oracle to do a sort by running 
this query:

SELECT tlid, min(ogc_fid)
FROM completechain
GROUP BY tlid
ORDER BY tlid;

Even with this, it was more than a magnitude faster than Postgresql. 
Which makes me think I have somehow misconfigured postgresql 
(see the relevant parts of postgresql.conf below).

Any idea/help appreciated.

Thanks,

Charlie

-------------------------------

#-------------------------------------------------------------
--------------
# RESOURCE USAGE (except WAL)
#-------------------------------------------------------------
--------------

shared_buffers = 40000                  # 40000 buffers * 8192 
bytes/buffer = 327,680,000 bytes
#shared_buffers = 1000			# min 16 or 
max_connections*2, 8KB each

temp_buffers = 5000
#temp_buffers = 1000			# min 100, 8KB each
#max_prepared_transactions = 5		# can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes 
of shared memory # per transaction slot, plus lock space (see 
max_locks_per_transaction).

work_mem =  16384                        # in Kb
#work_mem = 1024			# min 64, size in KB

maintenance_work_mem = 262144            # in kb
#maintenance_work_mem = 16384		# min 1024, size in KB
#max_stack_depth = 2048			# min 100, size in KB

# - Free Space Map -

max_fsm_pages = 60000	
#max_fsm_pages = 20000			# min 
max_fsm_relations*16, 6 bytes each

#max_fsm_relations = 1000		# min 100, ~70 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000		# min 25
#preload_libraries = ''

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0			# 0-1000 milliseconds
#vacuum_cost_page_hit = 1		# 0-10000 credits
#vacuum_cost_page_miss = 10		# 0-10000 credits
#vacuum_cost_page_dirty = 20		# 0-10000 credits
#vacuum_cost_limit = 200		# 0-10000 credits

# - Background writer -

#bgwriter_delay = 200			# 10-10000 milliseconds 
between rounds
#bgwriter_lru_percent = 1.0		# 0-100% of LRU buffers 
scanned/round
#bgwriter_lru_maxpages = 5		# 0-1000 buffers max 
written/round
#bgwriter_all_percent = 0.333		# 0-100% of all buffers 
scanned/round
#bgwriter_all_maxpages = 5		# 0-1000 buffers max 
written/round

#-------------------------------------------------------------
--------------
# WRITE AHEAD LOG
#-------------------------------------------------------------
--------------

# - Settings -

fsync = on				# turns forced 
synchronization on or off
#wal_sync_method = fsync		# the default is the 
first option
					# supported by the 
operating system:
					#   open_datasync
					#   fdatasync
					#   fsync
					#   fsync_writethrough
					#   open_sync
#full_page_writes = on			# recover from 
partial page writes

wal_buffers = 128
#wal_buffers = 8			# min 4, 8KB each

#commit_delay = 0			# range 0-100000, in 
microseconds
#commit_siblings = 5			# range 1-1000

# - Checkpoints -

checkpoint_segments = 256               # 256 * 16Mb = 
4,294,967,296 bytes
checkpoint_timeout = 1200		# 1200 seconds (20 minutes)
checkpoint_warning = 30			# in seconds, 0 is off

#checkpoint_segments = 3		# in logfile segments, 
min 1, 16MB each
#checkpoint_timeout = 300		# range 30-3600, in seconds
#checkpoint_warning = 30		# in seconds, 0 is off

# - Archiving -

#archive_command = ''			# command to use to 
archive a logfile
					# segment

#-------------------------------------------------------------
--------------
# QUERY TUNING
#-------------------------------------------------------------
--------------

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

effective_cache_size = 80000		# 80000 * 8192 = 
655,360,000 bytes
#effective_cache_size = 1000		# typically 8KB each

random_page_cost = 2.5			# units are one 
sequential page fetch
#random_page_cost = 4			# units are one 
sequential page fetch
					# cost
#cpu_tuple_cost = 0.01			# (same)
#cpu_index_tuple_cost = 0.001		# (same)
#cpu_operator_cost = 0.0025		# (same)

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5			# range 1-10
#geqo_pool_size = 0			# selects default based 
on effort
#geqo_generations = 0			# selects default based 
on effort
#geqo_selection_bias = 2.0		# range 1.5-2.0

# - Other Planner Options -

default_statistics_target = 100		# range 1-1000
#default_statistics_target = 10		# range 1-1000
#constraint_exclusion = off
#from_collapse_limit = 8
#join_collapse_limit = 8		# 1 disables collapsing 
of explicit
					# JOINs

#-------------------------------------------------------------
--------------
#-------------------------------------------------------------
--------------
# RUNTIME STATISTICS
#-------------------------------------------------------------
--------------

# - Statistics Monitoring -

#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off

# - Query/Index Statistics Collector -

stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on

#stats_start_collector = on
#stats_command_string = off
#stats_block_level = off
#stats_row_level = off
#stats_reset_on_server_start = off

#-------------------------------------------------------------
--------------
# AUTOVACUUM PARAMETERS
#-------------------------------------------------------------
--------------

autovacuum = true
autovacuum_naptime = 600

#autovacuum = false			# enable autovacuum subprocess?
#autovacuum_naptime = 60		# time between 
autovacuum runs, in secs
#autovacuum_vacuum_threshold = 1000	# min # of tuple updates before
					# vacuum
#autovacuum_analyze_threshold = 500	# min # of tuple updates before
					# analyze
#autovacuum_vacuum_scale_factor = 0.4	# fraction of rel size before
					# vacuum
#autovacuum_analyze_scale_factor = 0.2	# fraction of 
rel size before
					# analyze
#autovacuum_vacuum_cost_delay = -1	# default vacuum cost delay for
					# autovac, -1 means use
					# vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1	# default vacuum cost limit for
					# autovac, -1 means use
					# vacuum_cost_

----------------------

CREATE TABLE tiger.completechain
(
   ogc_fid int4 NOT NULL DEFAULT
nextval('completechain_ogc_fid_seq'::regclass),
   module varchar(8) NOT NULL,
   tlid int4 NOT NULL,
   side1 int4,
   source varchar(1) NOT NULL,
   fedirp varchar(2),
   fename varchar(30),
   fetype varchar(4),
   fedirs varchar(2),
   cfcc varchar(3) NOT NULL,
   fraddl varchar(11),
   toaddl varchar(11),
   fraddr varchar(11),
   toaddr varchar(11),
   friaddl varchar(1),
   toiaddl varchar(1),
   friaddr varchar(1),
   toiaddr varchar(1),
   zipl int4,
   zipr int4,
   aianhhfpl int4,
   aianhhfpr int4,
   aihhtlil varchar(1),
   aihhtlir varchar(1),
   census1 varchar(1),
   census2 varchar(1),
   statel int4,
   stater int4,
   countyl int4,
   countyr int4,
   cousubl int4,
   cousubr int4,
   submcdl int4,
   submcdr int4,
   placel int4,
   placer int4,
   tractl int4,
   tractr int4,
   blockl int4,
   blockr int4,
   wkb_geometry public.geometry NOT NULL,
   CONSTRAINT enforce_dims_wkb_geometry CHECK 
(ndims(wkb_geometry) = 2),
   CONSTRAINT enforce_geotype_wkb_geometry CHECK
(geometrytype(wkb_geometry) = 'LINESTRING'::text OR 
wkb_geometry IS NULL),
   CONSTRAINT enforce_srid_wkb_geometry CHECK 
(srid(wkb_geometry) = 4269)
)
WITHOUT OIDS;
ALTER TABLE tiger.completechain OWNER TO postgres;

---------------------------(end of 
broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match