Re: Sort performance on large tables

"Marc Morin" <marc@xxxxxxxxxxxx> · Tue, 8 Nov 2005 08:19:44 -0500

I have run into this type of query problem as well.  I solved it in my
application by the following type of query.

SELECT tlid
FROM completechain AS o
WHERE not exists ( 
	SELECT 1
	FROM completechain
	WHERE tlid=o.tlid and ogc_fid!=o.ogc_fid
);

Assumes of course that you have an index on tlid.

> -----Original Message-----
> From: pgsql-performance-owner@xxxxxxxxxxxxxx 
> [mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of 
> Charlie Savage
> Sent: Tuesday, November 08, 2005 2:05 AM
> To: pgsql-performance@xxxxxxxxxxxxxx
> Subject:  Sort performance on large tables
> 
> Hi everyone,
> 
> I have a question about the performance of sort.
> 
> Setup: Dell Dimension 3000, Suse 10, 1GB ram, PostgreSQL 8.1 
> RC 1 with PostGIS, 1 built-in 80 GB IDE drive, 1 SATA Seagate 
> 400GB drive.  The IDE drive has the OS and the WAL files, the 
> SATA drive the database. 
>  From hdparm the max IO for the IDE drive is about 50Mb/s and 
> the SATA drive is about 65Mb/s.  Thus a very low-end machine 
> - but it used just for development (i.e., it is not a 
> production machine) and the only thing it does is run a 
> PostgresSQL database.
> 
> I have a staging table called completechain that holds US 
> tiger data (i.e., streets and addresses for the US).  The 
> table is approximately 18GB.  Its big because there is a lot 
> of data, but also because the table is not normalized (it 
> comes that way).
> 
> I want to extract data out of the file, with the most 
> important values being stored in a column called tlid.  The 
> tlid field is an integer, and the values are 98% unique.  
> There is a second column called ogc_fid which is unique (it 
> is a serial field).  I need to extract out unique TLID's 
> (doesn't matter which duplicate I get rid of).  To do this I 
> am running this query:
> 
> SELECT tlid, min(ogc_fid)
> FROM completechain
> GROUP BY tlid;
> 
> The results from explain analyze are:
> 
> "GroupAggregate  (cost=10400373.80..11361807.88 rows=48071704 
> width=8) (actual time=7311682.715..8315746.835 rows=47599910 loops=1)"
> "  ->  Sort  (cost=10400373.80..10520553.06 rows=48071704 
> width=8) (actual time=7311682.682..7972304.777 rows=48199165 loops=1)"
> "        Sort Key: tlid"
> "        ->  Seq Scan on completechain  (cost=0.00..2228584.04 
> rows=48071704 width=8) (actual time=27.514..773245.046 
> rows=48199165 loops=1)"
> "Total runtime: 8486057.185 ms"
> 	
> Doing a similar query produces the same results:
> 
> SELECT DISTINCT ON (tlid), tlid, ogc_fid FROM completechain;
> 
> Note it takes over 10 times longer to do the sort than the 
> full sequential scan.
> 
> Should I expect results like this?  I realize that the 
> computer is quite low-end and is very IO bound for this 
> query, but I'm still surprised that the sort operation takes so long.
> 
> Out of curiosity, I setup an Oracle database on the same 
> machine with the same data and ran the same query.  Oracle 
> was over an order of magnitude faster. Looking at its query 
> plan, it avoided the sort by using "HASH GROUP BY."  Does 
> such a construct exist in PostgreSQL (I see only hash joins)?
> 
> Also as an experiment I forced oracle to do a sort by running 
> this query:
> 
> SELECT tlid, min(ogc_fid)
> FROM completechain
> GROUP BY tlid
> ORDER BY tlid;
> 
> Even with this, it was more than a magnitude faster than Postgresql. 
> Which makes me think I have somehow misconfigured postgresql 
> (see the relevant parts of postgresql.conf below).
> 
> Any idea/help appreciated.
> 
> Thanks,
> 
> Charlie
> 
> 
> -------------------------------
> 
> #-------------------------------------------------------------
> --------------
> # RESOURCE USAGE (except WAL)
> #-------------------------------------------------------------
> --------------
> 
> shared_buffers = 40000                  # 40000 buffers * 8192 
> bytes/buffer = 327,680,000 bytes
> #shared_buffers = 1000			# min 16 or 
> max_connections*2, 8KB each
> 
> temp_buffers = 5000
> #temp_buffers = 1000			# min 100, 8KB each
> #max_prepared_transactions = 5		# can be 0 or more
> # note: increasing max_prepared_transactions costs ~600 bytes 
> of shared memory # per transaction slot, plus lock space (see 
> max_locks_per_transaction).
> 
> work_mem =  16384                        # in Kb
> #work_mem = 1024			# min 64, size in KB
> 
> maintenance_work_mem = 262144            # in kb
> #maintenance_work_mem = 16384		# min 1024, size in KB
> #max_stack_depth = 2048			# min 100, size in KB
> 
> # - Free Space Map -
> 
> max_fsm_pages = 60000	
> #max_fsm_pages = 20000			# min 
> max_fsm_relations*16, 6 bytes each
> 
> #max_fsm_relations = 1000		# min 100, ~70 bytes each
> 
> # - Kernel Resource Usage -
> 
> #max_files_per_process = 1000		# min 25
> #preload_libraries = ''
> 
> # - Cost-Based Vacuum Delay -
> 
> #vacuum_cost_delay = 0			# 0-1000 milliseconds
> #vacuum_cost_page_hit = 1		# 0-10000 credits
> #vacuum_cost_page_miss = 10		# 0-10000 credits
> #vacuum_cost_page_dirty = 20		# 0-10000 credits
> #vacuum_cost_limit = 200		# 0-10000 credits
> 
> # - Background writer -
> 
> #bgwriter_delay = 200			# 10-10000 milliseconds 
> between rounds
> #bgwriter_lru_percent = 1.0		# 0-100% of LRU buffers 
> scanned/round
> #bgwriter_lru_maxpages = 5		# 0-1000 buffers max 
> written/round
> #bgwriter_all_percent = 0.333		# 0-100% of all buffers 
> scanned/round
> #bgwriter_all_maxpages = 5		# 0-1000 buffers max 
> written/round
> 
> 
> #-------------------------------------------------------------
> --------------
> # WRITE AHEAD LOG
> #-------------------------------------------------------------
> --------------
> 
> # - Settings -
> 
> fsync = on				# turns forced 
> synchronization on or off
> #wal_sync_method = fsync		# the default is the 
> first option
> 					# supported by the 
> operating system:
> 					#   open_datasync
> 					#   fdatasync
> 					#   fsync
> 					#   fsync_writethrough
> 					#   open_sync
> #full_page_writes = on			# recover from 
> partial page writes
> 
> wal_buffers = 128
> #wal_buffers = 8			# min 4, 8KB each
> 
> #commit_delay = 0			# range 0-100000, in 
> microseconds
> #commit_siblings = 5			# range 1-1000
> 
> # - Checkpoints -
> 
> checkpoint_segments = 256               # 256 * 16Mb = 
> 4,294,967,296 bytes
> checkpoint_timeout = 1200		# 1200 seconds (20 minutes)
> checkpoint_warning = 30			# in seconds, 0 is off
> 
> #checkpoint_segments = 3		# in logfile segments, 
> min 1, 16MB each
> #checkpoint_timeout = 300		# range 30-3600, in seconds
> #checkpoint_warning = 30		# in seconds, 0 is off
> 
> # - Archiving -
> 
> #archive_command = ''			# command to use to 
> archive a logfile
> 					# segment
> 
> 
> #-------------------------------------------------------------
> --------------
> # QUERY TUNING
> #-------------------------------------------------------------
> --------------
> 
> # - Planner Method Configuration -
> 
> #enable_bitmapscan = on
> #enable_hashagg = on
> #enable_hashjoin = on
> #enable_indexscan = on
> #enable_mergejoin = on
> #enable_nestloop = on
> #enable_seqscan = on
> #enable_sort = on
> #enable_tidscan = on
> 
> # - Planner Cost Constants -
> 
> effective_cache_size = 80000		# 80000 * 8192 = 
> 655,360,000 bytes
> #effective_cache_size = 1000		# typically 8KB each
> 
> random_page_cost = 2.5			# units are one 
> sequential page fetch
> #random_page_cost = 4			# units are one 
> sequential page fetch
> 					# cost
> #cpu_tuple_cost = 0.01			# (same)
> #cpu_index_tuple_cost = 0.001		# (same)
> #cpu_operator_cost = 0.0025		# (same)
> 
> # - Genetic Query Optimizer -
> 
> #geqo = on
> #geqo_threshold = 12
> #geqo_effort = 5			# range 1-10
> #geqo_pool_size = 0			# selects default based 
> on effort
> #geqo_generations = 0			# selects default based 
> on effort
> #geqo_selection_bias = 2.0		# range 1.5-2.0
> 
> # - Other Planner Options -
> 
> default_statistics_target = 100		# range 1-1000
> #default_statistics_target = 10		# range 1-1000
> #constraint_exclusion = off
> #from_collapse_limit = 8
> #join_collapse_limit = 8		# 1 disables collapsing 
> of explicit
> 					# JOINs
> 
> 
> #-------------------------------------------------------------
> --------------
> #-------------------------------------------------------------
> --------------
> # RUNTIME STATISTICS
> #-------------------------------------------------------------
> --------------
> 
> # - Statistics Monitoring -
> 
> #log_parser_stats = off
> #log_planner_stats = off
> #log_executor_stats = off
> #log_statement_stats = off
> 
> # - Query/Index Statistics Collector -
> 
> stats_start_collector = on
> stats_command_string = on
> stats_block_level = on
> stats_row_level = on
> 
> #stats_start_collector = on
> #stats_command_string = off
> #stats_block_level = off
> #stats_row_level = off
> #stats_reset_on_server_start = off
> 
> 
> #-------------------------------------------------------------
> --------------
> # AUTOVACUUM PARAMETERS
> #-------------------------------------------------------------
> --------------
> 
> autovacuum = true
> autovacuum_naptime = 600
> 
> #autovacuum = false			# enable autovacuum subprocess?
> #autovacuum_naptime = 60		# time between 
> autovacuum runs, in secs
> #autovacuum_vacuum_threshold = 1000	# min # of tuple updates before
> 					# vacuum
> #autovacuum_analyze_threshold = 500	# min # of tuple updates before
> 					# analyze
> #autovacuum_vacuum_scale_factor = 0.4	# fraction of rel size before
> 					# vacuum
> #autovacuum_analyze_scale_factor = 0.2	# fraction of 
> rel size before
> 					# analyze
> #autovacuum_vacuum_cost_delay = -1	# default vacuum cost delay for
> 					# autovac, -1 means use
> 					# vacuum_cost_delay
> #autovacuum_vacuum_cost_limit = -1	# default vacuum cost limit for
> 					# autovac, -1 means use
> 					# vacuum_cost_
> 
> 
> ----------------------
> 
> CREATE TABLE tiger.completechain
> (
>    ogc_fid int4 NOT NULL DEFAULT
> nextval('completechain_ogc_fid_seq'::regclass),
>    module varchar(8) NOT NULL,
>    tlid int4 NOT NULL,
>    side1 int4,
>    source varchar(1) NOT NULL,
>    fedirp varchar(2),
>    fename varchar(30),
>    fetype varchar(4),
>    fedirs varchar(2),
>    cfcc varchar(3) NOT NULL,
>    fraddl varchar(11),
>    toaddl varchar(11),
>    fraddr varchar(11),
>    toaddr varchar(11),
>    friaddl varchar(1),
>    toiaddl varchar(1),
>    friaddr varchar(1),
>    toiaddr varchar(1),
>    zipl int4,
>    zipr int4,
>    aianhhfpl int4,
>    aianhhfpr int4,
>    aihhtlil varchar(1),
>    aihhtlir varchar(1),
>    census1 varchar(1),
>    census2 varchar(1),
>    statel int4,
>    stater int4,
>    countyl int4,
>    countyr int4,
>    cousubl int4,
>    cousubr int4,
>    submcdl int4,
>    submcdr int4,
>    placel int4,
>    placer int4,
>    tractl int4,
>    tractr int4,
>    blockl int4,
>    blockr int4,
>    wkb_geometry public.geometry NOT NULL,
>    CONSTRAINT enforce_dims_wkb_geometry CHECK 
> (ndims(wkb_geometry) = 2),
>    CONSTRAINT enforce_geotype_wkb_geometry CHECK
> (geometrytype(wkb_geometry) = 'LINESTRING'::text OR 
> wkb_geometry IS NULL),
>    CONSTRAINT enforce_srid_wkb_geometry CHECK 
> (srid(wkb_geometry) = 4269)
> )
> WITHOUT OIDS;
> ALTER TABLE tiger.completechain OWNER TO postgres;
> 
> 
> 
> 
> 
> 
> ---------------------------(end of 
> broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
> 
> 

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings