> -----Original Message-----
> From: Andreas Kretschmer <andreas@xxxxxxxxxxxxxxx>
> Sent: Friday, January 25, 2019 00:55
> To: pgsql-performance@xxxxxxxxxxxxxxxxxxxx
> Subject: Re: Zero throughput on a query on a very large table.
>
> On 25.01.19 at 06:20, ldh@xxxxxxxxxxxxxxxxxx wrote:
> >
> > Hello,
> >
> > We have been stuck for the past week on a query that simply won’t
> > “execute”. We have a table with 1.2B rows that took around 14h to
> > load, but a simple select takes forever and after 10h, still no
> > records are coming through.
> >
> > Environment:
> >
> > - Table tmp_outpatient_rev with 41 VARCHAR columns
> >   (desy_sort_key, claim_no, clm_line_num, clm_thru_dt, nch_clm_type_cd,
> >   rev_cntr, rev_cntr_dt, …)
> > - 1.2B rows (Billion with a ‘B’)
> > - A single unique index on columns desy_sort_key, claim_no, clm_line_num
> > - select pg_size_pretty(pg_relation_size('tmp_outpatient_rev')) --> 215GB
> > - Database server: 64GB RAM, 8 cores/16 threads, 10K RPM HDDs
> > - Linux
> > - PG 11.1
> >
> > Query:
> >
> >     select * from tmp_outpatient_rev order by desy_sort_key, claim_no
> >
> > Plan:
> >
> >     Gather Merge (cost=61001461.16..216401602.29 rows=1242732290 width=250)
> >       Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt,
> >               nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >       Workers Planned: 10
> >       ->  Sort (cost=61000460.97..61311144.04 rows=124273229 width=250)
> >             Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt,
> >                     nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >             Sort Key: tmp_outpatient_rev.desy_sort_key, tmp_outpatient_rev.claim_no
> >             ->  Parallel Seq Scan on public.tmp_outpatient_rev
> >                 (cost=0.00..29425910.29 rows=124273229 width=250)
> >                   Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt,
> >                           nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >
> > Method of access:
> >
> > - Using Pentaho Kettle (an ETL tool written in Java and using JDBC),
> >   we simply issue the query and expect records to start streaming in ASAP.
> > - The issue was replicated with really basic JDBC code in a Java test
> >   program.
> > - The database doesn't have much other data, and the table was loaded
> >   from a CSV data source with LOAD over something like 14h (average
> >   throughput of about 25K rows/s).
> > - Settings:
> >
> >       alter database "CMS_TMP" set seq_page_cost=1;
> >       alter database "CMS_TMP" set random_page_cost=4;
> >       alter database "CMS_TMP" set enable_seqscan=true;
> >
> > - JDBC connection string with no extra params.
> > - Database has been generally configured properly.
> >
> > Problem:
> >
> > - The plan shows a full table scan followed by a sort, and then a
> >   gather merge. With 1.2B rows, it's crazy to try to sort all that 😊
> > - After 10h, the query is still "silent" and no record is streaming in.
> >   IO is very high (80-90% disk throughput utilization) on the machine
> >   (the sort…).
> > - I have tried to hack the planner to force an index scan (which would
> >   avoid the sort/gather steps and should start streaming data right
> >   away), in particular enable_seqscan=false or seq_page_cost=2. To my
> >   surprise, this had ZERO impact on the plan.
> > - I changed the “order by” to include all 3 columns from the index, or
> >   created a non-unique index with only the first 2 columns, all to no
> >   effect whatsoever either.
> > - The table was written over almost 14h at about 25K rows/s, and it
> >   seems to me I should be able to read the data back at least as fast.
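> >
> > For reference, the forcing attempts looked roughly like this (a
> > sketch; shown here as session-level set commands mirroring the ALTER
> > DATABASE settings above):
> >
> >     set enable_seqscan = false;  -- heavily penalizes seq scans; does not forbid them
> >     set seq_page_cost = 2;       -- makes sequential pages look more expensive
> >     explain
> >     select * from tmp_outpatient_rev order by desy_sort_key, claim_no;
> >     -- note: values changed with ALTER DATABASE ... SET only apply to
> >     -- sessions opened after the change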
> >
> > Why is a simple index scan not used? Why are all our efforts to try
> > to force the use of the index failing?
> >
>
> the query isn't that simple: there is no where condition, so PG has to
> read the whole table, and the index is useless. Would it be enough to
> select only the columns covered by the index?
> (run a vacuum on the table after loading the data; that can enable an
> index-only scan in this case)
>
> Regards, Andreas
>
> --
> 2ndQuadrant - The PostgreSQL Support Company.
> www.2ndQuadrant.com

Well, even without a where clause, a straight select with an order by
matching an index should be able to use that index. The index scan might
be somewhat slower overall, but it would start streaming data right away.
I guess what I am pointing out is that in ETL scenarios, steady continuous
throughput can matter more than total overall query time (see the sketches
in the P.S. below).

Thank you,
Laurent.
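
P.S. If I follow Andreas's suggestion, the index-only-scan experiment would
look roughly like this (a sketch, untested here; it assumes the unique index
on (desy_sort_key, claim_no, clm_line_num) is valid):

    vacuum analyze tmp_outpatient_rev;  -- sets the visibility map, which
                                        -- index-only scans depend on

    explain
    select desy_sort_key, claim_no, clm_line_num  -- only index-covered columns
    from tmp_outpatient_rev
    order by desy_sort_key, claim_no, clm_line_num;
    -- hoping for an "Index Only Scan" node instead of Seq Scan + Sort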
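
P.P.S. On the streaming point: a server-side cursor is one way to get rows
flowing immediately, since the planner costs cursors for fast startup (see
cursor_tuple_fraction). A sketch, with an arbitrary cursor name and batch size:

    begin;
    declare rev_cur no scroll cursor for
        select * from tmp_outpatient_rev
        order by desy_sort_key, claim_no;
    fetch forward 10000 from rev_cur;  -- repeat until it returns no rows
    close rev_cur;
    commit;

The JDBC driver can do the same thing under the hood when autocommit is off
and a fetch size is set on the statement.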