> -----Original Message-----
> From: Andreas Kretschmer <andreas@xxxxxxxxxxxxxxx>
> Sent: Friday, January 25, 2019 00:55
> To: pgsql-performance@xxxxxxxxxxxxxxxxxxxx
> Subject: Re: Zero throughput on a query on a very large table.
>
> On 25.01.19 at 06:20, ldh@xxxxxxxxxxxxxxxxxx wrote:
> >
> > Hello,
> >
> > We have been stuck for the past week on a query that simply won’t
> > “execute”. We have a table with 1.2B rows that took around 14h to
> > load, but a simple select takes forever and after 10h, still no
> > records are coming through.
> >
> > Environment:
> >
> > - Table tmp_outpatient_rev with 41 VARCHAR columns
> >   (desy_sort_key, claim_no, clm_line_num, clm_thru_dt, nch_clm_type_cd,
> >   rev_cntr, rev_cntr_dt, …)
> > - 1.2B rows (Billion with a ‘B’)
> > - A single unique index on columns desy_sort_key, claim_no, clm_line_num
> > - select pg_size_pretty(pg_relation_size('tmp_outpatient_rev')) --> 215GB
> > - Database server: 64GB RAM, 8 cores/16 threads, 10K RPM HDDs
> > - Linux
> > - PG 11.1
> >
> > Query:
> >
> >     select * from tmp_outpatient_rev order by desy_sort_key, claim_no
> >
> > Plan:
> >
> >     Gather Merge (cost=61001461.16..216401602.29 rows=1242732290 width=250)
> >       Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt,
> >               nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >       Workers Planned: 10
> >       ->  Sort (cost=61000460.97..61311144.04 rows=124273229 width=250)
> >             Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt,
> >                     nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >             Sort Key: tmp_outpatient_rev.desy_sort_key, tmp_outpatient_rev.claim_no
> >             ->  Parallel Seq Scan on public.tmp_outpatient_rev
> >                 (cost=0.00..29425910.29 rows=124273229 width=250)
> >                   Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt,
> >                           nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >
> > Method of access:
> >
> > - Using Pentaho Kettle (an ETL tool written in Java and using JDBC),
> >   we simply issue the query and expect records to start streaming in ASAP.
> > - The issue was replicated with really basic JDBC code in a Java test
> >   program.
> > - The database doesn't have much other data, and the table was loaded
> >   from a CSV data source with LOAD over something like 14h (average
> >   throughput of about 25K rows/s).
> > - Settings:
> >
> >       alter database "CMS_TMP" set seq_page_cost=1;
> >       alter database "CMS_TMP" set random_page_cost=4;
> >       alter database "CMS_TMP" set enable_seqscan=true;
> >
> > - JDBC connection string with no extra params.
> > - Database has been generally configured properly.
> >
> > Problem:
> >
> > - The plan shows a full table scan followed by a sort, and then a
> >   gather merge. With 1.2B rows, it's crazy to try to sort all that 😊
> > - After 10h, the query is still "silent" and no record is streaming in.
> >   IO is very high (80-90% disk throughput utilization) on the machine
> >   (the sort…).
> > - I have tried to hack the planner to force an index scan (which would
> >   avoid the sort/gather steps and should start streaming data right
> >   away), in particular enable_seqscan=false or seq_page_cost=2. To my
> >   surprise, this had ZERO impact on the plan.
> > - I changed the “order by” to include all 3 columns from the index, or
> >   created a non-unique index with only the first 2 columns, all to no
> >   effect whatsoever either.
> > - The table was written over almost 14h at about 25K rows/s, and it
> >   seems to me I should be able to read the data back at least as fast.
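> >
> > For reference, the forcing attempts looked roughly like this (a
> > sketch; shown here as session-level set commands mirroring the ALTER
> > DATABASE settings above):
> >
> >     set enable_seqscan = false;  -- heavily penalizes seq scans; does not forbid them
> >     set seq_page_cost = 2;       -- makes sequential pages look more expensive
> >     explain
> >     select * from tmp_outpatient_rev order by desy_sort_key, claim_no;
> >     -- note: values changed with ALTER DATABASE ... SET only apply to
> >     -- sessions opened after the change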
> >
> > Why is a simple index scan not used? Why are all our efforts to try
> > to force the use of the index failing?
> >
>
> the query isn't that simple: there is no where condition, so PG has to
> read the whole table, and the index is useless. Would it be enough to
> select only the columns covered by the index?
> (run a vacuum on the table after loading the data; that can enable an
> index-only scan in this case)
>
> Regards, Andreas
>
> --
> 2ndQuadrant - The PostgreSQL Support Company.
> www.2ndQuadrant.com

Well, even without a where clause, a straight select with an order by
matching an index should be able to use that index. The index scan might
be somewhat slower overall, but it would start streaming data right away.
I guess what I am pointing out is that in ETL scenarios, steady continuous
throughput can matter more than total overall query time (see the sketches
in the P.S. below).

Thank you,
Laurent.
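
P.S. If I follow Andreas's suggestion, the index-only-scan experiment would
look roughly like this (a sketch, untested here; it assumes the unique index
on (desy_sort_key, claim_no, clm_line_num) is valid):

    vacuum analyze tmp_outpatient_rev;  -- sets the visibility map, which
                                        -- index-only scans depend on

    explain
    select desy_sort_key, claim_no, clm_line_num  -- only index-covered columns
    from tmp_outpatient_rev
    order by desy_sort_key, claim_no, clm_line_num;
    -- hoping for an "Index Only Scan" node instead of Seq Scan + Sort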
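
P.P.S. On the streaming point: a server-side cursor is one way to get rows
flowing immediately, since the planner costs cursors for fast startup (see
cursor_tuple_fraction). A sketch, with an arbitrary cursor name and batch size:

    begin;
    declare rev_cur no scroll cursor for
        select * from tmp_outpatient_rev
        order by desy_sort_key, claim_no;
    fetch forward 10000 from rev_cur;  -- repeat until it returns no rows
    close rev_cur;
    commit;

The JDBC driver can do the same thing under the hood when autocommit is off
and a fetch size is set on the statement.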