Re: Strange query plan

"Sorbara, Giorgio (CIOK)" <Giorgio.Sorbara@xxxxxxx> · Mon, 31 Oct 2011 14:52:21 +0100

Hi Tomas, 

and thank you for your reply.
Inline my comments

> -----Original Message-----
> From: Tomas Vondra [mailto:tv@xxxxxxxx]
> Sent: 28 October 2011 8:10 PM
> To: Sorbara, Giorgio (CIOK)
> Cc: pgsql-performance@xxxxxxxxxxxxxx
> Subject: Re:  Strange query plan
> 
> Hi,
> 
> On 28 Říjen 2011, 19:27, Sorbara, Giorgio (CIOK) wrote:
> > Dear all,
> >
> > I am new to PG but I have a solid background on tuning in Oracle and
> > MSSQL.
> > I have a query coming out from a piece of software from our SW-Stack
> (I
> > can't change it) and of course it takes a large amount of time.
> >
> > The table I am query are inherited (partitioned) and the query is the
> > following (names are changed for policy):
> 
> That's a bit ridiculous policy, especially as you've used the same fake
> column name (fk_column) for all columns. Does that mean you're reading
> just one column, or that there are actually more columns? I'd guess the
> first option, as fk_column is referenced twice in the select list ...

Sorry but that is the exact query (I'll ignore the policy and post the exact columns from now on).
Just to be clear is a query generated by Mondrian (ROLAP engine) for a degenerated dimension and it looks like this:

select
"f_suipy"."fk_theme" as "c0",
"f_suipy"."fk_theme" as "c1"
from
"gaez"."f_suipy" as "f_suipy"
where
"f_suipy"."fk_theme" = 'main_py_six_scxc'
group by
"f_suipy"."fk_theme"
order by
"f_suipy"."fk_theme" ASC;

we have a total of 18 partitions.

> >
> > the fk_column/somevalue is the "partition key" and the planner
> correctly
> > purge the inherited table accordingly.
> > Records in partitions vary from a min of 30M to max of 160M rows.
> >
> > 'Group  (cost=0.00..4674965.80 rows=200 width=17)'
> > '  ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17)'
> > '        ->  Index Scan using f_table_pkey on f_table
> (cost=0.00..5.64
> > rows=1 width=58)'
> > '              Index Cond: ((fk_column)::text = 'somevalue'::text)'
> > '        ->  Seq Scan on f_table _scxc f_table
> (cost=0.00..4360970.30
> > rows=125595944 width=17)'
> > '              Filter: ((fk_column)::text = 'somevalue'::text)'
> >
> > disabling the seq_scan do not help it forces the index but it takes
> ages.
> >
> > In each partition the value of fk_column is just one (being the
> partition
> > key) and I am expecting that this is checked on the constraint by the
> > planner.
> > Furthermore I have put an index on fk_column (tried both btree and
> hash)
> > however the plan is always a seq_scan on the partition, even if the
> index
> > has only one value?
> 
> I'm a bit confused right now. The fk_column is used for partitioning,
> so
> "fk_column = somevalue" actually means "give me all data from exactly
> one
> partition, right?

Yes, but there is an enforced constraint telling me that column can host only one value.

> 
> In that case the above behaviour is expected, because index scan would
> mean a lot of random I/O. MVCC in PostgreSQL works very differently,
> compared to Oracle for example - the indexes do not contain necessary
> visibility info (which transactions can see those records), so whenever
> you read a tuple from index, you have to check the data in the actual
> table.
> 
> So an index scan of the whole table means "read the whole index and the
> whole table" and the table is accessed randomly (which kinda defeats
> the
> db cache). So the sequential scan is the expected and perfectly sane.
> 
> BTW this should change in 9.2, as there is an index-only scan
> implementation.
> 
> > Regardless the constraint (which I think it should be taken into
> > consideration here) I am expecting that through Index Scan would
> easily
> > figure out that the value.
> > In theory there should be no need to access the table here but
> perform
> > everything on the index object (and of course in the "father" table).
> > Furthemore I don't understand why on the main table is using an index
> scan
> > (on 0 rows).
> 
> Not true. PostgreSQL MVCC does not work that - see explanation above.
> 
> > I fear I am missing something on Index usage in Postgres.
> 
> Yup, seems like that.

Ok... so since the index is not version aware I have to check the version in the "data" segment to be sure I am pointing at the right value.
I can see now there is no point at using this partitioning scheme... it was sort of perfect to me as I could drive the partition easily with a degenerated dimension. Except for this small issue (waiting more than 10 min is not an option).

Furthermore I am afraid that even partial indexes won't work.

> 
> Anyway, a few recommendations / questions:
> 
> 1) Don't post EXPLAIN output, post EXPLAIN ANALYZE if possible.

Group  (cost=0.00..4674965.80 rows=200 width=17) (actual time=13.375..550943.592 rows=1 loops=1)
   ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17) (actual time=13.373..524324.817 rows=125595932 loops=1)
         ->  Index Scan using f_suipy_pkey on f_suipy  (cost=0.00..5.64 rows=1 width=58) (actual time=0.019..0.019 rows=0 loops=1)
               Index Cond: ((fk_theme)::text = 'main_py_six_scxc'::text)
         ->  Seq Scan on f_suipy_main_py_six_scxc f_suipy  (cost=0.00..4360970.30 rows=125595944 width=17) (actual time=13.352..495259.117 rows=125595932 loops=1)
               Filter: ((fk_theme)::text = 'main_py_six_scxc'::text)
 Total runtime: 550943.699 ms
> 
> 2) How many live tuples are there? Are the really 125.595.945 live
> rows,
> or is the table bloated? When you do a "select count(*)" from the
> table,
> what number you get?

SELECT count(0)
  FROM gaez.f_suipy_main_py_six_scxc;

125595932

> 
> 3) What amount of data are we talking here? 125 million rows could be
> 10MB
> or 10GB, hard to guess.

pg_size_pretty tells me: 'f_suipy_main_py_six_scxc';'21 GB'

> 
> Tomas

g

ps: excellent benchmarking work on your site :)

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance