On Mon, Jan 07, 2019 at 04:09:50PM +0000, Abadie Lana wrote:
>     "channel_pkey" PRIMARY KEY, btree (channel_id)
>     "unique_chname" UNIQUE CONSTRAINT, btree (name)
>     "channel_name_channel_id_idx" btree (name, channel_id)

Note, the third index is more or less redundant.

> I would say that when you have a partitioned table, running analyse on the
> parent table (which includes the children) does not give the same result as
> running analyse on each individual child table. I don't know if it is an
> expected behaviour?

Right, for relkind='r' inheritance, ANALYZE parent gathers 1) stats for the
parent ONLY (stored in pg_stats with inherited='f'); and, 2) stats for the
parent and its children (stored in pg_stats with inherited='t').  It
*doesn't* update statistics for each of the children themselves.

Note however that for partitions of relkind='p' tables (available since
postgres 10), ANALYZE parent *ALSO* updates stats for the children.  (A short
sketch of what that means for your schema follows the quoted plan below.)

> But not the other query...still time-consuming because still using the
> wrong index in case of sample_buil_year (but curiously not the BRIN index)

> explain (analyze, buffers) select 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW',c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val from sample c WHERE c.channel_id = (SELECT channel_id FROM channel WHERE name='BUIL-B36-VA-RT-RT1:CL0001-2-ABW') order by c.smpl_time desc limit 5;

>  Limit  (cost=13.40..30.54 rows=5 width=112) (actual time=63411.725..63411.744 rows=3 loops=1)
>    Buffers: shared hit=38 read=193865
>    InitPlan 1 (returns $0)
>      ->  Index Scan using unique_chname on channel  (cost=0.41..8.43 rows=1 width=8) (actual time=0.039..0.040 rows=1 loops=1)
>            Index Cond: ((name)::text = 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW'::text)
>            Buffers: shared hit=4
>    ->  Result  (cost=4.96..5294364.58 rows=1544048 width=112) (actual time=63411.723..63411.740 rows=3 loops=1)
>          Buffers: shared hit=38 read=193865
>          ->  Merge Append  (cost=4.96..5278924.10 rows=1544048 width=80) (actual time=63411.719..63411.735 rows=3 loops=1)
>                Sort Key: c.smpl_time DESC
>                Buffers: shared hit=38 read=193865
>                ->  Index Scan Backward using sample_time_all_idx on sample c  (cost=0.12..8.14 rows=1 width=326) (actual time=0.048..0.048 rows=0 loops=1)
>                      Index Cond: (channel_id = $0)
>                      Buffers: shared hit=5
>                ->  Index Scan Backward using sample_time_b_idx on sample_buil c_1  (cost=0.42..7775.26 rows=2096 width=320) (actual time=0.008..0.009 rows=0 loops=1)
>                      Index Cond: (channel_id = $0)
>                      Buffers: shared hit=3
>                ->  Index Scan Backward using sample_time_c_idx on sample_ctrl c_2  (cost=0.42..77785.57 rows=22441 width=320) (actual time=0.006..0.006 rows=0 loops=1)
>                      Index Cond: (channel_id = $0)
>                      Buffers: shared hit=3
>                ->  Index Scan Backward using sample_time_u_idx on sample_util c_3  (cost=0.43..14922.72 rows=3830 width=320) (actual time=0.008..0.008 rows=0 loops=1)
>                      Index Cond: (channel_id = $0)
>                      Buffers: shared hit=3
>                ->  Index Scan Backward using sample_time_bm_idx on sample_buil_month c_4  (cost=0.56..2967.10 rows=740 width=74) (actual time=0.011..0.025 rows=3 loops=1)
>                      Index Cond: (channel_id = $0)
>                      Buffers: shared hit=8
>                ->  Index Scan Backward using sample_time_yb1_idx on sample_buil_year c_5  (cost=0.56..2186210.68 rows=665761 width=75) (actual time=63411.573..63411.574 rows=0 loops=1)
>                      Index Cond: (channel_id = $0)
>                      Buffers: shared read=193865

I think I see the issue..  Note, this is different than before.
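(First, the ANALYZE sketch mentioned above.  It's only illustrative -- it
assumes the relkind='r' inheritance setup you described, and the partition
names are just the ones appearing in the plans in this thread:)

    -- ANALYZE on the inheritance parent updates the parent-only stats and the
    -- inherited='t' stats, but not the per-child stats:
    ANALYZE sample;

    -- each relkind='r' child still needs its own ANALYZE:
    ANALYZE sample_buil_year;
    ANALYZE sample_ctrl_year;
    -- ...and so on for the remaining children

    -- to check which flavours of stats exist for a column:
    SELECT tablename, attname, inherited, n_distinct
    FROM pg_stats
    WHERE attname = 'channel_id'
      AND tablename IN ('sample', 'sample_buil_year');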
Initially the query was slow due to reading the indices for the entire
hierarchy, then sorting them, then joining:

| ->  Index Scan Backward using smpl_time_bx2_idx on sample_buil_year c_5  (cost=0.56..1897430.89 rows=50597832 width=328) (actual time=0.068..139840.439 rows=50597834 loops=1)
| ->  Index Scan Backward using smpl_time_cmx1_idx on sample_ctrl_month c_6  (cost=0.44..55253292.21 rows=18277124 width=85) (actual time=0.061..14610.389 rows=18277123 loops=1)
| ->  Index Scan Backward using smpl_time_cmx2_idx on sample_ctrl_year c_7  (cost=0.57..2987358.31 rows=79579072 width=76) (actual time=0.067..286316.865 rows=79579075 loops=1)
| ->  Index Scan Backward using smpl_time_ux1_idx on sample_util_month c_8  (cost=0.57..98830163.45 rows=70980976 width=82) (actual time=0.071..60766.643 rows=70980980 loops=1)
| ->  Index Scan Backward using smpl_time_ux2_idx on sample_util_year c_9  (cost=0.57..3070642.94 rows=80637888 width=83) (actual time=0.069..307091.673 rows=80637891 loops=1)

Then you ANALYZEd the parent tables, added indices and constraints, and
started getting bitmap scans with the new query using David's INTERVAL '0 sec':

| ...
| ->  Index Scan Backward using smpl_time_bx2_idx on sample_buil_year c_5  (cost=0.56..2023925.30 rows=3162364 width=320) (actual time=15167.330..15167.330 rows=0 loops=1)
|       Filter: (channel_id = $0)
|       Rows Removed by Filter: 50597834
|       Buffers: shared hit=25913147 read=713221
| ->  Index Scan Backward using sample_time_cm_idx on sample_ctrl_month c_6  (cost=0.56..1862587.12 rows=537562 width=77) (actual time=0.048..0.048 rows=0 loops=1)
|       Index Cond: (channel_id = $0)
|       Buffers: shared read=4
| ->  Index Scan Backward using smpl_time_cmx2_idx on sample_ctrl_year c_7  (cost=0.57..3186305.67 rows=2094186 width=68) (actual time=25847.549..25847.549 rows=0 loops=1)
|       Filter: (channel_id = $0)
|       Rows Removed by Filter: 79579075
|       Buffers: shared hit=49868991 read=1121715
| ...

I didn't notice this at first, but compare the two slow scans with the fast
one.  The slow scans have no index condition: they're reading the entire
index and FILTERING on channel_id rather than searching the index for it.

Now for the "bad" query you're getting:

| ...
| ->  Index Scan Backward using sample_time_bm_idx on sample_buil_month c_4  (cost=0.56..2967.10 rows=740 width=74) (actual time=0.011..0.025 rows=3 loops=1)
|       Index Cond: (channel_id = $0)
|       Buffers: shared hit=8
| ->  Index Scan Backward using sample_time_yb1_idx on sample_buil_year c_5  (cost=0.56..2186210.68 rows=665761 width=75) (actual time=63411.573..63411.574 rows=0 loops=1)
|       Index Cond: (channel_id = $0)
|       Buffers: shared read=193865
| ...

This time, the bad scan *is* kind-of searching on channel_id, but it's
reading the entire 1500MB index to do it ... because channel_id is not a
leading column:

|    "sample_time_yb1_idx" btree (smpl_time, channel_id)
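If that partition is meant to be searched by channel_id like the other
partitions apparently are, one option is an index with channel_id as the
leading column.  A sketch only -- the index name here is made up, and I
haven't thought about the disk space it would cost:

    -- channel_id leading, so "channel_id = $0" can seek directly instead of
    -- walking the whole (smpl_time, channel_id) btree; CONCURRENTLY avoids
    -- blocking writes while it builds
    CREATE INDEX CONCURRENTLY sample_buil_year_chan_time_idx
        ON sample_buil_year (channel_id, smpl_time);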
And I think the explanation is here:

> css_archive_3_0_0=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, inherited, null_frac, n_distinct, array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) n_hist, correlation FROM pg_stats WHERE attname='channel_id' AND tablename like 'sample%' ORDER BY 1 DESC;
>  frac_mcv |     tablename     |  attname   | inherited | null_frac | n_distinct | n_mcv | n_hist | correlation
> ----------+-------------------+------------+-----------+-----------+------------+-------+--------+-------------
>     5e-06 | sample_buil_year  | smpl_time  | f         |         0 |  -0.999918 |     7 |  10001 |    0.999978
> ...
>   0.99987 | sample_buil_year  | channel_id | f         |         0 |         76 |    16 |     60 |    0.207932

The table's physical storage is highly correlated WRT smpl_time, and poorly
correlated WRT channel_id.  That matters, given that the planner thinks the
index (1522 MB) will be almost entirely cached, but not the table (4492 MB):

| sample_buil_year | sample_time_yb1_idx | 4492 MB | 1522 MB

So the planner thinks that reading up to 1500MB of index from cache will pay
off in the ability to read the table sequentially.  If it searched the index
on channel_id instead, it would have to read 665761 tuples scattered across a
wide fraction of the table (a page here and a page there), defeating
readahead, rather than reading pages that are clustered/clumped together.

On Thu, Jan 03, 2019 at 12:57:27PM +0000, Abadie Lana wrote:
> Main parameters : effective_cache_size : 4GB, shared_buffers 4GB, work_mem 4MB

The issue here may just be that you have effective_cache_size=4GB, so the
planner thinks sample_time_yb1_idx is likely to be cached.  Try decreasing
that a lot, since it's clearly not cached?  (A quick, session-local way to
test that is sketched at the end of this mail.)  Also,
effective_cache_size==shared_buffers is only accurate if you've allocated the
server's entire RAM to shared_buffers, which is unreasonable (or perhaps if
the OS cache is 10x busier with other processes than with postgres).

I'm not sure why your query plan changed after adding a BRIN index... it
wasn't actually used; the plan preferred to scan the original index on
channel_id, as you hoped:

| ->  Bitmap Heap Scan on sample_buil_year c_5  (cost=15416.21..627094.50 rows=665761 width=83) (actual time=0.008..0.008 rows=0 loops=1)
|       Recheck Cond: (channel_id = t.channel_id)
|       ->  Bitmap Index Scan on sample_time_by_idx  (cost=0.00..15249.77 rows=665761 width=0) (actual time=0.007..0.007 rows=0 loops=1)
|             Index Cond: (channel_id = t.channel_id)

Justin
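(The session-local test referred to above -- a minimal sketch, assuming you
just want to see whether the plan flips before touching postgresql.conf; the
512MB value is only a placeholder to experiment with:)

    SET effective_cache_size = '512MB';   -- affects this session only

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT c.smpl_time, c.nanosecs, c.float_val, c.num_val, c.str_val,
           c.datatype, c.array_val
    FROM sample c
    WHERE c.channel_id = (SELECT channel_id FROM channel
                          WHERE name = 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW')
    ORDER BY c.smpl_time DESC
    LIMIT 5;

    RESET effective_cache_size;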