Re: Are bitmap index scans slow to start?

"Carlo Stonebanks" <stonec.register@xxxxxxxxxxxx> · Fri, 22 Feb 2013 15:05:34 -0500

>> Also, you might could try clustering newly created tables on session_id and setting the fillfactor down so rows with the same session id will stick together on disk.
<< 

My understanding of PG’s cluster is that this is a one-time command that creates a re-ordered table and doesn’t maintain the clustered order until the command is issued again. During the CLUSTER, the table is read and write locked. So, in order for me to use this I would need to set up a timed event to CLUSTER occasionally.

>> I can't really help, but I can make it more clear why postgres is choosing a _bitmap_ index scan rather than a regular index scan
<< 

The EXPLAIN ANALYZE is showing it is taking a long time to prepare the bitmap (i.e.->  Bitmap Index Scan on log_2013_01_session_idx  (cost=0.00..63186.52
rows=2947664 width=0) (actual time=32611.918..32611.918 rows=2772042 loops=1)" Index Cond: (session_id = 27)" the bitmap scan is actually very fast. Jeff sasys that the bitmap is not cached, so I will assume the PG general caches being created are of general use. 

I think what I need to do is figure out is:

1)      Why does it take 36 seconds to set up the general index caches?
2)      What can I do about it (what stats do I need to look at)?
3)      How can I force these caches to expire so I can tell if the strategy worked?

From: Nikolas Everett [mailto:nik9000@xxxxxxxxx] 
Sent: February 22, 2013 2:05 PM
To: Carlo Stonebanks
Cc: Marc Mamin; Jeff Janes; pgsql-performance@xxxxxxxxxxxxxx
Subject: Re:  Are bitmap index scans slow to start?

I can't really help, but I can make it more clear why postgres is choosing a _bitmap_ index scan rather than a regular index scan.  With a regular index scan it pumps the index for the locations of the rows that it points to and loads those rows as it finds them.  This works great if the rows in the index are sorta sorted - that way it isn't jumping around the table randomly.  Random io is slow.  In a bitmap index scan pg pumps the index and buffers the by shoving them in a big bitmap.  Then, it walks the bitmap in order to produce in order io.  PG makes the choice based on a measure of the index's correlation.

The problem comes down to you inserting the sessions concurrently with one another.  My instinct would be to lower the FILLFACTOR on newly created indecies so they can keep their entries more in order.  I'm not sure why I have that instinct but it feels right.  Also, you might could try clustering newly created tables on session_id and setting the fillfactor down so rows with the same session id will stick together on disk.

Now that I look stuff up on the internet I'm not sure where I saw that pg tries to maintain a cluster using empty space from FILLFACTOR but I _think_ it does.  I'm not sure what is going on with my google foo today.

Nik

On Fri, Feb 22, 2013 at 12:50 PM, Carlo Stonebanks <stonec.register@xxxxxxxxxxxx> wrote:
A cool idea, but if I understand it correctly very specific and fussy. New DB’s are spawned on this model, and all the developers would have to be aware of this non-standard behaviour, and DBA”s would have to create these indexes every month, for every DB (as the log tables are created every month). There are 89 session_id values in the January log (log_2013_01) so this would quickly get out of control. But – like I said – an interesting idea for more specific challenges.

From: Marc Mamin [mailto:M.Mamin@xxxxxxxxxxxx] 
Sent: February 21, 2013 2:41 PM
To: Jeff Janes; Carlo Stonebanks
Cc: pgsql-performance@xxxxxxxxxxxxxx
Subject: AW:  Are bitmap index scans slow to start?

>Rebuilding the index might help, as it would put all the leaf pages holding values for session_id=27 adjacent to each other, so they would read from disk faster.  But with a name like >"session_id", I don't know how long such clustering would last though.

>If I'm right about the index disk-read time, then switching to a plain index scan rather than a bitmap index scan would make no difference--either way the data has to come off the disk.  

>>I'd prefer a
>>strategy that allowed fast performance the first time, rather than slow the
>>first time and extremely fast subsequently.
Hello,

if the index is only used to locate rows for single session_id, you may consider split it in a set of partial indexes.

e.g. 
create index i_0 on foo where session_id%4 =0;
create index i_1 on foo where session_id%4 =1;
create index i_2 on foo where session_id%4 =2;
create index i_3 on foo where session_id%4 =3;

(can be built in parallel using separate threads)

Then you will have to ensure that all your WHERE clauses also contain the index condition:

WHERE session_id = 27 AND session_id%4 =27%4

regards,

Marc Mamin