Re: Expected performance of querying 5k records from 4 million records?

Thanks for the help, Jeff and Josh.  It looks like reclustering on the multi-column index might solve things.  For my particular query, because I'm getting a range of records back, it makes sense that reclustering will benefit me on a slow disk, even though I had expected the indexes alone to be sufficient.  I now need to make sure that the speed-up I'm seeing is not just because things have been cached.

That being said, here's what I have:
2 CPUs, 12 physical cores, hyperthreaded (24 virtual cores), 2.67GHz
96G RAM, 80G available to dom0
CentOS 5.8, Xen
3Gbps SATA (7200 RPM, Hitachi ActiveStar Enterprise Class)

So, I have lots of RAM, but not necessarily the fastest disk.

default_statistics_target = 50 # pgtune wizard 2011-03-16
maintenance_work_mem = 1GB # pgtune wizard 2011-03-16
constraint_exclusion = on # pgtune wizard 2011-03-16
checkpoint_completion_target = 0.9 # pgtune wizard 2011-03-16
effective_cache_size = 24GB # pgtune wizard 2011-03-16
work_mem = 192MB # pgtune wizard 2011-03-16
wal_buffers = 8MB # pgtune wizard 2011-03-16
checkpoint_segments = 128 # pgtune wizard 2011-03-16, amended by am, 30may2011
shared_buffers = 4GB # pgtune wizard 2011-03-16
max_connections = 100 # pgtune wizard 2011-03-16: 80, bumped up to 100
max_locks_per_transaction = 1000

I didn't know about explain (analyze, buffers).  Very cool.  So, based on your advice, I ran it and here's what I found:
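For reference, the query itself is essentially the one from your mail below, adapted for partition 213 (a sketch; table and column names as already shown):

explain (analyze, buffers)
select data_id, dataset_id, stat
from data_part_213
where dataset_id = 213
  and stat_id = 6
  and data_id >= 50544630 and data_id <= 50549979;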

1st time I ran the query:
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on data_part_213  (cost=113.14..13725.77 rows=4189 width=16) (actual time=69.807..2763.174 rows=5350 loops=1)
   Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
   Filter: ((dataset_id = 213) AND (stat_id = 6))
   Buffers: shared read=4820
   ->  Bitmap Index Scan on data_unq_213  (cost=0.00..112.09 rows=5142 width=0) (actual time=51.918..51.918 rows=5350 loops=1)
         Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
         Buffers: shared read=19
 Total runtime: 2773.099 ms
(8 rows)

The second time I ran the query, it was very fast, since all the buffered read counts have turned into hit counts, showing I'm reading from cache (as I expected):
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on data_part_213  (cost=113.14..13725.77 rows=4189 width=16) (actual time=1.661..14.376 rows=5350 loops=1)
   Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
   Filter: ((dataset_id = 213) AND (stat_id = 6))
   Buffers: shared hit=4819
   ->  Bitmap Index Scan on data_unq_213  (cost=0.00..112.09 rows=5142 width=0) (actual time=0.879..0.879 rows=5350 loops=1)
         Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
         Buffers: shared hit=18
 Total runtime: 20.232 ms
(8 rows)



Next, I tried reclustering a partition on the multi-column index (the command itself is sketched after the plan below).  The big thing is that the read count has dropped dramatically!
 Index Scan using data_part_214_dataset_stat_data_idx on data_part_214  (cost=0.00..7223.05 rows=4265 width=16) (actual time=0.093..7.251 rows=5350 loops=1)
   Index Cond: ((dataset_id = 214) AND (data_id >= 50544630) AND (data_id <= 50549979) AND (stat_id = 6))
   Buffers: shared hit=45 read=24
 Total runtime: 12.929 ms
(4 rows)
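
For reference, the recluster itself was something like this (a sketch, using the partition and index names from the plan above):

cluster data_part_214 using data_part_214_dataset_stat_data_idx;
analyze data_part_214;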


second time:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using data_part_214_dataset_stat_data_idx on data_part_214  (cost=0.00..7223.05 rows=4265 width=16) (actual time=0.378..7.696 rows=5350 loops=1)
   Index Cond: ((dataset_id = 214) AND (data_id >= 50544630) AND (data_id <= 50549979) AND (stat_id = 6))
   Buffers: shared hit=68
 Total runtime: 13.511 ms
(4 rows)

So, it looks like clustering on the appropriate index fixes things!  Also, I'll recreate the index with the column order switched to (dataset_id, stat_id, data_id), roughly as sketched below.
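
Per partition, something like this (a sketch; names follow the partition-214 example above):

drop index data_part_214_dataset_stat_data_idx;
create index data_part_214_dataset_stat_data_idx
    on data_part_214 (dataset_id, stat_id, data_id);
cluster data_part_214 using data_part_214_dataset_stat_data_idx;
analyze data_part_214;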

thanks!

On Fri, Jun 15, 2012 at 11:20 AM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
On Fri, Jun 15, 2012 at 9:17 AM, Anish Kejariwal <anishkej@xxxxxxxxx> wrote:
>
> Below are the tables, queries, and execution plans with my questions with
> more detail.  (Since I have 250 partitions, I can query one partition after
> the other to ensure that I'm not pulling results from the cache)

Doesn't that explain why it is slow?  If you have 15000 rpm drives and
each row is in a different block and uncached, it would take 20
seconds to read them all in.  You are getting 10 times better than
that, either due to caching or because your rows are clustered, or
because effective_io_concurrency is doing its thing.

>
> explain analyze select data_id, dataset_id, stat from data_part_201 where
> dataset_id = 201
> and stat_id = 6 and data_id>=50544630 and data_id<=50549979;

What does "explain (analyze, buffers)" show?


> QUESTION 1: you can see that the query is very simple.  is this the optimal
> execution plan? any tips on what to look into to increase performance?
>
> I then tried adding the following multi-column index:
> "data_part_202_dataset_regionset_data_idx" btree (dataset_id, data_id,
> stat_id)

Since you query stat_id for equality and data_id for range, you should
probably reverse the order of those columns in the index.


>
> QUESTION 3:
> If I do the following:  reindex table data_part_204 the query now takes
> 50-70 milliseconds.  Is this because the table is getting cached?  How do I
> know if a particular query is coming from the cache?

Using explain (analyze, buffers) will show you if it is coming from
the shared_buffers cache.

It is harder to see if it is coming from the file system cache.  If
the server is mostly idle other than your stuff, you can run vmstat
and see how much physical IO is caused by your activity.

Cheers,

Jeff

