Re: RAID arrays and performance

On Thu, 18 Sep 2008, Greg Smith wrote:
>> It's just that our system is doing a lot of bitmap index scans at the moment, and it'd help to be able to spread them across the 16 discs in the RAID array. It's the bottleneck in our system at the moment.
>
> If you have some specific bitmap index scan test case suggestions you can pass along (either publicly or in private to me, I can probably help anonymize them), that's one of the things that has been holding this up.

Okay, here's a description of what we're doing. We are importing data from a large file, so we have no choice about the order the data arrives in. For each entry in the source, we have to look up the corresponding row in the database and issue the appropriate UPDATE, depending on what's in the database and what's in the data source. Long story - it isn't feasible to change the overall process.

In order to improve the performance, I made the system look ahead in the source, in groups of a thousand entries, so instead of running:

SELECT * FROM table WHERE field = 'something';

a thousand times, we now run:

SELECT * FROM table WHERE field IN ('something', 'something else'...);

with a thousand values in the IN list. It's a very simple query, and it does run faster than the individual queries, but it still takes quite a while. Here is an example query:

SELECT a1_.id AS a1_id, a1_.primaryIdentifier AS a2_ FROM Gene AS a1_ WHERE a1_.primaryIdentifier IN ('SPAC11D3.15', 'SPAC11D3.16c', 'SPAC11D3.17', 'SPAC11D3.18c', 'SPAC15F9.01c', 'SPAC15F9.02', 'SPAC16.01', 'SPAC18G6.01c', 'SPAC18G6.02c', 'SPAC18G6.04c', 'SPAC18G6.05c', 'SPAC18G6.06', 'SPAC18G6.07c', 'SPAC18G6.09c', 'SPAC18G6.10', 'SPAC18G6.11c', 'SPAC18G6.12c', 'SPAC18G6.13', 'SPAC18G6.14c', 'SPAC18G6.15', 'SPAC1B9.01c', 'SPAC1D4.02c', 'SPAC1D4.03c', 'SPAC1D4.04', 'SPAC1D4.05c', 'SPAC1D4.07c', 'SPAC1D4.08', 'SPAC1D4.09c', 'SPAC1D4.10', 'SPAC1D4.11c', 'SPAC1F3.11', 'SPAC23A1.10', 'SPAC23E2.01', 'SPAC23E2.02', 'SPAC23E2.03c', 'SPAC26A3.02', 'SPAC26A3.03c', 'SPAC26A3.05', 'SPAC26A3.06', 'SPAC26A3.07c', 'SPAC26A3.08', 'SPAC26A3.09c', 'SPAC26A3.10', 'SPAC26A3.11', 'SPAC26A3.14c', 'SPAC26A3.15c', 'SPAC26A3.16', 'SPAC27F1.01c', 'SPAC27F1.03c', 'SPAC27F1.04c', 'SPAC27F1.05c', 'SPAC27F1.06c', 'SPAC3H8.02', 'SPAC3H8.03', 'SPAC3H8.04', 'SPAC3H8.05c', 'SPAC3H8.06', 'SPAC3H8.07c', 'SPAC3H8.08c', 'SPAC3H8.09c', 'SPAC3H8.10', 'SPAC3H8.11', 'SPAC8E11.11', 'SPBC106.15', 'SPBC17G9.10', 'SPBC24E9.15c', 'WBGene00000969', 'WBGene00003035', 'WBGene00004095', 'WBGene00016011', 'WBGene00018672', 'WBGene00018674', 'WBGene00018675', 'WBGene00018676', 'WBGene00018959', 'WBGene00018960', 'WBGene00018961', 'WBGene00023407') ORDER BY a1_.id LIMIT 2000;

And the corresponding EXPLAIN ANALYSE:

 Limit  (cost=331.28..331.47 rows=77 width=17) (actual time=121.973..122.501 rows=78 loops=1)
   ->  Sort  (cost=331.28..331.47 rows=77 width=17) (actual time=121.968..122.152 rows=78 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 29kB
         ->  Bitmap Heap Scan on gene a1_  (cost=174.24..328.87 rows=77 width=17) (actual time=114.311..121.705 rows=78 loops=1)
               Recheck Cond: (primaryidentifier = ANY ('{SPAC11D3.15...
               ->  Bitmap Index Scan on gene__key_primaryidentifier  (cost=0.00..174.22 rows=77 width=0) (actual time=44.434..44.434 rows=150 loops=1)
                     Index Cond: (primaryidentifier = ANY ('{SPAC11D3.15,SPAC11D3.16c...
 Total runtime: 122.733 ms
(9 rows)

The data is probably in the cache by now, though - it took 1073 ms the first time. The table has half a million rows, but tables all over the database are being accessed, so the cache is shared between several hundred million rows.
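For reference, the look-ahead batching described above can be sketched as follows. This is a minimal illustration, not our actual import code: `batched_queries` is a made-up name, and the `= ANY(%s)` form (equivalent to IN, and what the planner shows in the EXPLAIN output) assumes a driver that accepts array parameters, such as psycopg2.

```python
from itertools import islice

def chunks(iterable, size=1000):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def batched_queries(identifiers, size=1000):
    """Instead of one SELECT per source entry, build one query per batch."""
    for batch in chunks(identifiers, size):
        yield ("SELECT a1_.id, a1_.primaryIdentifier FROM Gene AS a1_ "
               "WHERE a1_.primaryIdentifier = ANY(%s) ORDER BY a1_.id",
               (batch,))
```

Each yielded (sql, params) pair then replaces a thousand round trips with one.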

Postgres executes this query in two stages. First it trawls the index (on the field) and builds a bitmap; then it fetches the heap pages according to the bitmap. I can see the second stage being quite easy to adapt for fadvise, but the first stage would be a little trickier. Both stages are equally important, as they take a comparable amount of time.
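To illustrate why the second stage is the easy one: once the bitmap is built, all the heap block numbers are known up front, so the kernel can be told about every block before the first read happens. A minimal sketch of the idea (this is not Postgres code; the function name is made up, and 8192 is just Postgres' default page size):

```python
import os

BLOCK_SIZE = 8192  # Postgres' default page size

def prefetch_blocks(path, block_numbers):
    """Hint the kernel to start reading the listed blocks ahead of time,
    so a multi-spindle array can service several seeks concurrently."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for blk in sorted(block_numbers):
            os.posix_fadvise(fd, blk * BLOCK_SIZE, BLOCK_SIZE,
                             os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

With 16 spindles, the RAID layer could then keep many discs seeking at once instead of fetching pages one at a time.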

We are running this database on a 16-spindle RAID array, so fully utilising those spindles would benefit our process considerably. I'm also considering whether I can parallelise things a little myself.
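One way I could parallelise on our side, sketched below under assumptions (the names `parallel_lookup` and `run_batch` are placeholders; `run_batch` stands for executing one IN-list query on its own connection): split each thousand-entry batch into slices and run them concurrently, so several discs seek at once.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_lookup(identifiers, run_batch, workers=4):
    """Split one big IN list into `workers` slices and run them
    concurrently, each (in real use) on its own database connection."""
    slices = [identifiers[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(run_batch, slices)
    rows = [row for chunk in results for row in chunk]
    rows.sort()  # restore the ORDER BY id ordering across slices
    return rows
```

Whether this wins in practice depends on whether the random reads, rather than the index trawl, dominate - which the EXPLAIN above suggests is only half the story.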

> Alternately, if you'd like to join in on testing this all out, more help would certainly be welcome.

How would you like me to help?

Matthew

--
What goes up must come down. Ask any system administrator.
