Slow performance with Group By

Greetings all,

This question has probably been asked many times, but I was unable to search the list archives, since the term "Group" matches thousands of messages mentioning user groups... so apologies if I'm repeating!

Here's the problem: I have a table of 10,000,000 records called "indethom", with each record representing one word in the works of a particular author. Among other columns, each record contains a CHAR(5) column called "codelemm" holding the "lemma" code (i.e. which word it is), and an integer column called "sectref" identifying the textual unit, i.e. a chapter or other division of a work (these are numbered consecutively from 0 to around 50,000).

What I want is a count of how many times each word occurs in each textual unit (with no row returned for textual units where a given word doesn't appear). I group by "sectref" and tally the occurrences with COUNT(codelemm); "codelemm" also has to appear in the GROUP BY, since Postgres requires every non-aggregated column in the select list to be grouped on. Here's the query as I have it:

> create table matrix2.tuo as select codelemm, sectref, count(codelemm) from indethom group by codelemm, sectref;

And the explain results are as follows:

>it=> explain select codelemm, sectref, count(codelemm) from indethom group by codelemm, sectref;
>                                   QUERY PLAN
>--------------------------------------------------------------------------------
> GroupAggregate  (cost=2339900.60..2444149.44 rows=1790528 width=13)
>   ->  Sort  (cost=2339900.60..2364843.73 rows=9977252 width=13)
>         Sort Key: codelemm, sectref
>         ->  Seq Scan on indethom  (cost=0.00..455264.52 rows=9977252 width=13)
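
One thought I had was that if the aggregation were given enough memory, the planner might use a HashAggregate instead of sorting all 10 million rows first. A sketch of that experiment (assuming a 7.4-era server, where the knob is called sort_mem and is measured in kB; it was renamed work_mem in later releases, and the 256MB figure below is just a guess at what the ~1.8M-group hash table needs):

> -- Session-local experiment: raise per-sort/hash memory, then re-check the plan.
> SET sort_mem = 262144;  -- 262144 kB = 256 MB; a guess, not a measured requirement
> EXPLAIN select codelemm, sectref, count(codelemm)
>     from indethom group by codelemm, sectref;

If the plan switches to HashAggregate, the big Sort step should disappear entirely.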

I have an index defined as follows:

> create index indethom_clemm_sect_ndx on indethom using btree(codelemm, sectref);

I also performed an ANALYZE after creating the index.
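
Another thing I've wondered about: since the sort key matches this index exactly, physically reordering the table on it might make an index-scan plan cheap enough to win. A sketch (assuming the 7.x syntax, CLUSTER indexname ON tablename; note that CLUSTER rewrites the whole table under an exclusive lock, so it's not something to run casually on 10 million rows):

> -- Rewrite indethom in indethom_clemm_sect_ndx order, then refresh the stats.
> CLUSTER indethom_clemm_sect_ndx ON indethom;
> ANALYZE indethom;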

I have a gut feeling that there's got to be a better way than sequentially scanning all 10,000,000 records, but I'll be darned if I can find any way to improve things here.
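
For what it's worth, here's the diagnostic I understand people use to see what the index-scan alternative would actually cost (enable_seqscan is a session-local planner toggle meant only for this kind of test, not for production use):

> -- Diagnostic only: discourage seq scans, time the alternative plan, then reset.
> SET enable_seqscan = off;
> EXPLAIN ANALYZE select codelemm, sectref, count(codelemm)
>     from indethom group by codelemm, sectref;
> SET enable_seqscan = on;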

Thanks for any help you all can offer!!

Erik Norvelle


