I think you hit the nail right on the head when you asked:
> I wonder if they have different encoding/collations.
[headdesk] Of course. One of the requirements of the upgrade was to change the database encoding to Unicode, because previously it was in an 8-bit encoding and we couldn't handle international text, which has become an absolute necessity. So when I restored the database, I took care to create it in Unicode first:
Old database (PC, 8.3.17):
Name | reports
Owner | reports
Encoding | ISO_8859_8
(Locale is C)
New database (server, 9.1.5):
Name | reports
Owner | reports
Encoding | UTF8
Collate | he_IL.UTF-8
Ctype | he_IL.UTF-8
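For reference, I created the new database roughly like this (from memory, so the exact statement may have differed; template0 is required when specifying a non-default encoding):

create database reports
    owner reports
    encoding 'UTF8'
    lc_collate 'he_IL.UTF-8'
    lc_ctype 'he_IL.UTF-8'
    template template0;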
Apparently, string comparison is heavily CPU-bound... Now, it seems the server is inferior to the PC in CPU-bound tasks. It's no wonder - the PC has a faster CPU, and each PostgreSQL backend runs on a single core, so the server's 4 cores are not an advantage. So, running the test you asked for:
> \timing on
> set work_mem = 16384;
> select count(distinct foo) from (select random() as foo from
> generate_series(1,100000000)) asdf;
I get
PC: Time: 554994.343 ms
Server: Time: 660577.789 ms
Which is slightly in favor of the PC, but still doesn't show anything like the discrepancy we see in index creation.
I must point out that the real problem is not the comparison with this PC's hardware. The database originally ran on the server, and it created these same indexes happily, within reasonable time, until the upgrade. The upgrade process involved shutting down PostgreSQL, moving all the database files and configuration over to the PC, and starting it there (running the PC under the old server's IP, so that all the clients now work with the PC as the production machine). Then we took the server, upgraded the operating system and PostgreSQL on it, created a dump from the PC, and restored it on the server.
So the situation is that performance is 4 times worse on the same hardware, which chugged along happily under the old operating system and the old PostgreSQL. And the PC is still chugging away happily during the archiving, albeit a bit more slowly (for I/O reasons - it is inferior to the server there). Anything disk-bound is done better on the server, while the PC has a slight CPU advantage.
So at this point I must conclude that string comparison is a much, much heavier task in UTF-8 than in an 8-bit encoding - or that the collation is the problem.
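If it's the collation rather than the encoding itself, 9.1's per-expression COLLATE clause should show it directly. Something like this would isolate it (just a sketch, not a test we've actually run; the md5() is only there to produce text values to sort):

-- default collation, inherited from the database (he_IL.UTF-8)
select count(distinct foo)
from (select md5(random()::text) as foo from generate_series(1,1000000)) asdf;

-- same sort, forced to the C collation (plain byte comparison)
select count(distinct foo collate "C")
from (select md5(random()::text) as foo from generate_series(1,1000000)) asdf;

If the second form is dramatically faster, then it's the he_IL.UTF-8 collation, not UTF-8 storage, that's hurting us.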
Running a different test, which involves string comparison, shows a bigger discrepancy:
select count(foo), foo
from (select cast(random() as varchar(14)) > cast(random() as varchar(14)) as foo
      from generate_series(1,100000000)) asdf
group by foo;
PC: Time: 308152.090 ms
Server: Time: 499631.553 ms
Finally, I created a test table, as you asked:
> create table foo as select msisdn,sme_reference from
> sms.billing__archive limit 1000000;
Then I created an index on the msisdn and sme_reference columns together (the statement is sketched below, after the timings). 99% of the data in the msisdn field consists of 11-digit phone numbers. Result:
PC: 5792.641 ms
Server: 23740.470 ms
Huge discrepancy there.
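For reference, the composite index was created along these lines (the index name here is mine, just for illustration):

create index foo_msisdn_sme_idx on foo (msisdn, sme_reference);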
Next, I dropped the index and created an index on the sme_reference column alone (which is an integer; again sketched below). The result:
PC: 2457.315 ms
Server: 3722.920 ms
Still a slight advantage for the PC, but not on the same order of magnitude as where strings are concerned.
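Again, roughly (illustrative names):

drop index foo_msisdn_sme_idx;
create index foo_sme_idx on foo (sme_reference);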
OK, if you agree with my conclusions, what should I do about it? I absolutely need this database to support Unicode.
Thanks a lot for the help so far!
Herouth