At 04:43 AM 12/6/2005, Joost Kraaijeveld wrote:
Hi,
Is it possible to get this query to run faster than it does now, by adding
indexes or changing the query?
SELECT customers.objectid FROM prototype.customers, prototype.addresses
WHERE
customers.contactaddress = addresses.objectid
ORDER BY zipCode asc, housenumber asc
LIMIT 1 OFFSET 283745
Explain:
Limit  (cost=90956.71..90956.71 rows=1 width=55)
  ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
        Sort Key: addresses.zipcode, addresses.housenumber
        ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
              Hash Cond: ("outer".contactaddress = "inner".objectid)
              ->  Seq Scan on customers  (cost=0.00..31392.15 rows=368915 width=80)
              ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
                    ->  Seq Scan on addresses  (cost=0.00..13675.15 rows=369315 width=55)
The customers table has an index on contactaddress and objectid.
The addresses table has an index on zipcode+housenumber and objectid.
TIA
Customer names, customers.objectid, addresses, and addresses.objectid
should all be static (addresses do not change, just the customers
associated with them; and once a customer has been assigned an id,
that id had better never change...).
To me, this sounds like the addresses and customers tables should be
duplicated and then physically laid out in sorted order: by
<tablename>.objectid in one set, and by the associated "human friendly"
string in the other set.
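Something along these lines, as a sketch only (the _by_id / _by_zip table
and index names are mine, for illustration; you'd do the same for
customers):

-- One copy physically written out in objectid order:
CREATE TABLE prototype.addresses_by_id AS
  SELECT * FROM prototype.addresses ORDER BY objectid;
CREATE INDEX addresses_by_id_idx
  ON prototype.addresses_by_id (objectid);

-- The other copy written out in "human friendly" order:
CREATE TABLE prototype.addresses_by_zip AS
  SELECT * FROM prototype.addresses ORDER BY zipcode, housenumber;
CREATE INDEX addresses_by_zip_idx
  ON prototype.addresses_by_zip (zipcode, housenumber);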
Then finding a specific <tablename>.objectid or its associated
string can be done in at worst O(lg n) time, assuming binary search,
instead of O(n) time for a sequential scan. If pg is clever enough,
it might be able to do better than that.
IOW, I'd try duplicating the addresses and customers tables and using
the appropriate CLUSTERed Index on each.
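E.g. (untested; the index names are made up, since the real ones weren't
posted):

-- Rewrite addresses in zipcode+housenumber order, using its existing index:
CLUSTER addresses_zip_hn_idx ON prototype.addresses;

-- Rewrite customers in objectid order, using its existing index:
CLUSTER customers_objectid_idx ON prototype.customers;

Note that CLUSTER is a one-shot physical reorder, not a maintained
property; it has to be re-run after enough inserts/updates to keep the
layout sorted.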
I know this breaks Normal Form. OTOH, this kind of thing is common
practice for data mining problems on static or almost static data.
Hope this is helpful,
Ron