Interesting Query Performance Question

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Any explanation for this is appreciated...

I had a query which just seemed to peg the processor. I decided to let it go all night, and after 16 hours, it was still stuck at 100% processor usage. I figured something must be wrong, so I tried a few things. Using the following logic, I was able to get the results I was looking for in about 1 minute (instead of 16+ hours of no result). It all came down to getting around the use of "NOT IN (subquery)" where "subquery" has 20,000,000 rows. "IN (subquery)" was quick as could be. Here's what I found. 

Tables:
------------------------------
|       a     |      b       |
--------------|---------------
|guid vc(32)  | guid vc(32)  |
|unitid vc(15)| unitid vc(15)|
|ftime ts     |              |
|source vc(10)|              |
------------------------------
|65,000,000 rw| 500,000 rows |
------------------------------

Here is the query that hung up for 16 hours:

A:
SELECT unitid,count(guid) AS total FROM b 
WHERE guid NOT IN (SELECT DISTINCT guid FROM a) AND
 unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid 
ORDER BY total DESC;

I thought I'd narrow down a.guid and a.unitid and substitute these for the subqueries.
B: SELECT DISTINCT unitid INTO a_u_unitid FROM a; (     1,082 rows)
C: SELECT DISTINCT guid   INTO a_u_guid   FROM a; (20,000,000 rows)

That didn't help on its own. So I decided to try the query without the "NOT". It finished in a minute or two. Granted, the result is the opposite of what I wanted, but at least I knew the major source of the problem. I still had the substituted queries B & C in, so it looked like this:

D:
SELECT unitid,count(guid) AS total FROM b 
WHERE guid IN (SELECT guid FROM a_u_guid) AND
 unitid IN (SELECT unitid FROM a_u_unitid) 
GROUP BY unitid 
ORDER BY total DESC;

Next, I decided to just get my "NOT" set of guid's from b itself. If my logic isn't correct here, someone let me know. I did this, and got results very quickly (a minute perhaps):

E:
SELECT unitid,count(guid) AS total FROM b 
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT guid FROM a_u_guid)) AND
 unitid IN (SELECT unitid FROM a_u_unitid) 
GROUP BY unitid 
ORDER BY total DESC;

Keep in mind a has 65,000,000 rows and b only has 500,000. The above query (E) gave me the original results I was looking for (E gives the expected results from A, if A had ever finished, assuming my logic is correct). I then decided to get rid of the intermediate tables B & C, so I was back to:

F:
SELECT unitid,count(guid) AS total FROM b 
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT DISTINCT guid FROM a) AND
 Unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid 
ORDER BY total DESC;

While F was running, I started typing this email. It still has not finished, but I have a feeling it will within another half hour or so. What I am taking away from this is that the use of "NOT IN (subquery)" where subquery has a large number of results will literally shut down performance and never give a result. Further, I am taking away that creating separate tables with just the DISTINCT values from a table with millions of rows also saves time. The problem here being that the separate table is not auto-updating when values in the master table change.

Query E then is apparently the way to go, but shouldn't there be a way to get the query planner to take these steps itself? If A had ever finished, I'd sure like to have seen an EXPLAIN ANALYZE on it. If F does finish, I made run an EXPLAIN ANALYZE on it versus E (I kind of expected F to finish while writing this email).

Thanks,
Jon

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux