Summary: the following query takes around 12 seconds on my test machine. On my production machine, it's at half an hour and counting. What's going on? Details: SELECT legfrom, count(*) as totaldeps, count(*) filter (where depdly<='00:00:00'::interval) as d0, count(*) filter (where depdly<='00:05:00'::interval) as d5, count(*) filter (where depdly<='00:14:00'::interval) as d14, count(*) filter (where depdly<='00:30:00'::interval) as d30, count(*) filter (where legto=destination) as totalarriv, count(*) filter (where arrdly<='00:00:00'::interval AND legto=destination) as a0, count(*) filter (where arrdly<='00:05:00'::interval AND legto=destination) as a5, count(*) filter (where arrdly<='00:14:00'::interval AND legto=destination) as a14, count(*) filter (where arrdly<='00:30:00'::interval AND legto=destination) as a30, count(*) filter (where legto!=destination) as div FROM (SELECT distinct on (legdetail.flightnum, legfrom, outtime) legdetail.flightnum, legfrom, legto, outtime, intime, depart_time, arrival_time, destination, outtime::time-oag_schedules.depart_time as depdly, intime::time-oag_Schedules.arrival_time as arrdly FROM oag_batches NATURAL JOIN oag_schedules INNER JOIN legdetail ON outtime::date BETWEEN oag_schedules.startdate AND coalesce(oag_schedules.enddate,'infinity') AND outtime::date BETWEEN oag_batches.eff_from AND coalesce(oag_batches.eff_to,'infinity') AND extract(isodow from outtime::date)=ANY(frequency) AND oag_schedules.flightnum::text=legdetail.flightnum AND oag_schedules.origin=legdetail.legfrom WHERE outtime::date>='2017-01-01' AND legdetail.flightnum~E'^[0-9]*$' AND legdetail.flightnum::integer < 8000 AND csom='S' ORDER BY legdetail.flightnum, legfrom, outtime, oag_batches.eff_from DESC) s1 GROUP BY legfrom ORDER BY legfrom; which, when run on my test server, has this explain analyze output: https://explain.depesz.com/s/4piv. Around 12 second runtime, which isn't too bad (in the grand scheme of things), although there is probably room for improvement. oag_batches and oag_schedules are postgres_fdw foreign tables in the same database cluster (different DB same host/server instance). oag_schedules has about 80,000 rows, legdetail has about 60,000 rows, and oag_batches has 125 rows, so while we are dealing with a fair amount of data, it's nothing extreme - I have tables with billions of rows that it handles just fine. With the query comfortably running on my test machine, I went ahead and tried it out on my production machine. And I waited. And waited. And waited some more. As of the writing of this line, pg_stat_activity shows that the query has been running for over 30 minutes. It also shows a status of "active", and nothing under the wait_event column. A second entry in pg_stat_activity shows the foreign data wrapper connection, with a state of "idle in transaction" and again nothing under the wait_event column. Top shows the two processes using around 62% and 45% CPU, with the rest of machine being 96% idle, so they are doing *something*, but not even enough to max out one of the cores, nor is the machine as a whole maxed out. The PostgreSQL log shows plenty of activity (and none of the users of the database are complaining), so apparently the database as a whole is working fine - it's just this one query that's not completing. Further, iotop shows no significant disk access - read is a solid 0.00 B/s, and write is only around 80 K/s, so it's not like it's waiting on disk i/o. Both test server and production are running PostgreSQL 9.6, although my test server is at 9.6.1 while production is at 9.6.5 (huh, production got ahead of test. I'll have to fix that. Oh well), and I did a dump from production to test using -C just before starting work on the query, so the datasets are pretty closely matched. The production server is configured to take full(er) advantage of the hardware than my test database, so if anything I would expect it to be faster. Hardware differences obviously can have an effect on the query speed, but I wouldn't think a difference of 12 seconds to over half an hour (and counting). So what could explain this huge discrepancy? One difference I can think of is that my production database is replicating (streaming replication), but since this is just a select, I can't think why that would make a difference... any thoughts? ----------------------------------------------- Israel Brewster Systems Analyst II Ravn Alaska 5245 Airport Industrial Rd Fairbanks, AK 99709 (907) 450-7293 ----------------------------------------------- |
BEGIN:VCARD VERSION:3.0 N:Brewster;Israel;;; FN:Israel Brewster ORG:Frontier Flying Service;MIS TITLE:PC Support Tech II EMAIL;type=INTERNET;type=WORK;type=pref:israel@xxxxxxxxxxxxxxxxxx TEL;type=WORK;type=pref:907-450-7293 item1.ADR;type=WORK;type=pref:;;5245 Airport Industrial Wy;Fairbanks;AK;99701; item1.X-ABADR:us CATEGORIES:General X-ABUID:36305438-95EA-4410-91AB-45D16CABCDDC\:ABPerson END:VCARD