Merge sort/postgis performance tweaking?

Israel Brewster <ijbrewster@xxxxxxxxxx> · Fri, 8 Nov 2019 14:27:51 -0900

I have the following query:
SELECT
	read_date,
	to_char(read_date,'YYYY-MM-DD') as text_date,
	CASE WHEN lat_2>lat_1 THEN
		ns_dist*-1
	ELSE
		ns_dist
	END as ns_dist,
	CASE WHEN lon_2>lon_1 THEN
		ew_dist*-1
	ELSE
		ew_dist
	END as ew_dist,
	CASE WHEN alt_2>alt_1 THEN
		(ud_dist*-1)::float
	ELSE
		ud_dist::float
	END as ud_dist,
	lat_e/10::float as lat_e,
	lon_e/10::float as lon_e,
	alt_e/10::float as alt_e,
        CASE WHEN rapid THEN
            1
        ELSE
            0
        END as rapid
FROM
	(SELECT
		g1.read_date as read_date,
		g1.rapid as rapid,
		g1.lat as lat_1,
		g1.lon as lon_1,
		g1.alt as alt_1,
		g2.lat as lat_2,
		g2.lon as lon_2,
		g2.alt as alt_2,
		ST_Distance(ST_SetSRID( ST_Point( g1.lon, g1.lat), 4326)::geography,
					ST_SetSRID( ST_Point( g1.lon, g2.lat), 4326)::geography)/1000 as ns_dist,
		ST_Distance(ST_SetSRID( ST_Point( g1.lon, g1.lat), 4326)::geography,
					ST_SetSRID( ST_Point( g2.lon, g1.lat), 4326)::geography)/1000 as ew_dist,
		g1.alt-g2.alt ud_dist,
		sqrt(g1.lat_e^2+g2.lat_e^2) as lat_e,
		sqrt(g1.lon_e^2+g2.lon_e^2) as lon_e,
		sqrt(g1.alt_e^2+g2.alt_e^2) as alt_e
	FROM gps_data g1
	INNER JOIN gps_data g2 ON g2.read_date=g1.read_date
	WHERE g1.station=%s AND g2.station=%s
	) g3
ORDER BY read_date;

Which, when executed on my system for a particular par of g1.station and g2.station takes about 190ms to run (explain output here: https://explain.depesz.com/s/8Nf9 ). This is not horrible, but when running this query for a dozen stations, it becomes noticeable. As such, I was wondering if I could do better. Looking at that explain output, I noticed that the largest time sink was the Merge Join at 114ms. Some research into that gave me the command "set enable_mergejoin=off” which I tried just for fun to see what would happen. 

Somewhat to my surprise, this actually shaved a good 80ms off the total run time (see https://explain.depesz.com/s/kpncZ). Looking at that, I see it does a Parallel Hash Join, which looks like it should be slower, but I guess since it can do it in parallel, it winds up faster? Not sure on that, but the overall time is definitely significantly faster. Which leaves me with a couple of questions:

1) Since not using a merge join is, in this case apparently better, is there some way I can tweak the query/settings such that it knows this without having to force the merge join off?

2) If not, are there any side effects/gotchas I should be aware of if I simply call that set command before each query (or, presumably before the series of queries)?

3) Is there anything else I can do to optimize the query? I have tried using a pre-calculated point column rather than calculating ST_Point( g1.lon, g1.lat), as well as providing “false” as the last argument to ST_Distance, but while these did speed things up some, it was only about 10ms or so, so not terribly significant.

If it helps, the goal behind the somewhat odd arguments to ST_Distance is to find the component North/South and East/West distances, rather than the actual total “point a to point b” distance. Perhaps there is a better way of obtaining that goal?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145