Does the plan change significantly with these settings?
set session work_mem='250MB';
set session geqo_threshold = 20;
set session join_collapse_limit = 20;
With that expensive sort spilling to disk and the aggregation happening after it, it seems that significantly increasing work_mem is what will make the critical difference. The exception would be if the data could be fetched already sorted via an index, but that doesn't seem likely.
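One way to confirm whether the sort still spills is to run EXPLAIN (ANALYZE, BUFFERS) at two work_mem values and compare the "Sort Method" line. The sketch below uses a synthetic generate_series query purely as a stand-in (it is not your query; substitute your own). When the sort exceeds work_mem, PostgreSQL reports something like "external merge  Disk: ...kB". When the sort fits, it reports "quicksort  Memory: ...kB".

```sql
-- Small work_mem: a 1M-row sort should spill
-- (look for "Sort Method: external merge  Disk: ...").
SET work_mem = '4MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g DESC;

-- Larger work_mem: the same sort should fit in memory
-- (look for "Sort Method: quicksort  Memory: ...").
SET work_mem = '250MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g DESC;

-- Return to the server/session default afterwards.
RESET work_mem;
```

Keep in mind that work_mem applies per sort or hash operation, not per query, so a large session-level value is safer than raising it globally.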
I would suggest increasing default_statistics_target, but your estimates are already good for the most part. Hopefully someone else will chime in with more.
Michael Lewis