After doing this testing, have since added some functionality in our ad hoc reporting tool to allow us to tune individual queries by turning on and off individual join types at runtime. As we hear of slow reports, we've been individually turning off the nested loops on those reports. Almost always, this has increased the performance of the reports, sometimes in a completely amazing fashion (many, many minutes to seconds at times). It of course doesn't help everything and turning off nested loops in general causes overall slowdown in other parts of the system.
As this has gone on over the last couple of weeks, it feels like we either have a misconfiguration on the server, or we are tickling a mis-estimation bug in the planner. I'm hoping it's the former. The db server has 8G of memory and raid1 -wal, raid10- data configuration, os is linux 2.6.9, db is 8.2.6. The db is a utf-8 db if that is of any bearing and autovac and bgwriter are on.
Nondefault settings of interest from postgresql.conf
shared_buffers = 1024MB # min 128kB or max_connections*16kB
work_mem = 256MB # min 64kB
maintenance_work_mem = 256MB # min 1MB
random_page_cost = 1.75 # same scale as above
effective_cache_size = 4096MB
default_statistics_target = 100 # range 1-1000
If nothing else, perhaps this will help somebody else who has run into the same problem. If explain analyze of a query shows a large mis-estimation of rows returned on a join (estimate=1, actual=2k) causing the planner to choose nested loops instead of another join type, you might try running the query with nested loops set to off and see if that helps w/ performance.
Thanks,
-Chris