Re: experiments in query optimization

Faheem Mitha <faheem@xxxxxxxxxxxxx> · Wed, 31 Mar 2010 15:40:45 +0530 (IST)

[If Kevin Grittner reads this, please fix your email address. I am 
getting bounces from your email address.]

On Tue, 30 Mar 2010, Robert Haas wrote:

On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote:
Sure, but define sane setting, please. I guess part of the point is that I'm
trying to keep memory low, and it seems this is not part of the planner's
priorities. That it, it does not take memory usage into consideration when
choosing a plan. If that it wrong, let me know, but that is my
understanding.

I don't understand quite why you're confused here.  We've already
explained to you that the planner will not employ a plan that uses
more than the amount of memory defined by work_mem for each sort or
hash.

Typical settings for work_mem are between 1MB and 64MB.  1GB is enormous.

I don't think I am confused. To be clear, when I said "it does not take 
memory usage into consideration' I was talking about overall memory usage. 
Let me summarize:

The planner will choose the plan with the minimum total cost, with the 
constraint that the number of memory used for each of certain steps is 
less than work_mem. In other words with k such steps it can use at most

k(plan)*work_mem

memory where k(plan) denotes that k is a function of the plan. (I'm 
assuming here that memory is not shared between the different steps). 
However, k(plan)*work_mem is not itself bounded. I fail to see how 
reducing work_mem significantly would help me. This would mean that the 
current plans I am using would likely be ruled out, and I would be left 
with plans which, by definition, would have larger cost and so longer run 
times. The current runtimes are already quite long - for the PED query, 
the best I can do with work_mem=1 GB is 2 1/2 hrs, and that is after 
splitting the query into two pieces.

I might actually be better off *increasing* the memory, since then the 
planner would have more flexibility to choose plans where the individual 
steps might require more memory, but the overall memory sum might be 
lower.

You might need to create some indices, too.

Ok. To what purpose? This query picks up everything from the
tables and the planner does table scans, so conventional wisdom
and indeed my experience, says that indexes are not going to be so
useful.

There are situations where scanning the entire table to build up a
hash table is more expensive than using an index.  Why not test it?

Certainly, but I don't know what you and Robert have in mind, and I'm not
experienced enough to make an educated guess. I'm open to specific
suggestions.

Try creating an index on geno on the columns that are being used for the join.

Ok, I'll try that. I guess the cols in question on geno are idlink_id and 
anno_id. I thought that I already had indexes on them, but no. Maybe I had 
indexes, but removed them.

If I understand the way this works, if you request, say an INNER JOIN, the 
planner can choose different ways/algorithms to do this, as in 
http://en.wikipedia.org/wiki/Join_(SQL)#Nested_loops . It may choose a 
hash join, or an nested loop join or something else, based on cost. If the 
indexes don't exist that may make the inner loop join more expensive, so 
tip the balance in favor of using a hash join. However, I have no way to 
control which option it chooses, short of disabling eg. the hash join 
option, which is not an option for production usage anyway. Correct?

                                                          Regards, Faheem.
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance