On Wed, Mar 31, 2010 at 6:10 AM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote:
>
> [If Kevin Grittner reads this, please fix your email address. I am getting bounces from your email address.]
>
> On Tue, 30 Mar 2010, Robert Haas wrote:
>
>> On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote:
>>>
>>> Sure, but define sane setting, please. I guess part of the point is that I'm trying to keep memory low, and it seems this is not part of the planner's priorities. That is, it does not take memory usage into consideration when choosing a plan. If that is wrong, let me know, but that is my understanding.
>>
>> I don't quite understand why you're confused here. We've already explained to you that the planner will not employ a plan that uses more than the amount of memory defined by work_mem for each sort or hash.
>>
>> Typical settings for work_mem are between 1MB and 64MB. 1GB is enormous.
>
> I don't think I am confused. To be clear, when I said "it does not take memory usage into consideration" I was talking about overall memory usage. Let me summarize:
>
> The planner will choose the plan with the minimum total cost, subject to the constraint that the amount of memory used by each of certain steps is less than work_mem. In other words, with k such steps it can use at most
>
>     k(plan) * work_mem
>
> memory, where k(plan) denotes that k is a function of the plan. (I'm assuming here that memory is not shared between the different steps.) However, k(plan) * work_mem is not itself bounded. I fail to see how reducing work_mem significantly would help me. This would mean that the current plans I am using would likely be ruled out, and I would be left with plans which, by definition, would have larger cost and so longer run times. The current runtimes are already quite long - for the PED query, the best I can do with work_mem = 1GB is 2 1/2 hrs, and that is after splitting the query into two pieces.
>
> I might actually be better off *increasing* the memory, since then the planner would have more flexibility to choose plans where the individual steps might require more memory, but the overall memory sum might be lower.

OK, your understanding is correct.

>>>>>> You might need to create some indices, too.
>>>>>
>>>>> Ok. To what purpose? This query picks up everything from the tables and the planner does table scans, so conventional wisdom, and indeed my experience, says that indexes are not going to be so useful.
>>>>
>>>> There are situations where scanning the entire table to build up a hash table is more expensive than using an index. Why not test it?
>>>
>>> Certainly, but I don't know what you and Robert have in mind, and I'm not experienced enough to make an educated guess. I'm open to specific suggestions.
>>
>> Try creating an index on geno on the columns that are being used for the join.
>
> Ok, I'll try that. I guess the cols in question on geno are idlink_id and anno_id. I thought that I already had indexes on them, but no. Maybe I had indexes, but removed them.
>
> If I understand the way this works, if you request, say, an INNER JOIN, the planner can choose different ways/algorithms to do this, as in http://en.wikipedia.org/wiki/Join_(SQL)#Nested_loops . It may choose a hash join, or a nested loop join, or something else, based on cost. If the indexes don't exist, that may make the nested loop join more expensive, and so tip the balance in favor of using a hash join. However, I have no way to control which option it chooses, short of disabling, e.g., the hash join option, which is not an option for production usage anyway. Correct?

Yep.
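For testing, though, disabling it in a single session is harmless: plain SET only affects your current session, so production queries are untouched. A rough sketch of what I'd try (the index name is just a suggestion, and I'm assuming idlink_id and anno_id really are the columns the join uses):

    -- Two-column index on the geno join columns.
    CREATE INDEX geno_idlink_anno_idx ON geno (idlink_id, anno_id);
    ANALYZE geno;

    -- In a throwaway session, discourage hash joins and compare the plan.
    SET enable_hashjoin = off;
    EXPLAIN ANALYZE
    SELECT 1;   -- stand-in: put the PED query here
    RESET enable_hashjoin;

Note that the enable_* settings don't remove a join method outright; they just penalize it heavily in the cost estimates, so treat them as diagnostic knobs rather than something to leave off in production.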
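On the work_mem side, you can at least see how close each individual sort or hash comes to the limit by running EXPLAIN ANALYZE at a couple of different settings and comparing the output; something like this (256MB is just an example value, and again session-local):

    SET work_mem = '256MB';
    EXPLAIN ANALYZE
    SELECT 1;   -- stand-in: put the PED query here
    RESET work_mem;

Sorts that spill show up as something like "Sort Method: external merge Disk: ..." rather than "quicksort Memory: ...", and hash joins that overflow report more than one batch. That tells you which steps are actually near the limit, which makes the k(plan) * work_mem worst case less of a guess.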
...Robert

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance