On Wed, Mar 31, 2010 at 6:10 AM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote:
>
> [If Kevin Grittner reads this, please fix your email address. I am getting bounces from your email address.]
>
> On Tue, 30 Mar 2010, Robert Haas wrote:
>
>> On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote:
>>>
>>> Sure, but define sane setting, please. I guess part of the point is that I'm trying to keep memory low, and it seems this is not part of the planner's priorities. That is, it does not take memory usage into consideration when choosing a plan. If that is wrong, let me know, but that is my understanding.
>>
>> I don't quite understand why you're confused here. We've already explained to you that the planner will not employ a plan that uses more than the amount of memory defined by work_mem for each sort or hash.
>>
>> Typical settings for work_mem are between 1MB and 64MB. 1GB is enormous.
>
> I don't think I am confused. To be clear, when I said "it does not take memory usage into consideration" I was talking about overall memory usage. Let me summarize:
>
> The planner will choose the plan with the minimum total cost, subject to the constraint that the amount of memory used by each of certain steps is less than work_mem. In other words, with k such steps it can use at most
>
>     k(plan) * work_mem
>
> memory, where k(plan) denotes that k is a function of the plan. (I'm assuming here that memory is not shared between the different steps.) However, k(plan) * work_mem is not itself bounded. I fail to see how reducing work_mem significantly would help me. This would mean that the current plans I am using would likely be ruled out, and I would be left with plans which, by definition, would have larger cost and so longer run times. The current runtimes are already quite long - for the PED query, the best I can do with work_mem = 1GB is 2 1/2 hrs, and that is after splitting the query into two pieces.
>
> I might actually be better off *increasing* the memory, since then the planner would have more flexibility to choose plans where the individual steps might require more memory, but the overall memory sum might be lower.

OK, your understanding is correct.

>>>>>> You might need to create some indices, too.
>>>>>
>>>>> Ok. To what purpose? This query picks up everything from the tables and the planner does table scans, so conventional wisdom, and indeed my experience, says that indexes are not going to be so useful.
>>>>
>>>> There are situations where scanning the entire table to build up a hash table is more expensive than using an index. Why not test it?
>>>
>>> Certainly, but I don't know what you and Robert have in mind, and I'm not experienced enough to make an educated guess. I'm open to specific suggestions.
>>
>> Try creating an index on geno on the columns that are being used for the join.
>
> Ok, I'll try that. I guess the cols in question on geno are idlink_id and anno_id. I thought that I already had indexes on them, but no. Maybe I had indexes, but removed them.
>
> If I understand the way this works, if you request, say, an INNER JOIN, the planner can choose different ways/algorithms to do this, as in http://en.wikipedia.org/wiki/Join_(SQL)#Nested_loops . It may choose a hash join, or a nested loop join, or something else, based on cost. If the indexes don't exist, that may make the nested loop join more expensive, and so tip the balance in favor of using a hash join. However, I have no way to control which option it chooses, short of disabling, e.g., the hash join option, which is not an option for production usage anyway. Correct?

Yep.
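For testing, though, disabling it in a single session is harmless: plain SET only affects your current session, so production queries are untouched. A rough sketch of what I'd try (the index name is just a suggestion, and I'm assuming idlink_id and anno_id really are the columns the join uses):

    -- Two-column index on the geno join columns.
    CREATE INDEX geno_idlink_anno_idx ON geno (idlink_id, anno_id);
    ANALYZE geno;

    -- In a throwaway session, discourage hash joins and compare the plan.
    SET enable_hashjoin = off;
    EXPLAIN ANALYZE
    SELECT 1;   -- stand-in: put the PED query here
    RESET enable_hashjoin;

Note that the enable_* settings don't remove a join method outright; they just penalize it heavily in the cost estimates, so treat them as diagnostic knobs rather than something to leave off in production.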
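On the work_mem side, you can at least see how close each individual sort or hash comes to the limit by running EXPLAIN ANALYZE at a couple of different settings and comparing the output; something like this (256MB is just an example value, and again session-local):

    SET work_mem = '256MB';
    EXPLAIN ANALYZE
    SELECT 1;   -- stand-in: put the PED query here
    RESET work_mem;

Sorts that spill show up as something like "Sort Method: external merge Disk: ..." rather than "quicksort Memory: ...", and hash joins that overflow report more than one batch. That tells you which steps are actually near the limit, which makes the k(plan) * work_mem worst case less of a guess.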
...Robert

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance