Re: Internal operations when the planner makes a hash join.

"Igor Neyman" <ineyman@xxxxxxxxxxxxxx> · Wed, 24 Feb 2010 09:46:31 -0500

> -----Original Message-----
> From: negora [mailto:negora@xxxxxxxxxx] 
> Sent: Tuesday, February 23, 2010 4:33 PM
> To: Scott Carey
> Cc: Alvaro Herrera; pgsql-performance@xxxxxxxxxxxxxx
> Subject: Re: Internal operations when the planner makes a hash join.
> 
> Thank you for explaining me the internal behaviour of the 
> PostgreSQL engine. I'll try to look for more information 
> about that hash tables. It sounds really really interesting. 
> Your information was very useful.
> 
> The origin of my doubt resides in the fact that I need to do 
> a joint between 3 HUGE tables (millions of registries) and do 
> certain operations with the retrieved information. I was 
> deciding whether to use one SELECT with 3 JOINs, as I've been 
> doing since the beginning, or build a PL/PgSQL function based 
> on 3 nested "FOR ... IN SELECT ... LOOP" 
> structures which tried to minimize the subsequent table 
> searches storing intermediate useful data in arrays 
> (curiously, these would act as the hash tables which you 
> mention, but in a very very rudimentary way). In a case like 
> this one (possibly unable to fit in RAM), Is also JOIN the 
> best solution?
> 
> Since I've to retrieve such a big amount of columns and 
> crossed registries I had started to think that using 1 SELECT 
> with 3 JOINs would increase the number of table searches a 
> LOT and also "duplicate" the information too much. I mean 
> "duplicate" as in this case, where the Factor 1 appears 
> millions of times for every Element:
> 
> Element 1 | Sub-factor 1 | Factor 1
> Element 2 | Subf-actor 1 | Factor 1
> ...
> Element 12639747465586 | Sub-factor 1 | Factor 1 Element 1 | 
> Sub-factor 2 | Factor 1
> 
> I hope not to robber you much time but... What do you think 
> about it? Is it better either 1 SELECT with 3 JOINs or build 
> nested "FOR ... IN SELECT ... LOOP" structures? Could it be 
> one of that cases in which I've to choose between either 
> higher speed but higher memory consume (3
> JOINs) or lower speed but less memory expense (3 FORs)?
> 
> Thanks again and apologizes for extending this topic too much.
> 
> 
> Scott Carey wrote:
> > On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote:
> >
> >   
> >> negora wrote:
> >>
> >>     
> >>> According to how I understood the process, the engine 
> would get the 
> >>> name from the student with ID 1 and would look for the 
> name of the 
> >>> father with ID 1 in the hashed table. It'd do exactly the 
> same with 
> >>> the student #2 and father #2. But my big doubt is about 
> the 3rd one 
> >>> (Anthony). Would the engine "know" that it already had 
> retrieved the 
> >>> father's name for the student 1 and would avoid searching for it 
> >>> into the hashed table (using some kind of internal 
> mechanism which 
> >>> allows to "re-utilize" the name)? Or would it search into 
> the hashed 
> >>> table again?<br>
> >>>       
> >> The hash table is searched again.  But that's fast, because it's a 
> >> hash table.
> >>
> >>     
> >
> > To answer the question another way, "remembering" that it 
> has already seen father A once and tracking that would use a 
> hash table to remember that fact.  
> >
> > The hash table created by the first scan IS the "remember 
> you have seen this father" data structure, optimized for fast 
> lookup.  So before even looking at the first student, the 
> hash table is built so that it is fast to find out if a 
> father has been seen before, and if so where that father's 
> data is located.  Looking this data up is often referred to 
> as a "probe" and not a "scan" because it takes just as long 
> to do if the hash table has 100 entries or 10000 entries.  
> The drawback is that the whole thing has to fit in RAM.
> >
> >
> >   
> >> -- 
> >> Alvaro Herrera                                
> http://www.CommandPrompt.com/
> >> The PostgreSQL Company - Command Prompt, Inc.
> >>
> >> --
> >> Sent via pgsql-performance mailing list 
> >> (pgsql-performance@xxxxxxxxxxxxxx)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-performance
> >>     
> >
> >
> >   
> 

So, you are trying to do "nested loop" in PL/PgSQL.
Why not let optimizer decide between "nested loop" and "hash join" based
on your memory settings and statistics collected for objects involved?
I'm pretty sure, it'll be faster than PL/PgSQL 3 nested loops.

Igor Neyman

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance