> -----Original Message----- > From: negora [mailto:negora@xxxxxxxxxx] > Sent: Tuesday, February 23, 2010 4:33 PM > To: Scott Carey > Cc: Alvaro Herrera; pgsql-performance@xxxxxxxxxxxxxx > Subject: Re: Internal operations when the planner makes a hash join. > > Thank you for explaining me the internal behaviour of the > PostgreSQL engine. I'll try to look for more information > about that hash tables. It sounds really really interesting. > Your information was very useful. > > The origin of my doubt resides in the fact that I need to do > a joint between 3 HUGE tables (millions of registries) and do > certain operations with the retrieved information. I was > deciding whether to use one SELECT with 3 JOINs, as I've been > doing since the beginning, or build a PL/PgSQL function based > on 3 nested "FOR ... IN SELECT ... LOOP" > structures which tried to minimize the subsequent table > searches storing intermediate useful data in arrays > (curiously, these would act as the hash tables which you > mention, but in a very very rudimentary way). In a case like > this one (possibly unable to fit in RAM), Is also JOIN the > best solution? > > Since I've to retrieve such a big amount of columns and > crossed registries I had started to think that using 1 SELECT > with 3 JOINs would increase the number of table searches a > LOT and also "duplicate" the information too much. I mean > "duplicate" as in this case, where the Factor 1 appears > millions of times for every Element: > > Element 1 | Sub-factor 1 | Factor 1 > Element 2 | Subf-actor 1 | Factor 1 > ... > Element 12639747465586 | Sub-factor 1 | Factor 1 Element 1 | > Sub-factor 2 | Factor 1 > > I hope not to robber you much time but... What do you think > about it? Is it better either 1 SELECT with 3 JOINs or build > nested "FOR ... IN SELECT ... LOOP" structures? Could it be > one of that cases in which I've to choose between either > higher speed but higher memory consume (3 > JOINs) or lower speed but less memory expense (3 FORs)? > > Thanks again and apologizes for extending this topic too much. > > > Scott Carey wrote: > > On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote: > > > > > >> negora wrote: > >> > >> > >>> According to how I understood the process, the engine > would get the > >>> name from the student with ID 1 and would look for the > name of the > >>> father with ID 1 in the hashed table. It'd do exactly the > same with > >>> the student #2 and father #2. But my big doubt is about > the 3rd one > >>> (Anthony). Would the engine "know" that it already had > retrieved the > >>> father's name for the student 1 and would avoid searching for it > >>> into the hashed table (using some kind of internal > mechanism which > >>> allows to "re-utilize" the name)? Or would it search into > the hashed > >>> table again?<br> > >>> > >> The hash table is searched again. But that's fast, because it's a > >> hash table. > >> > >> > > > > To answer the question another way, "remembering" that it > has already seen father A once and tracking that would use a > hash table to remember that fact. > > > > The hash table created by the first scan IS the "remember > you have seen this father" data structure, optimized for fast > lookup. So before even looking at the first student, the > hash table is built so that it is fast to find out if a > father has been seen before, and if so where that father's > data is located. Looking this data up is often referred to > as a "probe" and not a "scan" because it takes just as long > to do if the hash table has 100 entries or 10000 entries. > The drawback is that the whole thing has to fit in RAM. > > > > > > > >> -- > >> Alvaro Herrera > http://www.CommandPrompt.com/ > >> The PostgreSQL Company - Command Prompt, Inc. > >> > >> -- > >> Sent via pgsql-performance mailing list > >> (pgsql-performance@xxxxxxxxxxxxxx) > >> To make changes to your subscription: > >> http://www.postgresql.org/mailpref/pgsql-performance > >> > > > > > > > So, you are trying to do "nested loop" in PL/PgSQL. Why not let optimizer decide between "nested loop" and "hash join" based on your memory settings and statistics collected for objects involved? I'm pretty sure, it'll be faster than PL/PgSQL 3 nested loops. Igor Neyman -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance