Re: Internal operations when the planner makes a hash join.

negora <negora@xxxxxxxxxx> · Tue, 23 Feb 2010 22:33:24 +0100

Thank you for explaining me the internal behaviour of the PostgreSQL 
engine. I'll try to look for more information about that hash tables. It 
sounds really really interesting. Your information was very useful.

The origin of my doubt resides in the fact that I need to do a joint 
between 3 HUGE tables (millions of registries) and do certain operations 
with the retrieved information. I was deciding whether to use one SELECT 
with 3 JOINs, as I've been doing since the beginning, or build a 
PL/PgSQL function based on 3 nested "FOR ... IN SELECT ... LOOP" 
structures which tried to minimize the subsequent table searches storing 
intermediate useful data in arrays (curiously, these would act as the 
hash tables which you mention, but in a very very rudimentary way). In a 
case like this one (possibly unable to fit in RAM), Is also JOIN the 
best solution?

Since I've to retrieve such a big amount of columns and crossed 
registries I had started to think that using 1 SELECT with 3 JOINs would 
increase the number of table searches a LOT and also "duplicate" the 
information too much. I mean "duplicate" as in this case, where the 
Factor 1 appears millions of times for every Element:

Element 1 | Sub-factor 1 | Factor 1
Element 2 | Subf-actor 1 | Factor 1
...
Element 12639747465586 | Sub-factor 1 | Factor 1
Element 1 | Sub-factor 2 | Factor 1

I hope not to robber you much time but... What do you think about it? Is 
it better either 1 SELECT with 3 JOINs or build nested "FOR ... IN 
SELECT ... LOOP" structures? Could it be one of that cases in which I've 
to choose between either higher speed but higher memory consume (3 
JOINs) or lower speed but less memory expense (3 FORs)?

Thanks again and apologizes for extending this topic too much.

Scott Carey wrote:
On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote:

negora wrote:

According to how I understood the process, the engine would get the
name from the student with ID 1 and would look for the name of the
father with ID 1 in the hashed table. It'd do exactly the same with the
student #2 and father #2. But my big doubt is about the 3rd one
(Anthony). Would the engine "know" that it already had retrieved the
father's name for the student 1 and would avoid searching for it into
the hashed table (using some kind of internal mechanism which allows to
"re-utilize" the name)? Or would it search into the hashed table again?<br>

The hash table is searched again.  But that's fast, because it's a hash
table.

To answer the question another way, "remembering" that it has already seen father A once and tracking that would use a hash table to remember that fact.  

The hash table created by the first scan IS the "remember you have seen this father" data structure, optimized for fast lookup.  So before even looking at the first student, the hash table is built so that it is fast to find out if a father has been seen before, and if so where that father's data is located.  Looking this data up is often referred to as a "probe" and not a "scan" because it takes just as long to do if the hash table has 100 entries or 10000 entries.  The drawback is that the whole thing has to fit in RAM.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance