On Wednesday 06 February 2008, Simon Riggs wrote:
> For me, it would be good to see a --parallel=n parameter that would
> allow pg_loader to distribute rows in "round-robin" manner to "n"
> different concurrent COPY statements. i.e. a non-routing version.

What happens when you want at most N parallel threads and have several sections configured: do you want pgloader to serialize section loading (often there's one section per table, though sometimes different sections target the same table) but parallelize the loading of each section?

I'm thinking we should have a global max_threads knob *and* a per-section max_threads one if we want to go this way, but then multi-threaded sections will somewhat fight against other sections (multi-threaded or not) for the threads to use. So I'll also add a parameter to configure how many sections (at most) to load in parallel at any given time. We'd then have (default values shown):

  max_threads = 1
  max_parallel_sections = 1
  section_threads = -1

The section_threads parameter would be overridable at the section level, but would have to stay <= max_threads (if not, it is discarded and a warning is issued). When section_threads is -1, pgloader tries to use as many threads as possible, still within the global max_threads limit.

If max_parallel_sections is -1, pgloader starts a new thread for each new section, maxing out at max_threads, then waits for a thread to finish before launching the loading of the next section.

If you have N max_threads and max_parallel_sections = section_threads = -1, we'll see some kind of fight between new-section threads and in-section threads (the parallel non-routing COPY behaviour). But then that's the user's choice.

Adding constraint_exclusion support on top of this would not mess it up, but it will only be of interest when section_threads != 1 and max_threads > 1.

> Making
> that work well, whilst continuing to do error-handling seems like a
> challenge, but a very useful goal.
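As a rough sketch, the proposed knobs could sit in the configuration file like this (these parameters are proposals, not existing pgloader options, and the section names are hypothetical):

```ini
; global settings
[pgsql]
max_threads           = 4
max_parallel_sections = 2
section_threads       = -1

; one data section per table; section_threads overridden locally,
; still capped by the global max_threads
[orders]
section_threads       = 2
```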
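A minimal sketch of the non-routing round-robin dispatch Simon describes, assuming one queue per worker and one COPY session per worker (all names here are illustrative, not pgloader's actual API):

```python
import threading
import queue

def dispatch(rows, n_workers, copy_row):
    """Hand each input row to one of n_workers queues in turn;
    each worker would feed its own COPY session via copy_row."""
    queues = [queue.Queue(maxsize=1000) for _ in range(n_workers)]

    def worker(q):
        while True:
            row = q.get()
            if row is None:        # sentinel: no more rows for this worker
                break
            copy_row(row)          # stand-in for writing to a COPY stream

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()

    for i, row in enumerate(rows):
        queues[i % n_workers].put(row)   # round-robin, no routing logic

    for q in queues:
        q.put(None)                      # tell each worker to stop
    for t in threads:
        t.join()
```

The point of the round-robin version is precisely that no routing decision is made per row, so the dispatcher stays cheap and the COPY sessions do the work in parallel.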
Quick tests showed me that the Python threading model easily allows sharing objects between several threads, so I don't think I'll need to adjust my reject code when going per-section multi-threaded. I'll just have to use a semaphore object so that rejected rows keep being written one line at a time. Not that complex, provided it's reliable.

> Adding intelligence to the row distribution may be technically hard but
> may also simply move the bottleneck onto pg_loader. We may need multiple
> threads in pg_loader, or we may just need multiple sessions from
> pg_loader. Experience from doing the non-routing parallel version may
> help in deciding whether to go for the routing version.

If non-routing per-section multi-threading is a user request and not that hard to implement (thanks to Python), that sounds like a good enough reason for me to provide it :)

I'll keep you (and the list) informed as soon as I have some code to play with.
--
dim
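The "one rejected line at a time" idea could be as simple as a shared lock around the reject-log write (a hypothetical sketch, not pgloader's actual reject code):

```python
import threading

class RejectLog:
    """Shared by all section threads; the lock serializes writes
    so rejected rows land in the file one complete line at a time."""

    def __init__(self, path):
        self.lock = threading.Lock()
        self.path = path

    def reject(self, line, reason):
        with self.lock:                      # one writer at a time
            with open(self.path, "a") as f:
                f.write("%s\t%s\n" % (reason, line))
```

Since CPython serializes object access internally anyway, the lock here only has to guarantee that two threads don't interleave partial lines in the reject file.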