On Wednesday 06 February 2008, Simon Riggs wrote:
> For me, it would be good to see a --parallel=n parameter that would
> allow pg_loader to distribute rows in "round-robin" manner to "n"
> different concurrent COPY statements. i.e. a non-routing version.

What happens when you want at most N parallel threads and have several sections configured: do you want pgloader to serialize section loading (often there's one section per table, though sometimes different sections target the same table) but parallelize the loading of each section?

I'm thinking we should have a global max_threads knob *and* a per-section max_threads one if we want to go this way, but then multi-threaded sections will somewhat fight against other sections (multi-threaded or not) for the threads to use. So I'll also add a parameter to configure how many sections (at most) to load in parallel at any given time. We'd then have (default values shown):

  max_threads = 1
  max_parallel_sections = 1
  section_threads = -1

The section_threads parameter would be overridable at the section level, but would have to stay <= max_threads (if not, it is discarded and a warning is issued). When section_threads is -1, pgloader tries to use as many threads as possible, still within the global max_threads limit.

If max_parallel_sections is -1, pgloader starts a new thread for each new section, maxing out at max_threads, then waits for a thread to finish before launching the loading of the next section.

If you have N max_threads and max_parallel_sections = section_threads = -1, we'll see some kind of fight between new-section threads and in-section threads (the parallel non-routing COPY behaviour). But then that's the user's choice.

Adding constraint_exclusion support on top of this would not mess it up, but it will only be of interest when section_threads != 1 and max_threads > 1.

> Making
> that work well, whilst continuing to do error-handling seems like a
> challenge, but a very useful goal.
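As a rough sketch, the proposed knobs could sit in the configuration file like this (these parameters are proposals, not existing pgloader options, and the section names are hypothetical):

```ini
; global settings
[pgsql]
max_threads           = 4
max_parallel_sections = 2
section_threads       = -1

; one data section per table; section_threads overridden locally,
; still capped by the global max_threads
[orders]
section_threads       = 2
```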
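A minimal sketch of the non-routing round-robin dispatch Simon describes, assuming one queue per worker and one COPY session per worker (all names here are illustrative, not pgloader's actual API):

```python
import threading
import queue

def dispatch(rows, n_workers, copy_row):
    """Hand each input row to one of n_workers queues in turn;
    each worker would feed its own COPY session via copy_row."""
    queues = [queue.Queue(maxsize=1000) for _ in range(n_workers)]

    def worker(q):
        while True:
            row = q.get()
            if row is None:        # sentinel: no more rows for this worker
                break
            copy_row(row)          # stand-in for writing to a COPY stream

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()

    for i, row in enumerate(rows):
        queues[i % n_workers].put(row)   # round-robin, no routing logic

    for q in queues:
        q.put(None)                      # tell each worker to stop
    for t in threads:
        t.join()
```

The point of the round-robin version is precisely that no routing decision is made per row, so the dispatcher stays cheap and the COPY sessions do the work in parallel.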
Quick tests showed me that the Python threading model easily allows sharing objects between several threads, so I don't think I'll need to adjust my reject code when going per-section multi-threaded. I'll just have to use a semaphore object so that rejected rows keep being written one line at a time. Not that complex, provided it's reliable.

> Adding intelligence to the row distribution may be technically hard but
> may also simply move the bottleneck onto pg_loader. We may need multiple
> threads in pg_loader, or we may just need multiple sessions from
> pg_loader. Experience from doing the non-routing parallel version may
> help in deciding whether to go for the routing version.

If non-routing per-section multi-threading is a user request and not that hard to implement (thanks to Python), that sounds like a good enough reason for me to provide it :)

I'll keep you (and the list) informed as soon as I have some code to play with.
--
dim
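The "one rejected line at a time" idea could be as simple as a shared lock around the reject-log write (a hypothetical sketch, not pgloader's actual reject code):

```python
import threading

class RejectLog:
    """Shared by all section threads; the lock serializes writes
    so rejected rows land in the file one complete line at a time."""

    def __init__(self, path):
        self.lock = threading.Lock()
        self.path = path

    def reject(self, line, reason):
        with self.lock:                      # one writer at a time
            with open(self.path, "a") as f:
                f.write("%s\t%s\n" % (reason, line))
```

Since CPython serializes object access internally anyway, the lock here only has to guarantee that two threads don't interleave partial lines in the reject file.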