On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
> Hi,
>
> On Monday 04 February 2008, Jignesh K. Shah wrote:
> > Single stream loader of PostgreSQL takes hours to load data. (Single
> > stream load... wasting all the extra cores out there)
>
> I wanted to work on this at the pgloader level, so CVS version of pgloader is
> now able to load data in parallel, with a python thread per configured
> section (1 section = 1 data file = 1 table is often the case).
> Not configurable at the moment, but I plan on providing a "threads" knob which
> will default to 1, and could be -1 for "as many thread as sections".

That sounds great. I was just thinking of asking for that :-)

I'll look at COPY FROM internals to make this faster. I'm looking at
this now to refresh my memory; I already had some plans on the shelf.

> > Multiple table loads ( 1 per table) spawned via script is bit better
> > but hits wal problems.
>
> pgloader will too hit the WAL problem, but it still may have its benefits, or
> at least we will soon (you can already if you take it from CVS) be able to
> measure if the parallel loading at the client side is a good idea perf. wise.

Should be able to reduce lock contention, but not overall WAL volume.

> [...]
> > I have not even started Partitioning of tables yet since with the
> > current framework, you have to load the tables separately into each
> > tables which means for the TPC-H data you need "extra-logic" to take
> > that table data and split it into each partition child table. Not stuff
> > that many people want to do by hand.
>
> I'm planning to add ddl-partitioning support to pgloader:
>   http://archives.postgresql.org/pgsql-hackers/2007-12/msg00460.php
>
> The basic idea is for pgloader to ask PostgreSQL about constraint_exclusion,
> pg_inherits and pg_constraint and if pgloader recognize both the CHECK
> expression and the datatypes involved, and if we can implement the CHECK in
> python without having to resort to querying PostgreSQL, then we can run a
> thread per partition, with as many COPY FROM running in parallel as there are
> partition involved (when threads = -1).
>
> I'm not sure this will be quicker than relying on PostgreSQL trigger or rules
> as used for partitioning currently, but ISTM Jignesh quoted § is just about
> that.

Much better than triggers and rules, but it will be hard to get it to
work.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
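
To make the thread-per-partition idea concrete, here is a rough sketch of
what client-side routing might look like. It is not pgloader code: the
partition names, the date ranges and the route()/load() helpers are all
made up for illustration, and the worker only collects rows in memory where
a real loader would run one COPY FROM per child table. It also assumes the
CHECK constraints are simple date ranges that can be evaluated in Python.

```python
import queue
import threading
from datetime import date

# Hypothetical partition layout: child table -> [lower, upper) bounds,
# standing in for CHECK constraints read from pg_constraint.
PARTITIONS = {
    "orders_1997": (date(1997, 1, 1), date(1998, 1, 1)),
    "orders_1998": (date(1998, 1, 1), date(1999, 1, 1)),
}


def route(order_date):
    """Evaluate the range checks in Python; return the child table or None."""
    for child, (lower, upper) in PARTITIONS.items():
        if lower <= order_date < upper:
            return child
    return None


def worker(child, rows, results):
    """Drain one partition's queue. A real loader would feed a COPY FROM here."""
    batch = []
    while True:
        row = rows.get()
        if row is None:  # sentinel: no more rows for this partition
            break
        batch.append(row)
    results[child] = batch


def load(rows):
    """Route each (id, order_date) row to its partition, one thread per child."""
    queues = {child: queue.Queue() for child in PARTITIONS}
    results = {}
    rejected = []
    threads = [
        threading.Thread(target=worker, args=(child, q, results))
        for child, q in queues.items()
    ]
    for t in threads:
        t.start()
    for row in rows:
        child = route(row[1])
        if child is None:
            rejected.append(row)  # matches no partition constraint
        else:
            queues[child].put(row)
    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
    return results, rejected
```

Rows that match no range end up in the rejected list rather than being
dropped, which is roughly where a loader would want to log them.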