hi,

I would recommend converting the input with some scripts into pg_dump
format and importing it into small temporary tables without indexes;
once all the data is in the db, you can partition it.

Alternatively, you can pre-partition your data with a simple grep, so
that you can import it directly into the partitioned schema.
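Something like this is what I mean by the staging-table route (an
untested sketch; the staging table and file names are made up, the
column layout is taken from Rick's mail below):

-- staging table: no indexes, no trigger, no CHECK constraints
CREATE TABLE allcalls_staging (
    phonenum    bigint,
    callstarted timestamp without time zone,
    status      int,
    attempts    int,
    duration    int
);

-- one bulk COPY, with no per-row trigger or index overhead
COPY allcalls_staging FROM '/data/allcalls-csv/allcalls.csv' WITH CSV;

-- then route the rows into the partitions in bulk, bypassing the
-- trigger; create the indexes only after this step
INSERT INTO allcalls_0 SELECT * FROM allcalls_staging
    WHERE phonenum < 1000000000;
INSERT INTO allcalls_1 SELECT * FROM allcalls_staging
    WHERE phonenum >= 1000000000 AND phonenum < 2000000000;
-- ...and so on for the remaining partitions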
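And the pre-partitioning with grep can be as simple as this (again
untested; it assumes phonenum is the first CSV column, as in Rick's
sample data):

# phonenum < 1000000000: at most 9 digits before the first comma
grep -E '^[0-9]{1,9},' allcalls.csv > allcalls_0.csv
# 1000000000 <= phonenum < 2000000000: 10 digits starting with 1
grep -E '^1[0-9]{9},' allcalls.csv > allcalls_1.csv
# ...and so on up to allcalls_9.csv, then COPY each chunk straight
# into its partition, e.g.:
#   COPY allcalls_1 FROM '/data/allcalls-csv/allcalls_1.csv' WITH CSV;

(A single awk pass, e.g.
awk -F, '{ print > ("allcalls_" int($1/1000000000) ".csv") }' allcalls.csv
would avoid reading the 16Gb file ten times.)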
kirk

Thom Brown wrote:
> On 28 March 2010 18:33, Rick Casey <caseyrick@xxxxxxxxx> wrote:
>
>     After careful research, I would like to post the following problem
>     I'm having with the importing of a large (16Gb) CSV file. Here is a
>     brief synopsis:
>     - this is running on Postgres (PG) version: PostgreSQL 8.3.9 on
>       i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Ubuntu
>       4.3.2-1ubuntu11) 4.3.2
>     - it is running on an Ubuntu (small) server instance at Amazon Web
>       Services (AWS), with a 320Gb volume mounted for the PG data
>       directory
>     - the database was created using the partitioning example in the
>       documentation, with an insert trigger and a function that directs
>       records to the right table
>     (see below for the code that created my tables and trigger)
>
>     After some days of attempting to import the full 16Gb CSV file, I
>     decided to split it up using the split utility in Linux. This
>     seemed to improve things; once I had split the CSV file into files
>     of about 10Mb, I finally got my first successful import, of about
>     257,000 recs. However, this is going to be a rather labor-intensive
>     process for the full 16Gb file if I have to split it up manually
>     and import each smaller file separately.
>
>     So, I am wondering if there is any way to optimize this process? I
>     have been using Postgres for several years, but have never had to
>     partition or optimize it for files of this size until now.
>     Any comments or suggestions would be most welcome from this
>     excellent forum.
>
>     (I might add that I spent several weeks prior to this trying to get
>     this to work in MySQL, which I finally had to abandon.)
>
>     Sincerely,
>     Rick
>
>     Details of the code follow:
>
>     Here is the basic COPY command, which I run as the postgres user,
>     to import the CSV files:
>     <begin>
>     COPY allcalls FROM '/data/allcalls-csv/sub3ab' WITH CSV;
>
>     Here is what some sample data look like in the files:
>     3153371867,2008-02-04 16:11:00,1009,1,40
>     2125673062,2008-02-04 16:11:00,1009,1,41
>     5183562377,2008-02-04 16:11:00,1009,1,50
>     ...
>
>     Here are the basic scripts that created the partitioned table and
>     insert trigger:
>     CREATE TABLE allcalls (
>         phonenum bigint,
>         callstarted timestamp without time zone,
>         status int,
>         attempts int,
>         duration int
>     );
>     CREATE TABLE allcalls_0 (
>         CHECK ( phonenum < 1000000000 )
>     ) INHERITS (allcalls);
>     ...(repeat this 9 more times, for 10 subpartition tables)
>
>     CREATE INDEX allcalls_0_phonenum ON allcalls_0 (phonenum);
>     ...(repeat this 9 more times, for indexes on the 10 subpartition
>     tables)
>
>     CREATE OR REPLACE FUNCTION allcalls_insert_trigger()
>     RETURNS TRIGGER AS $$
>     BEGIN
>         IF ( NEW.phonenum < 1000000000 ) THEN
>             INSERT INTO allcalls_0 VALUES (NEW.*);
>         ELSIF ( NEW.phonenum >= 1000000000 AND
>                 NEW.phonenum < 2000000000 ) THEN
>             INSERT INTO allcalls_1 VALUES (NEW.*);
>     ...(again, repeat for the rest of the partition tables)
>
>     CREATE TRIGGER insert_phonenum_trigger
>     BEFORE INSERT ON allcalls
>     FOR EACH ROW EXECUTE PROCEDURE allcalls_insert_trigger();
>     <end>
>
> The problem here is that you appear to require an index update, trigger
> firing and constraint check for every single row. The first thing I'd
> suggest is to remove the indexes and apply them after your import;
> otherwise it has to update the index for every single entry. The
> trigger won't help either: import into a single table and split it out
> into further tables afterwards if required. And finally, the constraint
> should probably be applied afterwards too, so cull any violating rows
> after importing.
>
> Thom

--
Nagy Zoltan (kirk) <kirk@xxxxxxxx>
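ps: if you go with Thom's single-table route, the post-import cleanup
could look roughly like this for each partition (untested; shown for
allcalls_0 only):

-- cull the rows that fall outside this partition's range
DELETE FROM allcalls_0 WHERE phonenum >= 1000000000;

-- then add the CHECK constraint and the index in one pass over the
-- loaded data, instead of paying for them on every inserted row
ALTER TABLE allcalls_0 ADD CONSTRAINT allcalls_0_phonenum_check
    CHECK ( phonenum < 1000000000 );
CREATE INDEX allcalls_0_phonenum ON allcalls_0 (phonenum);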