Re: copy vs. C function

Merlin Moncure <mmoncure@xxxxxxxxx> · Mon, 12 Dec 2011 10:38:04 -0600

On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson <jnelson+pgsql@xxxxxxxxxxx> wrote:
> I was experimenting with a few different methods of taking a line of
> text, parsing it, into a set of fields, and then getting that info
> into a table.
>
> The first method involved writing a C program to parse a file, parse
> the lines and output newly-formatted lines in a format that
> postgresql's COPY function can use.
> End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
> output new data to new file -- 4 seconds, COPY new file -- 10
> seconds).
>
> The next approach I took was to write a C function in postgresql to
> parse a single TEXT datum into an array of C strings, and then use
> BuildTupleFromCStrings. There are 8 columns involved.
> Eliding the time it takes to COPY the (raw) file into a temporary
> table, this method took 120 seconds, give or take.
>
> The difference was /quite/ a surprise to me. What is the probability
> that I am doing something very, very wrong?
>
> NOTE: the code that does the parsing is actually the same,
> line-for-line, the only difference is whether the routine is called by
> a postgresql function or by a C program via main, so obviously the
> overhead is elsewhere.
> NOTE #2: We are talking about approximately 2.6 million lines.

Let me throw out an interesting third method I've been using to parse
delimited text files that might be useful in your case.  This is
useful when parsing text that is bad csv where values are not escaped
or there are lines, incomplete and/or missing records, or a huge
amount of columns that you want to rotate into a more normalized
structure based on columns position.

1. Import the data into a single column (containing the entire line)
staging table, feeding the COPY parser a bogus delimiter
2. 'Parse' the record with regexp_split_to_array (maybe in plpgsql function).
3. Either loop the array (in 9.1 use FOR-IN-ARRAY construct), or, if
you can work it into your problem, INSERT/SELECT, expanding the array
with a trick like used in information_schema._pg_expandarray so you
can hook logic on the array (column position).

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance