Re: postgres pg_bulkload c filter function in c programming

Francisco Olarte <folarte@xxxxxxxxxxxxxx> · Thu, 29 Dec 2016 21:14:36 +0100

On Thu, Dec 29, 2016 at 8:41 PM, rajmhn <rajmhn.ram@xxxxxxxxx> wrote:
> Thanks Francis.That seems to be a good solution.

Yep, but not for your problem as ...

>
> Thought to use pg_bulkload, a third party library instead of copy, where
> reject handling can be done in efficient way.

Mine was just an idea to do the part of the load you described
assuming pg_bulkload usage was optional. Not being it, it will not
work. MAYBE you can use the technique to preprocess the files for
pg_bulkload ( if possible this is nice, as the goood thing of
preprocessing them is you repeat until you get them right, no
DB-touchy ).

> Transformation(FILTER)
> functions can be implemented with any languages in pg_bulkload before it was
> loaded to table. SQL, C, PLs are ok, but you should write functions as fast
> as possible because they are called many times.

> In this case, function should be written in Perl and called inside the
> Postgressql function. Do you think that will work it out? But pg_bulkload is
> preferring C function over SQL function for performance.

I'm not familiar with pg_bulkload usage. I've read about it but all my
loading problemas have been solved better by using copy ( especially
factoring total time, I already know to use copy and a couple dozen
languages in which to write filters to preclean data for copy. In the
time I learn enough of pg_bulkload I can load filter and load a lot of
data ).

Regarding C vs perl, it seems pg_bulkload does server side processing.
In the server the funcion calling overhead is HUGE, specially when
transitioning between different languages. IMO the time spent doing
the data processing in perl would be 0 when compared with the time to
pass the data around to perl. C will be faster because the calling
barrier is smaller inside the server.

Just for data processing of things like you I've normally found
filters like the one I described can easily saturate an SSD array, and
the difference in time for processing is dwarfed by the difference in
time for developing the filter. In fact in any modern OS with write
through and readahead disk management the normal difference between
filtering in perl or C is perl may use 10% of 1 core, C 1%, perl
filter is developed in 15 minutes, C in an hour, and perl filter takes
some extra milliseconds to start. AND, if you are not familiar with
processing data in C you can easily code a slower solution than in
perl ( as perl was dessigned for this ).

> I will try this option as you suggested.

Just remember my option is not using pg_bulkload with perl stored
procedures. I cannot recommend anything if you use pg_bulkload.

I suggested using copy and perl to preclean the data. It just seemed
to me from the description of your problem you were using a too
complex tool. Now that you are introducing new terms, like reject
handling, I'll step out until I can make a sugestion ( don't bother to
define it for me, it seems a bulkload related term and I'm not able to
study that tool ).

FrancisCO Olarte.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general