Search Postgresql Archives

Re: Using COPY to import large xml file

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 06/25/2018 07:25 AM, Anto Aravinth wrote:
Thanks a lot. But I do got lot of challenges! Looks like SO data contains lot of tabs within itself.. So tabs delimiter didn't work for me. I thought I can give a special demiliter but looks like Postrgesql copy allow only one character as delimiter :(

I use | as it is rarely found in data itself.


Sad, I guess only way is to insert or do a through serialization of my data into something that COPY can understand.

On Mon, Jun 25, 2018 at 8:09 AM, Tim Cross <theophilusx@xxxxxxxxx <mailto:theophilusx@xxxxxxxxx>> wrote:



    On Mon, 25 Jun 2018 at 11:38, Anto Aravinth
    <anto.aravinth.cse@xxxxxxxxx <mailto:anto.aravinth.cse@xxxxxxxxx>>
    wrote:



        On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross
        <theophilusx@xxxxxxxxx <mailto:theophilusx@xxxxxxxxx>> wrote:


            Anto Aravinth <anto.aravinth.cse@xxxxxxxxx
            <mailto:anto.aravinth.cse@xxxxxxxxx>> writes:

            > Thanks for the response. I'm not sure, how long does this tool takes for
            > the 70GB data.
            >
            > I used node to stream the xml files into inserts.. which was very slow..
            > Actually the xml contains 40 million records, out of which 10Million took
            > around 2 hrs using nodejs. Hence, I thought will use COPY command, as
            > suggested on the internet.
            >
            > Definitely, will try the code and let you know.. But looks like it uses the
            > same INSERT, not copy.. interesting if it runs quick on my machine.
            >
            > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nayrat@xxxxxxxxxxxx <mailto:adrien.nayrat@xxxxxxxxxxxx>>
            > wrote:
            >
            >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
            >> > Hello Everyone,
            >> >
            >> > I have downloaded the Stackoverflow posts xml (contains all SO questions
            >> till
            >> > date).. the file is around 70GB.. I wanna import the data in those xml
            >> to my
            >> > table.. is there a way to do so in postgres?
            >> >
            >> >
            >> > Thanks,
            >> > Anto.
            >>
            >> Hello Anto,
            >>
            >> I used this tool :
            >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
            <https://github.com/Networks-Learning/stackexchange-dump-to-postgres>
            >>

            If you are using nodejs, then you can easily use the
            pg-copy-streams
            module to insert the records into your database. I've been
            using this
            for inserting large numbers of records from NetCDF files.
            Takes between
            40 to 50 minutes to insert 60 Million+ records and we are doing
            additional calculations on the values, not just inserting them,
            plus we are inserting into a database over the network and
            into a database which is
            also performing other processing.

            We found a significant speed improvement with COPY over
            blocks of insert
            transactions, which was faster than just individual inserts.
            The only
            downside with using COPY is that it either completely works or
            completely fails and when it fails, it can be tricky to work
            out which
            record is causing the failure. A benefit of using blocks of
            transactions
            is that you have more fine grained control, allowing you to
            recover from
            some errors or providing more specific detail regarding the
            cause of the
            error.


        Sure, let me try that.. I have a question here, COPY usually
        works when you move data from files to your postgres instance,
        right? Now in node.js, processing the whole file, can I use COPY
        programmatically like COPY Stackoverflow <calculated value at
        run time>? Because from doc:

        https://www.postgresql.org/docs/9.2/static/sql-copy.html
        <https://www.postgresql.org/docs/9.2/static/sql-copy.html>

        I don't see its possible. May be I need to convert the files to
        copy understandable first?

        Anto.




    Yes. Essentially what you do is create a stream and feed whatever
    information you want to copy into that stream. PG sees the. data as
    if it was seeing each line in a file, so you push data onto the
    stream wherre each item is seperated by a tab (or whatever). Here is
    the basic low level function I use (Don't know how the formatting
    will go!)

    async function copyInsert(sql, stringifyFN, records) {
       const logName = `${moduleName}.copyInsert`;
       var client;

       assert.ok(Array.isArray(records), "The records arg must be an
    array");
       assert.ok(typeof(stringifyFN) === "function", "The stringifyFN
    arg must be a function");
       return getClient()
         .then(c => {
           client = c;
           return new Promise(function(resolve, reject) {
             var stream, rs;
             var idx = 0;
             function done() {
               releaseClient(client);
               client = undefined;
               resolve(idx + 1);
             }

             function onError(err) {
               if (client !== undefined) {
                 releaseClient(client);
               }
               reject(new VError(err, `${logName}: COPY failed at record
    ${idx}`));
             }

             function arrayRead() {
               if (idx === records.length) {
                 rs.push(null);
               } else {
                 let rec = records[idx];
                 rs.push(stringifyFN(rec));
                 idx += 1;
               }
             }

             rs = new Readable;
             rs._read = arrayRead;
             rs.on("error", onError);
             stream = client.query(copyFrom(sql));
             stream.on("error", onError);
             stream.on("end", done);
             rs.pipe(stream);
           });
         })
         .catch(err => {
           throw new VError(err, `${logName} Failed COPY insert`);
         });
    }

    and I will call it like

    copyInsert(sql, stringifyClimateRecord, records)

    where sql and stringifycomateRecord arguments are

       const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
             + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
             + "vprp_09,vprp_15,wind_speed) FROM STDIN";

       function stringifyClimateRecord(rec) {
         return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
           +
    `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
       }

    The stringifyClimateRecord returns a record to be inserted as a
    'line' into the stream with values separated by tabs. Records is an
    array of data records where each record is an array.


-- regards,

    Tim

    --
    Tim Cross




--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux