Re: copy from questions

On 12/19/2012 01:06 PM, Kirk Wythers wrote:
Thanks for the reply, Steve. These suggestions are new to me, so I'd like to rephrase them back to you to make sure I understand the details.



On Dec 19, 2012, at 10:48 AM, Steve Crawford <scrawford@xxxxxxxxxxxxxxxxxxxx> wrote:

I suppose you could use a trigger to check each record before inserting, but that is likely to be inefficient for bulk loads. A quick bash loop is probably your best bet. Something along the lines of:

for inputfile in /infiledirectory/*.csv
do
   cat "$inputfile" | psql [connection-params] -c '\copy rawinput from stdin csv header...'
done

I get this… except for the "header…" bit. Here is my interpretation of your code, including my specific connection parameters.

#!/bin/sh
for inputfile in '/Volumes/disk7/b4warmed3/export/60min2/*.txt'
do
   cat inputfile | psql -p 54321 -h localhost -c '\copy rawinput FROM stdin WITH CSV HEADER DELIMTER AS ',' NULL AS 'NA '
done
You have a mixture of bash and PostgreSQL/psql issues.

Bash first: don't wrap the inputfile glob in quotes. Quoting prevents the shell from expanding your filespec into the list of files matching your pattern. I'm assuming that you have not populated your directory with filenames that contain spaces. If you have, either correct that or start Googling for how to deal with lists of space-containing file names.

Also, you will have problems using single-quotes within a string delimited by single-quotes. Combine that with the psql quirk where it appears you have to specify the delimiter before CSV (though you probably don't need to specify a delimiter at all if yours is a comma) and things will blow up. Try:

.... -c "\copy rawinput FROM stdin with delimiter as ',' NULL AS 'NA ' CSV HEADER"


I added single quotes around the path to the input files. Correct, right?


This imports everything into a "staging" table (I called it rawinput). From there you can create your final table with SELECT DISTINCT…

This bit needs to be a separate step, right? (rowid is the primary key)

SELECT DISTINCT ON (rowid) *
FROM rawinput;

From here, do I need another COPY FROM or some kind of INSERT statement?

I had assumed you had fully duplicated rows, in which case you could simply do:
INSERT INTO realtable SELECT DISTINCT * FROM rawinput;
(assuming realtable and rawinput have exactly the same structure and column ordering, of course)

If that is not the case, you will need to update your select statement to return one from among your duplicated IDs based on your desired criteria.
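For example, something along these lines (the ORDER BY needs one of your real columns, here a hypothetical timestamp column called ts, to decide which duplicate wins):

INSERT INTO realtable
SELECT DISTINCT ON (rowid) *
FROM rawinput
ORDER BY rowid, ts DESC;  -- keeps the newest row for each rowid

DISTINCT ON keeps the first row of each set sharing the same rowid, so the ordering after rowid controls which one that is.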



For speed make sure that you create your staging table as "unlogged".

I understand that I need to create the rawinput table first, but I am unfamiliar with the "UNLOGGED" option. I assume it makes things faster… Does it go something like:

CREATE TABLE UNLOGGED rawinput;
Something like that, but not exactly. The syntax is CREATE UNLOGGED TABLE... Except for UNLOGGED coming before TABLE, it's exactly like a regular create-table statement.

BEWARE: Unlogged is *NOT* a magic go-fast directive that comes at no cost. With normal tables, data is written to the transaction log, allowing the database to recover in the event of an unplanned shutdown. Unlogged tables, as the name implies, skip the write-ahead log, gaining speed at the expense of data safety. For bulk-load operations that can be redone if they fail, this is a reasonable approach. For normal operations where you expect your data to be well protected, it is not. Depending on a variety of factors, the speed improvement you see will range from "minimal" to "substantial".


Do I need to create all the variables (including types) in rawinput as well? If so, then I assume that I do not want rowid to have a primary key… or else I would be back where I started.
I would make rawinput exactly the same as your real table except without indexes or constraints. The easiest way is probably:
CREATE UNLOGGED TABLE rawinput AS SELECT * FROM realtable LIMIT 0;
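That gives you all the columns and types of realtable, but no rows, no indexes, and no constraints. The whole sequence, minus the shell loop, then looks something like:

CREATE UNLOGGED TABLE rawinput AS SELECT * FROM realtable LIMIT 0;
-- ... bulk-load all the files into rawinput with the \copy loop ...
INSERT INTO realtable SELECT DISTINCT * FROM rawinput;
DROP TABLE rawinput;  -- staging table no longer needed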




Cheers,
Steve


Thanks again,

Kirk


Cheers,
Steve

