Re: import CSV file to a table

Rob Sargent <robjsargent@xxxxxxxxx> · Wed, 8 Mar 2017 09:24:19 -0700

    Since bash has been bandied about in this thread I presume
        awk is available.  Here's how I would check just how 'csv'ish
        the incoming file is.

      awk -F"," '{a[$NF]++}END{for(i in a){printf "%d lines have
          %d fields(columns)\n", a[i], i}}' csvfilename

    If this doesn't produce one line you have to suspect quoted
        values including commas (or what ever char you choose). then you
        need a real csv parser.

    If just one line, I'ld use cut to get rid of unwanted
        columns, then let COPY do it's thing

    On 03/08/2017 09:13 AM, Karl Czajkowski
      wrote:

      On Mar 08, John McKown modulated:
...

        I agree. I went with a "pure BASH" approach because it is what the
user asked for & I wasn't sure what language she might be comfortable
with. I use PERL a lot. Or maybe I should say that I abuse PERL a lot.
Such as a PERL script with writes out another PERL script, based on
some input files & parameters, then runs the just written PERL script,
which does the load into a PostgreSQL database (multiple tables). Ya, a
bit perverted.

      Well, you could follow a similar meta-programming/code-generating
pattern to have the BASH script output a single SQL file to run with
psql. You could even generate PL/pgsql code to defer more data
processing to the database itself.

I think the only robust "pure BASH" approach is to use a temporary
table, so you aren't trying to parse CSV content in BASH. Using csvkit
sounds good if you can introduce these third-party dependencies.

With the temporary table, you can use SQL for most validation or data
interrogation, but you need to know at least enough schema information
in advance to form the COPY statement. Parsing the CSV header row to
plan your work puts you right back to requiring a robust CSV parser
unless you can constrain your input scenarios to only handle very
trivial headers.

If you play games with a defaulting serial column and fixed column
names like "id, c1, c2, ..., cN" for the temporary table, you might
use the id column as a DB-assigned "row number" during COPY and
validation. In this case, you could even tell Postgres there is no
header, and then let it parse the header as another data record so you
can use SQL statements to determine the actual header names and
ordering in the input.  But this still requires knowing the column
count in advance of the COPY.

I also think using something like Python with structured data
processing would be wiser, unless you know enough about the schema in
advance to avoid any CSV parsing on the client side.

Karl