Re: \COPY ... CSV with hex escapes

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Tue, 13 Jul 2010 12:09:03 +0800

On 08/07/10 17:42, Alban Hertroys wrote:
> On 8 Jul 2010, at 4:21, Craig Ringer wrote:
> 
>> Yes, that's ancient. It is handled quite happily by \copy in csv mode,
>> except that when csv mode is active, \xnn escapes do not seem to be
>> processed. So I can have *either* \xnn escape processing *or* csv-style
>> input processing.
>>
>> Anyone know of a way to get escape processing in csv mode?
> 
> 
> And what do those hex-escaped bytes mean? Are they in text strings? AFAIK CSV doesn't contain any information about what encoding was used to create it, so it could be about anything; UTF-8, Win1252, ISO-8859-something, or whatever Sybase was using.
> 
> I'm just saying, be careful what you're parsing there ;)

Thanks for that. In this case, the escapes are just "bytes" - what's
important is that, after unescaping, the CSV data is interpreted as
latin-1. OK, Windows-1252, but close enough.

In the end Python's csv module did the trick. I just pulled in the CSV
data, and spat out Postgresql-friendly COPY format so that I didn't need
to use the COPY ... CSV modifier and Pg would interpret the escapes
during input.

In case anyone else needs to deal with this format, here's the program I
used.

-- 
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/
#!/usr/bin/env python
import os
import sys
import csv

class DialectSybase(csv.Dialect):
    delimiter = ','
    doublequote = True
    escapechar = None
    quotechar = '\''
    quoting = csv.QUOTE_MINIMAL
    lineterminator = '\n'

class DialectPgCOPY(csv.Dialect):
    delimiter = '\t'
    doublequote = False
    escapechar = None
    quotechar = None
    quoting = csv.QUOTE_NONE
    lineterminator = '\n'

#class DialectPgCOPY(csv.Dialect):
#    delimiter = '\t'
#    doublequote = True
#    escapechar = '\\'
#    quotechar = '\''
#    quoting = csv.QUOTE_NONE
#    lineterminator = '\n'

def unescape_item(item):
    ''' noop so far '''
    #if item.find("\\X") >= 0:
        #print repr(item)
    #return item
    return item.replace("\\X","\\x")

def unescape_row(row):
    newrow = []
    for item in row:
        newitem = item
        if type(item) == str:
            newitem = unescape_item(item)
        newrow.append(newitem)
    return newrow

def main(infn, outfn):
    infile = open(infn,'r')
    outfile = open(outfn,'w')

    r = csv.reader( infile, dialect=DialectSybase )
    w = csv.writer( outfile, dialect=DialectPgCOPY )

    for row in r:
        w.writerow(unescape_row(row))

if __name__ == '__main__':
    print "customers"
    main('customer.txt', 'customer_unescaped.txt')
    print "class"
    main('class.txt', 'class_unescaped.txt')
    print "orders"
    main('orders.txt', 'orders_unescaped.txt')
    print "items"
    main('items.txt', 'items_unescaped.txt')
-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general