On Thu, May 5, 2016 at 2:13 PM, Johann Spies <johann.spies@xxxxxxxxx> wrote:
> Thanks, Arjen.
>
> On 29 April 2016 at 07:01, Arjen Nienhuis <a.g.nienhuis@xxxxxxxxx> wrote:
>>
>> > The options I am considering are:
>> >
>> > 1. Unpack the individual records (there will be more than 50 million)
>> > using something like Python with lxml and psycopg2, and insert them after
>> > dropping all indexes and triggers on the table.
>> >
>> > 2. Unpack the individual records, write a (very) large tsv-file, and
>> > then insert it using 'copy'.
>> >
>>
>> The fastest way I found is to combine these two: use iterparse from lxml
>> together with load_rows and COPY from py-postgresql:
>>
>> http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements
>>
>> That way you can stream the data.
>
> I did not know about py-postgresql, as I am a Python 2.7 user.
>
> I am excited about the possibilities py-postgresql offers. After a bit of
> experimenting with the streaming-copy option, I ran into the same problem
> I had with Python 2: PostgreSQL rejecting the XML part of the record.
>
> How do you handle the conversion from string to bytes and back, and the
> presence of quotes within the XML?
>
> I tried this to experiment with just 10 records for a start:
>
>     ut = element.xpath('.//t:UID/text()', namespaces=namespaces)[0]
>     x = etree.tostring(element)
>     y = x.decode(encoding='UTF-8').replace("'", "''")
>     s = '\t'.join([gzipf, filename, ut, y])
>     t = s.encode(encoding='UTF-8')
>     rows.append(t)
>     count += 1
>     element.clear()
>     gc.collect()
>     if count == 10:
>         import pdb; pdb.set_trace()
>         xmlin.load_rows(rows)
>         rows = []
>         f.close()
>         exit(0)
>
> which ends with an error:
>
>     postgresql.exceptions.XMLContentError: invalid XML content
>       CODE: 2200N
>       LOCATION: File 'xml.c', line 1551, in xml_ereport from SERVER
>     CONTEXT: COPY annual, line 1, column xml: "<REC
>     xmlns="http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord"
>     r_id_disclaimer="R..."
>     DETAIL: line 1: Premature end of data in tag REC line 1

If I had to guess, it's because of \n characters in the XML. COPY's default
text format treats a newline as the end of a row, so an XML value that spans
several lines gets cut off mid-tag.

I used "WITH (FORMAT csv)" and quoted the fields:

    def q(v):
        return b'"' + v.replace(b'"', b'""') + b'"'

    return b','.join(q(f) for f in fields) + b'\n'

In the end I also had some other problems with the XML (namespaces), so I used:

    etree.tostring(element, method='c14n', exclusive=True)

> With Python 2.7 I tried the same technique using subprocess and a call to
> psql to pipe the data to PostgreSQL, ending with the same error.

Maybe you can show a few lines of the output.

> Thanks again for the reference to python-postgresql.
>
> Have a good day.
>
> Johann

Regards,
Arjen
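Putting the suggestions from this thread together (iterparse, CSV-quoted COPY
rows, and c14n serialization), a minimal sketch might look like the following.
It assumes py-postgresql, a table annual(gzipf text, filename text, ut text,
xml xml) loosely matching the error message above, and placeholder values for
the connection string, file name and gzip name; adapt those to your own schema
and setup.

    import postgresql
    from lxml import etree

    NS = {'t': 'http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord'}

    def q(v):
        # CSV-quote a bytes field: double any embedded quotes and wrap the
        # whole value in quotes, so newlines inside the XML survive COPY.
        return b'"' + v.replace(b'"', b'""') + b'"'

    def csv_rows(xml_file, gzipf, filename):
        # Stream <REC> elements without loading the whole file into memory.
        for _, element in etree.iterparse(xml_file, tag='{%s}REC' % NS['t']):
            ut = element.xpath('.//t:UID/text()', namespaces=NS)[0]
            # c14n makes each record a self-contained document carrying the
            # namespace declarations it needs.
            xml_bytes = etree.tostring(element, method='c14n', exclusive=True)
            fields = [gzipf.encode(), filename.encode(), ut.encode(), xml_bytes]
            yield b','.join(q(f) for f in fields) + b'\n'
            element.clear()

    db = postgresql.open('pq://user@localhost/dbname')   # placeholder DSN
    copy = db.prepare("COPY annual (gzipf, filename, ut, xml) "
                      "FROM STDIN WITH (FORMAT csv)")
    with open('records.xml', 'rb') as f:                  # placeholder file name
        copy.load_rows(csv_rows(f, 'dump.gz', 'records.xml'))

Because csv_rows is a generator, load_rows should be able to consume it row by
row, so the full 50 million records never have to sit in memory at once.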