On Thu, May 5, 2016 at 2:13 PM, Johann Spies <johann.spies@xxxxxxxxx> wrote:
> Thanks, Arjen.
>
> On 29 April 2016 at 07:01, Arjen Nienhuis <a.g.nienhuis@xxxxxxxxx> wrote:
>>
>> > The options I am considering are:
>> >
>> > 1. Unpack the individual records (there will be more than 50 million)
>> > using something like Python with lxml and psycopg2, and insert them after
>> > dropping all indexes and triggers on the table.
>> >
>> > 2. Unpack the individual records, write a (very) large tsv-file, and
>> > then insert it using 'copy'.
>> >
>>
>> The fastest way I found is to combine these two: use iterparse from lxml
>> together with load_rows and COPY from py-postgresql:
>>
>> http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements
>>
>> That way you can stream the data.
>
> I did not know about py-postgresql, as I am a Python 2.7 user.
>
> I am excited about the possibilities py-postgresql offers. After a bit of
> experimenting with the streaming-copy option, I ran into the same problem
> I had with Python 2: PostgreSQL rejecting the XML part of the record.
>
> How do you handle the conversion from string to bytes and back, and the
> presence of quotes within the XML?
>
> I tried this to experiment with just 10 records for a start:
>
>     ut = element.xpath('.//t:UID/text()', namespaces=namespaces)[0]
>     x = etree.tostring(element)
>     y = x.decode(encoding='UTF-8').replace("'", "''")
>     s = '\t'.join([gzipf, filename, ut, y])
>     t = s.encode(encoding='UTF-8')
>     rows.append(t)
>     count += 1
>     element.clear()
>     gc.collect()
>     if count == 10:
>         import pdb; pdb.set_trace()
>         xmlin.load_rows(rows)
>         rows = []
>         f.close()
>         exit(0)
>
> which ends with an error:
>
>     postgresql.exceptions.XMLContentError: invalid XML content
>       CODE: 2200N
>       LOCATION: File 'xml.c', line 1551, in xml_ereport from SERVER
>     CONTEXT: COPY annual, line 1, column xml: "<REC
>     xmlns="http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord"
>     r_id_disclaimer="R..."
>     DETAIL: line 1: Premature end of data in tag REC line 1

If I had to guess, it's because of \n characters in the XML. COPY's default
text format treats a newline as the end of a row, so an XML value that spans
several lines gets cut off mid-tag.

I used "WITH (FORMAT csv)" and quoted the fields:

    def q(v):
        return b'"' + v.replace(b'"', b'""') + b'"'

    return b','.join(q(f) for f in fields) + b'\n'

In the end I also had some other problems with the XML (namespaces), so I used:

    etree.tostring(element, method='c14n', exclusive=True)

> With Python 2.7 I tried the same technique using subprocess and a call to
> psql to pipe the data to PostgreSQL, ending with the same error.

Maybe you can show a few lines of the output.

> Thanks again for the reference to python-postgresql.
>
> Have a good day.
>
> Johann

Regards,
Arjen
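Putting the suggestions from this thread together (iterparse, CSV-quoted COPY
rows, and c14n serialization), a minimal sketch might look like the following.
It assumes py-postgresql, a table annual(gzipf text, filename text, ut text,
xml xml) loosely matching the error message above, and placeholder values for
the connection string, file name and gzip name; adapt those to your own schema
and setup.

    import postgresql
    from lxml import etree

    NS = {'t': 'http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord'}

    def q(v):
        # CSV-quote a bytes field: double any embedded quotes and wrap the
        # whole value in quotes, so newlines inside the XML survive COPY.
        return b'"' + v.replace(b'"', b'""') + b'"'

    def csv_rows(xml_file, gzipf, filename):
        # Stream <REC> elements without loading the whole file into memory.
        for _, element in etree.iterparse(xml_file, tag='{%s}REC' % NS['t']):
            ut = element.xpath('.//t:UID/text()', namespaces=NS)[0]
            # c14n makes each record a self-contained document carrying the
            # namespace declarations it needs.
            xml_bytes = etree.tostring(element, method='c14n', exclusive=True)
            fields = [gzipf.encode(), filename.encode(), ut.encode(), xml_bytes]
            yield b','.join(q(f) for f in fields) + b'\n'
            element.clear()

    db = postgresql.open('pq://user@localhost/dbname')   # placeholder DSN
    copy = db.prepare("COPY annual (gzipf, filename, ut, xml) "
                      "FROM STDIN WITH (FORMAT csv)")
    with open('records.xml', 'rb') as f:                  # placeholder file name
        copy.load_rows(csv_rows(f, 'dump.gz', 'records.xml'))

Because csv_rows is a generator, load_rows should be able to consume it row by
row, so the full 50 million records never have to sit in memory at once.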