Steve Crawford wrote:
okparanoid@xxxxxxx wrote:
Hello
I have a Python script that updates 600,000 rows in one table of my Postgres
database from a CSV file, and the transaction takes me 5 hours...
Let's see if I guessed correctly.
Your Python script is stepping through a 600,000 row file and updating
information in a table (of unknown rows/columns) by making 600,000
individual updates all wrapped in a big transaction. If correct, that
means you are doing 600,000/(3,600 * 5) = 33 queries/second. If this
is correct, I'd first investigate simply loading the csv data into a
temporary table, creating appropriate indexes, and running a single
query to update your other table.
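
(For illustration only, using made-up names: a production table "prices" keyed on item_id, and a staging table for the CSV. The load-and-update approach could look something like this:)

  BEGIN;

  -- Staging table holding the raw CSV contents; dropped automatically at commit.
  CREATE TEMP TABLE staging (item_id integer, price numeric) ON COMMIT DROP;

  -- Server-side COPY reads a file on the database server; from psql,
  -- \copy can read a client-side file instead.
  COPY staging FROM '/tmp/update.csv' WITH CSV;

  CREATE INDEX staging_item_id_idx ON staging (item_id);
  ANALYZE staging;

  -- One set-based update instead of 600,000 individual UPDATE statements.
  UPDATE prices p
     SET price = s.price
    FROM staging s
   WHERE p.item_id = s.item_id;

  -- (COMMIT comes after the insert step discussed below.)
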
I can try this. The problem is that I have to do an insert when the update
doesn't affect any row (because the row doesn't exist yet). The number of rows
affected by inserts is small compared to the number of updated rows (it was 0
when I tested my script). With a temporary table I could update all the rows
that already exist, and then insert the rows that are in the temporary table
but not in the production table with a 'not in' statement. Is this a correct
way to do it?
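
(For illustration, continuing the hypothetical staging table from the sketch above: the insert-the-missing-rows step could look like this, using NOT EXISTS rather than NOT IN, since NOT IN silently matches nothing if the subquery can return a NULL.)

  -- Insert the rows that exist only in the staging table.
  INSERT INTO prices (item_id, price)
  SELECT s.item_id, s.price
    FROM staging s
   WHERE NOT EXISTS (SELECT 1 FROM prices p WHERE p.item_id = s.item_id);

  COMMIT;

(One caveat: if other sessions can insert the same keys while this runs, the INSERT can still hit duplicates; inside a maintenance window that is usually not a concern.)
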
First, when I run htop I see that the memory used never goes above 150 MB.
In that case I don't understand why setting the shmall and shmmax kernel
parameters to 16 GB (the server has 32 GB) speeds up the transaction a lot
compared to setting shmall and shmmax to only 2 GB?!
Are you saying that you did this and the performance improved or you
are wondering if it would?
Yes, I did this and the performance improved. I don't understand why. Sorry
for my poor English...
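
(A hedged side note: shmmax and shmall are only upper limits on the shared memory segment the kernel will hand out; what actually uses that memory is PostgreSQL's shared_buffers, plus wal_buffers and a few smaller areas. So the speedup presumably came from a larger shared_buffers value that the old 2 GB limit would not have let the server allocate. What the server is really running with can be checked from any session:)

  SHOW shared_buffers;
  SHOW wal_buffers;
  SHOW work_mem;
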
The script runs in a single transaction and pauses from time to time to give
Postgres time to write the data to disk.
This doesn't make sense. If the transaction completes successfully
then PostgreSQL has committed the data to disk (unless you have done
something non-standard and not recommended like turning off fsync). If
you are adding pauses between updates, don't do that - it will only
slow you down. If the full transaction doesn't complete, all updates
will be thrown away anyway and if it does complete then they were
committed.
Sorry, the pauses are not caused by the Python script but by Postgres itself.
It does roughly 3,000 updates on average and then pauses for 2 minutes (htop
tells me a postgres process is in a writing state; I don't really know whether
it is actually doing I/O writes). What I mean is: if it writes things to disk
in the middle of the transaction, I don't understand why?!
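
(For what it is worth: periodic write bursts in the middle of a long transaction are usually checkpoints and background-writer activity. Dirty pages in shared_buffers get flushed to free space and to bound crash-recovery time, even though the transaction has not committed yet. The settings that control how often and how smoothly this happens can be inspected like this; checkpoint_segments exists on servers before 9.5, and later releases use max_wal_size instead:)

  SHOW checkpoint_segments;           -- pre-9.5 servers only
  SHOW checkpoint_timeout;
  SHOW checkpoint_completion_target;  -- available from 8.3 on
  SHOW bgwriter_delay;

(Raising checkpoint_segments, or max_wal_size on newer servers, makes checkpoints less frequent, and a higher checkpoint_completion_target spreads each one out over more time; both are changed in postgresql.conf.)
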
If the data were only written at the end of the transaction, would the
performance be better? In production I want the data written to disk regularly
to prevent losing data, but is there any point in writing temporary data to
disk in the middle of a transaction???
See above. Actual disk IO is handled by the server. PostgreSQL is good
at the "D" in ACID. If your transaction completes, the data has been
written to disk. Guaranteed.
Cheers,
Steve
What I am trying to say is that in "normal" use (not when I run this
maintenance script) I want to be sure that the insert and update requests are
written to disk. They are small (1, 2 or 3 rows affected) but there are a lot
of them, issued by many users. However, just for this maintenance script,
perhaps I could apply some other tweaks to reduce the I/O stress during the
transaction?!
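
(As a hedged sketch of the kind of per-session tweaks that are possible: the SET commands below affect only the connection that issues them, so they cannot weaken durability for the other users. The values are placeholders; the memory-unit syntax needs 8.2 or later and synchronous_commit needs 8.3 or later.)

  SET work_mem = '256MB';              -- helps the big join in a set-based UPDATE
  SET maintenance_work_mem = '512MB';  -- faster index build on a staging table
  SET synchronous_commit = off;        -- skips the commit-time fsync wait; mainly
                                       -- useful if the script commits in batches,
                                       -- and on a crash it risks losing only the
                                       -- last few transactions, never corruption
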
Cheers,
Loic