On Fri, Apr 26, 2013 at 7:01 PM, Gavin Flower
<GavinFlower@xxxxxxxxxxxxxxxxx> wrote:
> Please do not top post; the convention on these lists is to add stuff at
> the end, apart from comments interspersed to make use of appropriate
> context!

Noted, thanks. Anyway, any performance hints are greatly appreciated.

> On 27/04/13 13:35, Yang Zhang wrote:
>>
>> We're using PostgreSQL 9.1.9 on Ubuntu 12.04 on EBS volumes on
>> m1.xlarge instances, which have:
>>
>> 15 GiB memory
>> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
>> 64-bit platform
>>
>> (Yes, we're moving to EBS-Optimized instances + Provisioned IOPS
>> volumes, but preliminary benchmarks suggest this won't give us as much
>> of a boost as refactoring the way we execute these bulk updates in our
>> application.)
>>
>> On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
>> <GavinFlower@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>> On 27/04/13 12:14, Yang Zhang wrote:
>>>
>>> It currently takes up to 24h for us to run a large set of UPDATE
>>> statements on a database, which are of the form:
>>>
>>>     UPDATE table SET field1 = constant1, field2 = constant2, ...
>>>     WHERE id = constid
>>>
>>> (We're just overwriting fields of objects identified by ID.)
>>>
>>> The tables each have a handful of indexes and no foreign key
>>> constraints. No COMMIT is made until the end.
>>>
>>> It takes 2h to import a `pg_dump` of the entire DB. This seems like a
>>> baseline we should reasonably target.
>>>
>>> Short of producing a custom program that somehow reconstructs a
>>> dataset for PostgreSQL to re-import, is there anything we can do to
>>> bring the bulk UPDATE performance closer to that of the import? (This
>>> is an area that we believe log-structured merge trees handle well,
>>> but we're wondering if there's anything we can do within PostgreSQL.)
>>>
>>> Some ideas:
>>>
>>> - dropping all non-ID indexes and rebuilding them afterward?
>>> - increasing checkpoint_segments, but does this actually help
>>>   sustained long-term throughput?
>>> - using the techniques mentioned here? (Load the new data as a table,
>>>   then "merge in" the old data where the ID is not found in the new
>>>   data.)
>>>
>>>   <http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@xxxxxxxxxxxxxx>
>>>
>>> Basically there are a bunch of things to try, and we're not sure
>>> which are the most effective or whether we're overlooking other
>>> things. We'll be spending the next few days experimenting, but we
>>> thought we'd ask here as well.
>>>
>>> Thanks.
>>>
>>>
>>> People will need to know your version of Postgres and which operating
>>> system, etc., plus details of CPU, RAM, and disks, as well as what
>>> changes you have made to postgresql.conf...
>>>
>>> I would be inclined to DROP all indexes and reCREATE them later.
>>>
>>> Updating a row might lead to the new row version being added in a new
>>> disk page, so I suspect that updates will hit every index associated
>>> with the table (with the possible exception of partial indexes).
>>>
>>> Running too many updates in one transaction may mean that Postgres
>>> needs to use disk work files.
>>>
>>> Depending on RAM etc., it may pay to increase some of the settings
>>> that affect how Postgres uses RAM; some of these are per session.
>>>
>>>
>>> Cheers,
>>> Gavin
>>
>>
>> --
>> Yang Zhang
>> http://yz.mit.edu/

--
Yang Zhang
http://yz.mit.edu/
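
As a rough sketch of how two of the ideas above could fit together
(dropping the non-ID indexes, and loading the new values into a table
before merging them in with a single set-based UPDATE), something along
the following lines might work. All names here (mytable, staging,
field1, field2, the index names, and the CSV path) are hypothetical
placeholders rather than anything from the actual schema.

    BEGIN;

    -- Drop the non-ID indexes first so each updated row doesn't have to
    -- maintain them; they are rebuilt in bulk at the end.
    -- (Hypothetical index names.)
    DROP INDEX IF EXISTS mytable_field1_idx;
    DROP INDEX IF EXISTS mytable_field2_idx;

    -- Load the new values in bulk into a staging table.
    -- UNLOGGED (available from 9.1 on) skips WAL for this table.
    CREATE UNLOGGED TABLE staging (
        id     bigint PRIMARY KEY,
        field1 text,
        field2 text
    );

    -- Server-side COPY needs superuser; from psql, \copy does the same
    -- thing client-side. The file path is a placeholder.
    COPY staging FROM '/path/to/new_values.csv' WITH (FORMAT csv);
    ANALYZE staging;

    -- Apply all changes in one set-based UPDATE instead of millions of
    -- per-row statements.
    UPDATE mytable AS t
    SET    field1 = s.field1,
           field2 = s.field2
    FROM   staging AS s
    WHERE  t.id = s.id;

    -- Rebuild the indexes once, after all rows have been updated.
    CREATE INDEX mytable_field1_idx ON mytable (field1);
    CREATE INDEX mytable_field2_idx ON mytable (field2);

    DROP TABLE staging;
    COMMIT;

The point of this shape is that PostgreSQL then does the work in a few
large passes (one COPY, one UPDATE, one build per index) instead of
maintaining every index for each of millions of individually updated
rows.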