We saw the same performance problems when the new hardware was running CentOS 6.3 with a 2.6.32-279.19.1.el6.x86_64 kernel and when it was matched to the OS/kernel of the old hardware, which was CentOS 5.8 with a 2.6.18-308.11.1.el5 kernel.
Yes, the new hardware was thoroughly tested with bonnie++ before being put into service and has been tested since. We have been unable to find any interesting differences between the old and new hardware in our bonnie++ comparisons. pgbench was not used prior to our discovery of the problem but has been used extensively since. FWIW, this server ran a Zabbix database (much lower load requirements) for a month without any problems before taking over as our primary production DB server.
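For reference, our disk runs looked roughly like this (the path, size, and host name here are illustrative, not our exact invocation):

    # bonnie++ against the data array, run as the postgres user
    # (-n 0 skips the file-creation tests; size is an example only)
    bonnie++ -d /var/lib/pgsql/bench -s 64g -n 0 -m new-db-server -u postgres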
After quite a bit of trial and error we were able to find a pgbench test (two 300-concurrent-client sessions doing selects, alongside one 50-concurrent-client session doing the standard pgbench query rotation) that showed the new hardware underperforming the old hardware: about 1000 TPS less (2300 vs. 1300) on the 50-client run, and about 1000 TPS less on each of the select-only runs (~24000 vs. ~23000). Less demanding tests were handled equally well by both the old and new servers; more demanding tests tipped both over with very similar efficacy.
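In case anyone wants to reproduce the shape of that test, it was roughly the following (the client counts are ours; durations, thread counts, and the database name are illustrative):

    # two select-only runs, 300 clients each, in parallel
    pgbench -n -S -c 300 -j 8 -T 600 pgbench &
    pgbench -n -S -c 300 -j 8 -T 600 pgbench &
    # plus one run of the standard TPC-B-like rotation with 50 clients
    pgbench -n -c 50 -j 4 -T 600 pgbench
    wait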
Hopefully that fleshes things out a bit more.
Please let me know if I can provide additional information.
thanks
steve
On Fri, Mar 1, 2013 at 8:41 AM, Craig James <cjames@xxxxxxxxxxxxxx> wrote:
On Fri, Mar 1, 2013 at 1:52 AM, Steven Crandell <steven.crandell@xxxxxxxxx> wrote:
Recently I moved my ~600G / ~15K TPS database from a 48 core @ 2.0GHz server with 512GB RAM on 15K RPM disks to a newer 64 core @ 2.2GHz server with 1TB of RAM on 15K RPM disks.

The move was from v9.1.4 to v9.1.8 (eventually also tested with v9.1.4 on the new hardware) and was done via base backup followed by slave promotion. All postgres configurations were matched exactly, as were system and kernel parameters.

On the first day that this server saw production load levels it absolutely fell on its face. We ran an exhaustive battery of tests, including failing over to the new (hardware-matched) slave, only to find the problem happening there also. After several engineers all confirmed that every postgres and system setting matched, we eventually migrated back onto the original hardware using exactly the same methods and settings that had been used while the data was on the new hardware. As soon as we brought the DB live on the older (supposedly slower) hardware, everything started running smoothly again.
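(The migration itself was the usual streaming-replication dance; a minimal sketch, with the host name and data directory as placeholders:)

    # on the new server: clone the primary with 9.1's pg_basebackup
    pg_basebackup -h old-primary -U replication -D /var/lib/pgsql/9.1/data -P
    # point recovery.conf at the primary, let it catch up, then promote
    pg_ctl -D /var/lib/pgsql/9.1/data promote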
As far as we were able to gather in the frantic moments of downtime, hundreds of queries were hanging while trying to COMMIT. This in turn caused new queries to back up as they waited for locks, and so on.

Prior to failing back to the original hardware, we found interesting posts about people having problems similar to ours due to NUMA, and several suggested that they had solved their problem by setting vm.zone_reclaim_mode = 0. Unfortunately we experienced exactly the same problems even after turning off zone_reclaim_mode.

We did extensive testing of the I/O on the new hardware (both data and log arrays) before it was put into service and have done even more comprehensive testing since it came out of service. The short version is that the disks on the new hardware are faster than the disks on the old server. In one test run we even set the server to write WALs to shared memory instead of to the log LV, just to help rule out I/O problems, and only saw a marginal improvement in overall TPS numbers.

At this point we are extremely confident that if we have a configuration problem, it is not with any of the usual postgresql.conf/sysctl.conf suspects. We are pretty sure that the problem is being caused by the hardware in some way, but that it is not the result of a hardware failure (e.g. a degraded array, raid card self-tests, or what have you).

Given that we're dealing with new hardware and the fact that this still acts a lot like a NUMA issue, are there other settings we should be adjusting to deal with possible performance problems associated with NUMA?
Does this sound like something else entirely? Any thoughts appreciated.
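(For completeness, turning off zone reclaim amounted to the following; this is just the standard sysctl knob:)

    # check the current value
    cat /proc/sys/vm/zone_reclaim_mode
    # disable zone reclaim at runtime
    sysctl -w vm.zone_reclaim_mode=0
    # persist it across reboots
    echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf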
One piece of information that you didn't supply ... sorry if this is obvious, but did you run the usual range of performance tests using pgbench, bonnie++ and so forth to confirm that the new server was working well before you put it into production? Did it compare well on those same tests to your old hardware?
Craig

thanks,
Steve