My guess would be that your server upgrade wasn't the upgrade you thought it was.
You network latency could definitely be the cause of most of this. The problem is you're not measuring this from the server side. It's not only going to impact connect time, but you're going to get your data a bit slower as well. I'm assuming your pgbouncer is installed on the postgres box. Do you have the pgbouncer addon installed? If you're looking to isolate network transfer, pgbouncer should give the statistics like avg query time to see if they line up with what you're seeing in your APM.
Also, you reduced your number of cores, which is a big problem because basically you just cut your max query capacity in half. Assuming a dedicated box, each CPU can only process up to one postgres query at a time. Previously, you could process up to 8 queries simultaneously, whereas now you can only do 4. Now, since most of your queries are probably in the ms, that can still be quite a bit of queries in a second time frame and you may never hit 4 going at the same exact time, but without seeing your pgbouncer config, this may actually be happening based on all the idle connections you're seeing.
If ALL your connections are coming from a single pgbouncer locally on the postgres box, then you can use your server resources better by setting max_connections to 6 (the number of cores + one or two more for you to connect locally via pgsql). Then, set your default_pool_size in pgbouncer to 4 (number of cores) and reserve_pool_size to 0, and restart. This will keep the number of sessions limited to what your box is actually capable of doing and will help you avoid loading postgres down with more than it's capable of doing, which can make a bad situation worse.
My recommendation would be to go back to the old server if it's available. If not, get a new one in the same data center as your web servers with at least 8 cores to put you back where you were.
On Fri, Nov 4, 2016 at 7:55 AM Benjamin Toueg <btoueg@xxxxxxxxx> wrote:
I've noticed a network latency increase. Ping between web server and database : 0.6 ms avg before, 5.3 ms avg after -- it wasn't that big 4 days ago :(I've narrowed my investigation to one particular "Transaction" in terms of the NewRelic APM. It's basically the main HTTP request of my application.Looks like the ping impacts psycopg2:connect (see http://imgur.com/a/LDH1c): 4 ms up to 16 ms on average.That I can understand. However, I don't understand the performance decrease of the select queries on table1 (see https://i.stack.imgur.com/QaUqy.png): 80 ms up to 160 ms on averageSame goes for table 2 (see http://imgur.com/a/CnETs): 4 ms up to 20 ms on averageHowever, there is a commit in my request, and it performs better (see http://imgur.com/a/td8Dc): 12 ms down to 6 ms on average.I don't see how this can be due to network latency!I will provide a new bonnie++ benchmark when the requests per minute is at the lowest (remember I can only run benchmarks while the server is in use).Rick, what did you mean by kernel configuration? The OS is a standard Ubuntu 16.04:- Linux 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37 UTC 2016 x86_64 x86_64 x86_64 GNU/LinuxDo you think losing half the number of cores can explain my performance issue ? (AMD 8 cores down to Haswell 4 cores).Best Regards,BenjaminPS : I've edited the SO post http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better2016-11-04 1:05 GMT+01:00 Kevin Grittner <kgrittn@xxxxxxxxx>:On Thu, Nov 3, 2016 at 9:51 AM, Benjamin Toueg <btoueg@xxxxxxxxx> wrote:
>
> Stream gives substantially better results with the new server (before/after)
Yep, the new server can access RAM at about twice the speed of the old.
> I've run "bonnie++ -u postgres -d /tmp/ -s 4096M -r 1096" on both
> machines. I don't know how to read bonnie++ results (before/after)
> but it looks quite the same, sometimes better for the new,
> sometimes better for the old.
On most metrics the new machine looks better, but there are a few
things that look potentially problematic with the new machine: the
new machine uses about 1.57x the CPU time of the old per block
written sequentially ((41 / 143557) / (16 / 87991)); so if the box
becomes CPU starved, you might notice writes getting slower than on
the new box. Also, several of the latency numbers are worse -- in
some cases far worse. If I'm understanding that properly, it
suggests that while total throughput from a number of connections
may be better on the new machine, a single connection may not run
the same query as quickly. That probably makes the new machine
better for handling an OLTP workload from many concurrent clients,
but perhaps not as good at cranking out a single big report or
running dump/restore.
Yes, it is quite possible that the new machine could be faster at
some things and slower at others.