Hi Steven,
Sounds very familiar. Painfully familiar :(
But I really don't know. All I can see is that in this particular configuration the instance has 2 x Intel Xeon E5-2670 eight-core processors. I can't find any info on whether it's flex or round robin. AWS typically doesn't disclose the underlying hardware. The exception is the chip type on the higher-end instance types, which is where I got the info above.
Below is an excerpt from atop taken when the problem occurs. The CPUs jump to high sys usage; not sure if that is similar to what you saw?
How did you get it resolved in the end?
ATOP - ip-10-155-231-112 2013/04/02 01:25:40 ------ 2s elapsed
PRC | sys 70.15s | | user 8.19s | | | | | #proc 1015 | | #zombie 0 | | clones 0 | | | | | #exit 2 |
CPU | sys 3182% | | user 30% | | irq 1% | | | | idle 0% | | wait 0% | | | | steal 0% | | guest 0% |
cpu | sys 98% | | user 1% | | irq 1% | | | | idle 0% | | cpu000 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 96% | | user 4% | | irq 0% | | | | idle 0% | | cpu001 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu002 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 99% | | user 1% | | irq 0% | | | | idle 0% | | cpu003 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu004 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu005 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 98% | | user 2% | | irq 0% | | | | idle 0% | | cpu006 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 99% | | user 1% | | irq 0% | | | | idle 0% | | cpu007 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 99% | | user 1% | | irq 0% | | | | idle 0% | | cpu008 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu009 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 99% | | user 1% | | irq 0% | | | | idle 0% | | cpu010 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu011 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 99% | | user 1% | | irq 0% | | | | idle 0% | | cpu012 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 97% | | user 3% | | irq 0% | | | | idle 0% | | cpu013 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu014 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu015 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu016 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 82% | | user 18% | | irq 0% | | | | idle 0% | | cpu017 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu018 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu019 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu020 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu021 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu022 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu023 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu024 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu025 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu026 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu027 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 99% | | user 1% | | irq 0% | | | | idle 0% | | cpu028 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu029 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 100% | | user 0% | | irq 0% | | | | idle 0% | | cpu030 w 0% | | | | steal 0% | | guest 0% |
cpu | sys 99% | | user 1% | | irq 0% | | | | idle 0% | | cpu031 w 0% | | | | steal 0% | | guest 0% |
CPL | avg1 90.60 | | avg5 60.80 | | | avg15 39.77 | | | | csw 1011 | | intr 17568 | | | | | numcpu 32 |
MEM | tot 58.5G | | free 418.4M | | cache 45.0G | dirty 0.6M | | buff 5.8M | | slab 501.2M | | | | | | | |
SWP | tot 0.0M | | free 0.0M | | | | | | | | | | | | vmcom 49.8G | | vmlim 29.3G |
PAG | scan 1858 | | | | stall 0 | | | | | | | | swin 0 | | | | swout 0 |
NET | transport | tcpi 318 | | tcpo 392 | udpi 34 | | udpo 39 | tcpao 0 | | tcppo 2 | tcprs 0 | | tcpie 0 | tcpor 0 | | udpnp 0 | udpip 0 |
NET | network | | ipi 357 | | ipo 397 | ipfrw 0 | | deliv 357 | | | | | | | icmpi 0 | | icmpo 0 |
NET | eth0 ---- | | pcki 318 | pcko 358 | | si 200 Kbps | so 947 Kbps | | coll 0 | | mlti 0 | erri 0 | | erro 0 | drpi 0 | | drpo 0 |
NET | lo ---- | | pcki 39 | pcko 39 | | si 79 Kbps | so 79 Kbps | | coll 0 | | mlti 0 | erri 0 | | erro 0 | drpi 0 | | drpo 0 |
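In case it helps anyone compare captures like the one above without eyeballing 32 rows, here is a minimal Python sketch that pulls the per-CPU sys/user/idle figures out of pasted atop text and lists the CPUs that are pegged in sys. The function names and the 90% threshold are my own choices for illustration, and it assumes the exact column layout shown in this paste; other atop versions or terminal widths may lay the rows out differently.

```python
import re

# Matches one per-CPU atop row as pasted above, e.g.
# "cpu | sys 98% | | user 1% | | irq 1% | | | | idle 0% | | cpu000 w 0% | ..."
# The aggregate "CPU | sys 3182% ..." row is skipped: it starts with
# upper-case "CPU" and has no "cpuNNN w" field.
ROW = re.compile(
    r"^cpu \|.*?sys\s+(\d+)%.*?user\s+(\d+)%.*?idle\s+(\d+)%.*?cpu(\d+) w"
)


def parse_cpu_rows(text):
    """Return {cpu_number: (sys_pct, user_pct, idle_pct)} for per-CPU rows."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            sys_pct, user_pct, idle_pct, cpu = map(int, m.groups())
            rows[cpu] = (sys_pct, user_pct, idle_pct)
    return rows


def sys_bound_cpus(rows, threshold=90):
    """Return the sorted CPU numbers whose sys time is >= threshold percent."""
    return sorted(cpu for cpu, (s, _, _) in rows.items() if s >= threshold)


if __name__ == "__main__":
    # Two rows copied from the capture above.
    sample = (
        "cpu | sys 98% | | user 1% | | irq 1% | | | | idle 0% | | cpu000 w 0% | | | | steal 0% | | guest 0% |\n"
        "cpu | sys 82% | | user 18% | | irq 0% | | | | idle 0% | | cpu017 w 0% | | | | steal 0% | | guest 0% |\n"
    )
    rows = parse_cpu_rows(sample)
    print(rows)
    print(sys_bound_cpus(rows))
```

Feeding it the full capture would show at a glance whether every core is stuck in sys or only a subset, which is the part I would like to compare with what you saw.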
On Tue, Apr 2, 2013 at 3:07 AM, Steven Crandell <steven.crandell@xxxxxxxxx> wrote:
Armand,
All of the symptoms you describe line up perfectly with a problem I had recently when upgrading DB hardware. Everything ran fine until we hit some threshold somewhere, at which point the locks would pile up in the thousands just as you describe, all while we were not I/O bound. I was moving from a DELL 810 that used a flex memory bridge to a DELL 820 that used round robin on their quad core Intels. (Interestingly, we also found out that DELL is planning on rolling back to the flex memory bridge later this year.)
Any chance you could find out if your old processors might have been using flex while your new processors might be using round robin?
-s