On 02/06/2012 11:35 AM, Gregory Farnum wrote:
On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt<jaschut@xxxxxxxxxx> wrote:On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@xxxxxxxxxx> wrote:The above suggests to me that the slowdown is a result of requests not getting submitted at the same rate as when things are running well.Yeah, it really looks like that. My suggestions wouldn't help there. I do see that when things go well the number of writes per device is capped at ~200 writes per second and the throughput per device is ~100MB/sec. Is 100MB/sec the expected device throughput?Pretty much, at least for the outer tracks on a drive. I've seen ~108 MB/s with dd to a block device. Also, I've got 8 drives per SAS adapter with 6 Gb/s links, so it seems unlikely to me that my disk subsystem is any sort of significant bottleneck.Well, you might try changing your throttling settings on the OSDs. ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB) and is used to limit the amount of client data in memory (ie; messages are included in this throttler for their entire lifetime, not just while waiting for dispatch).
I've made a little progress isolating this. "osd client message size cap = 5000000" makes the stall completely reproducible (which also means I can reproduce on two different network types, ethernet and IPoIB.), and I am able to generate graphs of throttled/receive/process time for each request received by an OSD (see attached SVG plot). Such plots suggest to me my problem is caused by stalled receives. Using debug ms = 30 on my OSDs turns up instances of this: osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/25000000 osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader got front 128 osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0 osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304 osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304 osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073 osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073 which I take to mean that the reader thread sat in poll() for 56 secs, in this case. I was able to correlate such stalls with tcpdump output collected on clients. Here's an example from another run: 15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0 15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856 15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856 15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0 15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856 15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856 15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0 15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856 15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856 15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0 15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856 15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856 15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr 1096266], length 1448 15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr 1096266], length 1448 15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr 1096266], length 1448 15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr 1096266], length 1448 15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr 1096266], length 1448 15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr 1096266], length 1448 15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr 1096266], length 1448 15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr 1096266], length 1448 15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1096266], length 1448 15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0 15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 2896 15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23887857}], length 0 15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448 15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448 15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448 So in this case the client retransmitted for ~2 minutes with no response from the OSD. Note that during this time the client was talking to other OSDs on the same server. I want to try turning off GSO/GRO on my interfaces, but then I think I need to post to netdev... -- Jim
Attachment:
osd.0.msg-et.svg.bz2
Description: application/bzip