Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

On 02/06/2012 11:35 AM, Gregory Farnum wrote:
On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:
On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:

On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:



The above suggests to me that the slowdown is a result
of requests not getting submitted at the same rate as
when things are running well.


Yeah, it really looks like that. My suggestions wouldn't help there.

I do see that when things go well the number of writes per device is
capped at ~200 writes per second and the throughput per device is
~100MB/sec. Is 100MB/sec the expected device throughput?


Pretty much, at least for the outer tracks on a drive.  I've seen
~108 MB/s with dd to a block device.  Also, I've got 8 drives per
SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
disk subsystem is any sort of significant bottleneck.
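
Something along these lines should repeat that measurement; the
device name is a placeholder, and oflag=direct bypasses the page
cache so the drive itself is what gets measured:

  dd if=/dev/zero of=/dev/sdX bs=1M count=4096 oflag=direct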

Well, you might try changing your throttling settings on the OSDs.
ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
and is used to limit the amount of client data in memory (i.e., messages
are included in this throttler for their entire lifetime, not just
while waiting for dispatch).
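
For reference, both options can be set in ceph.conf, with underscores
in the option names becoming spaces.  A purely illustrative sketch
using the defaults quoted above (not a tuning recommendation):

[osd]
    ; defaults quoted above, shown only to illustrate the syntax
    ms dispatch throttle bytes = 104857600   ; 100 << 20
    osd max client bytes = 524288000         ; 500 << 20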



I've made a little progress isolating this.

"osd client message size cap =  5000000" makes the stall
completely reproducible (which also means I can reproduce
on two different network types, ethernet and IPoIB.), and I
am able to generate graphs of throttled/receive/process time
for each request received by an OSD (see attached SVG plot).

Such plots suggest to me that my problem is caused by stalled
receives.  Running with debug ms = 30 on my OSDs turns up
instances like this:

osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/25000000
osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader got front 128
osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304
osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073

which I take to mean that, in this case, the reader thread sat
in poll() for 56 seconds.
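
In case it's useful, here's the sort of throwaway script that finds
these gaps.  It's only a sketch: the hex token after the timestamp is
assumed to be the thread id, grep-style "file:lineno:" prefixes are
tolerated, and the 10-second threshold is arbitrary:

#!/usr/bin/env python
# Sketch: report long gaps between consecutive ".reader" events on the
# same messenger thread in an OSD log written with debug ms = 30.
import re, sys
from datetime import datetime

THRESHOLD = 10.0  # seconds; arbitrary cutoff for "stalled"
stamp = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) ([0-9a-f]+)')
last = {}  # thread id -> (timestamp, raw line)

for line in open(sys.argv[1]):
    if '.reader' not in line:
        continue
    m = stamp.search(line)
    if not m:
        continue
    ts = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S.%f')
    tid = m.group(2)
    if tid in last:
        gap = (ts - last[tid][0]).total_seconds()
        if gap > THRESHOLD:
            print('%6.1fs gap on thread %s' % (gap, tid))
            print('  before: %s' % last[tid][1].rstrip())
            print('  after:  %s' % line.rstrip())
    last[tid] = (ts, line)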

I was able to correlate such stalls with tcpdump output collected on
clients.  Here's an example from another run:

15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856
15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856
15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr 1096266], length 1448
15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr 1096266], length 1448
15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr 1096266], length 1448
15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr 1096266], length 1448
15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr 1096266], length 1448
15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr 1096266], length 1448
15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr 1096266], length 1448
15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr 1096266], length 1448
15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1096266], length 1448
15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 2896
15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448

So in this case the client retransmitted for ~2 minutes with no response from
the OSD.  Note that during this time the client was talking to other OSDs on the
same server.
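
(For reference, a client-side capture along these lines is enough to
produce this kind of trace; the interface name, snaplen, and port are
placeholders to adapt:

  tcpdump -i eth0 -n -s 128 -w client.pcap 'tcp port 6808'

then read it back with "tcpdump -n -r client.pcap".)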

I want to try turning off GSO/GRO on my interfaces, but then
I think I need to post to netdev...
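
If anyone else wants to try the same experiment first, I believe the
relevant knobs are per-interface via ethtool, along these lines (the
interface name is a placeholder):

  ethtool -K eth0 gso off gro off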

-- Jim

Attachment: osd.0.msg-et.svg.bz2
Description: application/bzip

