Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

On Thu, Feb 9, 2012 at 3:53 PM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:
> On 02/06/2012 11:35 AM, Gregory Farnum wrote:
>>
>> On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt<jaschut@xxxxxxxxxx>  wrote:
>>>
>>> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
>>>>
>>>>
>>>> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@xxxxxxxxxx>    wrote:
>>>
>>>
>>>
>>>>>
>>>>> The above suggests to me that the slowdown is a result
>>>>> of requests not getting submitted at the same rate as
>>>>> when things are running well.
>>>>>
>>>>
>>>> Yeah, it really looks like that. My suggestions wouldn't help there.
>>>>
>>>> I do see that when things go well the number of writes per device is
>>>> capped at ~200 writes per second and the throughput per device is
>>>> ~100MB/sec. Is 100MB/sec the expected device throughput?
>>>
>>>
>>>
>>> Pretty much, at least for the outer tracks on a drive.  I've seen
>>> ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
>>> SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
>>> disk subsystem is any sort of significant bottleneck.
>>
>>
>> Well, you might try changing your throttling settings on the OSDs.
>> ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
>> throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
>> and is used to limit the amount of client data in memory (i.e., messages
>> are included in this throttler for their entire lifetime, not just
>> while waiting for dispatch).
>>
>>
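For reference, lowering those settings in ceph.conf would look something like the sketch below. The option names are the ones quoted above; the values shown are just the stated defaults (100<<20 and 500<<20), so you would reduce them to experiment:

```ini
[osd]
# Defaults as quoted above; lower these to make the throttlers
# kick in sooner while experimenting.
ms dispatch throttle bytes = 104857600
osd max client bytes = 524288000
```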
>
> I've made a little progress isolating this.
>
> "osd client message size cap =  5000000" makes the stall
> completely reproducible (which also means I can reproduce
> it on two different network types, ethernet and IPoIB), and I
> am able to generate graphs of throttled/receive/process time
> for each request received by an OSD (see attached SVG plot).
>
> Such plots suggest to me my problem is caused by stalled
> receives.  Using debug ms = 30 on my OSDs turns up instances
> of this:
>
> osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/25000000
> osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader got front 128
> osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
> osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
> osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader read 1369231 of 4194304
> osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
> osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader read 1448 of 2825073
>
> which I take to mean that the reader thread sat in poll() for 56 secs, in
> this case.
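For what it's worth, the 56 s figure follows directly from the two adjacent reader timestamps quoted above (last partial read at 12:34:39.259060, next read attempt at 12:35:35.468156):

```python
from datetime import datetime

# Timestamps copied from the two adjacent reader log lines above.
t1 = datetime.strptime("12:34:39.259060", "%H:%M:%S.%f")
t2 = datetime.strptime("12:35:35.468156", "%H:%M:%S.%f")
gap = (t2 - t1).total_seconds()
print(round(gap, 3))  # -> 56.209, i.e. ~56 s with no data arriving
```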
>
> I was able to correlate such stalls with tcpdump output collected on
> clients.  Here's an example from another run:
>
> 15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
> 15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
> 1096144], length 31856
> 15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
> 1096144], length 31856
> 15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
> 15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
> 1096184], length 31856
> 15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
> 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
> 1096184], length 31856
> 15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
> 15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
> 1096224], length 31856
> 15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
> 1096224], length 31856
> 15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
> 15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
> 1096266], length 31856
> 15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
> 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
> 1096266], length 31856
> 15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr
> 1096266], length 1448
> 15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr
> 1096266], length 1448
> 15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr
> 1096266], length 1448
> 15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr
> 1096266], length 1448
> 15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr
> 1096266], length 1448
> 15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr
> 1096266], length 1448
> 15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr
> 1096266], length 1448
> 15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr
> 1096266], length 1448
> 15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1096266], length 1448
> 15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
> 15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 2896
> 15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr
> 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
> 15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
>
> So in this case the client retransmitted for ~2 minutes with no response
> from the OSD.  Note that during this time the client was talking to other
> OSDs on the same server.
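The "~2 minutes" figure can be checked directly against the retransmission timestamps in the excerpt; the intervals also double on each attempt, which is classic TCP exponential retransmit backoff and consistent with the OSD host ACKing nothing at all during the stall. A quick check of the quoted timestamps:

```python
from datetime import datetime

# Timestamps of the repeated retransmissions of segment
# 23822697:23824145, copied from the tcpdump excerpt above.
stamps = [
    "15:09:37.929946", "15:09:38.376961", "15:09:39.270947",
    "15:09:41.056943", "15:09:44.632946", "15:09:51.784947",
    "15:10:06.088945", "15:10:34.728951", "15:11:31.944946",
]
t = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
gaps = [(b - a).total_seconds() for a, b in zip(t, t[1:])]
print([round(g, 2) for g in gaps])
# -> [0.45, 0.89, 1.79, 3.58, 7.15, 14.3, 28.64, 57.22]
print(round(sum(gaps), 1))  # -> 114.0 seconds, i.e. ~2 minutes
```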
>
> I want to try turning off GSO/GRO on my interfaces, but then
> I think I need to post to netdev...
>
> -- Jim

The network trace output looks weird: either all of the packets between
23822697:23886409 were lost, or there is a bug in the networking stack.
The application should have no effect on the ACKs that should have been
generated. Even if you assume one or more of the frames on the wire in
the 23822697:23886409 range were somehow lost, you would still expect
to see duplicate ACKs carrying SACK blocks.
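To put a number on the pure-loss hypothesis: the silently unacknowledged range spans exactly 44 full-MSS segments (segment length 1448, as seen in the retransmissions in the trace), every one of which would have had to vanish without triggering a single duplicate ACK:

```python
# Sequence numbers and segment size taken from the tcpdump excerpt above.
gap_start, gap_end = 23822697, 23886409
mss = 1448                          # length of the retransmitted segment
gap_bytes = gap_end - gap_start
print(gap_bytes, gap_bytes // mss)  # -> 63712 44
```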

Is this a bunch of bare-metal servers or are these virtual? If you
could tap the network just upstream of the OSD servers, it would help
narrow down where to look. You could also just try turning off GRO/GSO,
as you suggest, to see if it makes a difference.
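If it helps, disabling those offloads is a one-liner with ethtool; "eth0" below is a placeholder for the actual interface name, and the change does not persist across reboots:

```shell
# Disable generic segmentation offload and generic receive offload
# on the NIC (requires root); substitute the real interface for eth0.
ethtool -K eth0 gso off gro off
# Verify the new offload state
ethtool -k eth0 | grep -E 'generic-(segmentation|receive)-offload'
```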

 Sridhar
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

