Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:
> I found 0 instances of "waiting for commit" in all my OSD logs for my last
> run.
>
> So I never waited on the journal?

Looks like it. Interesting.


>>> So far I'm looking at two behaviours I've noticed that seem anomalous to
>>> me.
>>>
>>> One is that I instrumented ms_dispatch(), and I see it take
>>> a half-second or more several hundred times, out of several
>>> thousand messages.  Is that expected?
>>
>>
>> How did you instrument it? If you wrapped the whole function it's
>> possible that those longer runs are actually chewing through several
>> messages that had to get waitlisted for some reason previously.
>> (That's the call to do_waiters().)
>
>
> Yep, I wrapped the whole function, and also instrumented taking osd_lock
> while I was there.  About half the time that ms_dispatch() takes more than
> 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
> dispatch threads, one for ops and one for rep_ops, right?  So one's
> waiting on the other?
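
(Just so I'm picturing the same thing: I assume the wrapper looks
something like the sketch below -- illustrative only, with a made-up
now_secs() helper standing in for whatever clock call you used, not
your actual patch.)

  // Sketch only, not the real patch: time the whole function and the
  // osd_lock acquisition separately.
  bool OSD::ms_dispatch(Message *m)
  {
    double t0 = now_secs();
    osd_lock.Lock();
    double t_lock = now_secs();   // t_lock - t0 = time spent waiting on osd_lock

    // ... the existing dispatch work, including do_waiters() ...

    osd_lock.Unlock();
    double t1 = now_secs();
    if (t1 - t0 > 0.5)            // only log the slow ones
      derr << "ms_dispatch took " << (t1 - t0) << "s, "
           << (t_lock - t0) << "s of it acquiring osd_lock" << dendl;
    return true;
  }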

There's just one main dispatcher; there's no split between ops and rep_ops.
The reason for that "dispatch_running" flag is that if there are requests
waiting, the tick() function will run through them itself when the
messenger dispatch thread is currently idle.
But it is possible for the Messenger to try and dispatch, and for that
to be blocked while some amount of (usually trivial) work is being
done by a different thread, yes. I don't think we've ever observed it
being a problem for anything other than updating OSD maps, though...
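
In code terms the shape is roughly this (paraphrased from memory, not
the literal source):

  // ms_dispatch() sets dispatch_running (under osd_lock) around its work
  // and calls do_waiters() itself; tick() only drains the waitlist when
  // the dispatch thread is idle.
  void OSD::tick()
  {
    // ... periodic work, all done while holding osd_lock ...
    if (!dispatch_running)     // messenger dispatch thread is idle,
      do_waiters();            // so run through any waitlisted requests here
  }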


>>> Another is that once a message receive starts, I see ~50 messages
>>> that take tens of seconds to receive, when the nominal receive time is
>>> a half-second or less.  I'm in the process of tooling up to collect
>>> tcpdump data on all my clients to try to catch what is going on with
>>> that.
>>
>>
>> Again, how are you instrumenting that?
>
>
> I post-process the logs, looking at the time difference between
> "reader got .* policy throttler" and "reader got .* osd_op(client".

I guess the logging output must have changed a bit at some point (or
was that one of your patches?). Current master has "reader wants", not
"reader got", for the policy throttler. (I just got a little confused
when checking the code.)
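
Either way, the "wants"/"got" pair brackets the policy throttler get():
the delta between those two lines is time spent blocked on the byte cap,
which only drains as earlier messages finish and put their bytes back.
A toy version, just to make the semantics concrete (this is not the real
Throttle class):

  #include <condition_variable>
  #include <cstdint>
  #include <mutex>

  class ToyThrottle {
    std::mutex lk;
    std::condition_variable cv;
    uint64_t max, cur;
  public:
    explicit ToyThrottle(uint64_t m) : max(m), cur(0) {}
    void get(uint64_t c) {             // "wants c" ... (maybe block) ... "got c"
      std::unique_lock<std::mutex> l(lk);
      cv.wait(l, [&] { return cur + c <= max; });
      cur += c;                        // e.g. 48809510/50000000 in your log
    }
    void put(uint64_t c) {             // only once the op has been handled
      std::lock_guard<std::mutex> l(lk);
      cur -= c;
      cv.notify_all();
    }
  };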

> When I find a candidate message, I grep the log for just that reader thread,
> and see, e.g., this:
>
> osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000
> seq 828/828 waiters 157/149 for src client.4301 tid=247
> osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
> 41944358/66666666
> osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader got front 128
> osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
> osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
> osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
> osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172
>
> Note the ~2 minute delay (and ~300,000 lines of logging) between the first
> and second reads.
>
> During that time 129 sockets were processed - what makes sd=215 special?

Hrm. Well, you can try turning up the messenger debugging to 30 and
taking advantage of the "reader reading" / "reader read" pair right
around tcp_read_nonblocking.
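
If it helps, a little filter along these lines will pair those up and
print the gaps, i.e. time spent purely in the socket read (a sketch --
adjust the substrings to whatever the debug lines actually say in your
tree, and feed it one reader thread's lines, e.g.
"grep 7fe5c9099700 osd.0.log | ./pair_reads"):

  #include <cstdio>
  #include <iostream>
  #include <string>

  // Pull "HH:MM:SS.frac" out of a log line; returns seconds since midnight.
  static double stamp(const std::string &line)
  {
    size_t sp = line.find(' ');       // the date field ends at the first space
    int h, m;
    double s;
    if (sp == std::string::npos ||
        sscanf(line.c_str() + sp + 1, "%d:%d:%lf", &h, &m, &s) != 3)
      return -1;
    return h * 3600 + m * 60 + s;
  }

  int main()
  {
    std::string line;
    double reading_at = -1;
    while (std::getline(std::cin, line)) {
      double t = stamp(line);
      if (t < 0)
        continue;
      if (line.find("reader reading") != std::string::npos) {
        reading_at = t;               // start of a nonblocking read
      } else if (line.find("reader read") != std::string::npos &&
                 reading_at >= 0) {
        // only plain "reader read" lines get here; "reader reading" lines
        // were already caught by the branch above
        if (t - reading_at > 1.0)     // flag anything over a second
          std::cout << (t - reading_at) << "s gap before: " << line << "\n";
        reading_at = -1;
      }
    }
    return 0;
  }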

> I've added tracepoints in my client kernel try_write(), and nothing seems
> unusual (that's with running the patch to ceph_write_space() I posted
> earlier):
>
>     kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0
> tid 179 seq 3 sent 4194304
>     kworker/0:2-1790  [000]  1543.200901: ceph_prepare_write_msg: peer osd0
> tid 207 seq 4 sent 0
*snip*
>     kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0
> tid 207 seq 4 sent 4194304
>
> There's a 25 second gap at 1543.236256, but nothing like the
> 100 second gap in the reader.
>
> Hence, tcpdump seems like a good idea?

You do bring us interesting problems! Let us know what info you come up with.

Oh, and I keep forgetting to ask: what does the write workload look
like? At first I assumed this was a CephFS workload, but given that
you're changing max message sizes and have half-second writes, you're
probably doing something else?
-Greg

