Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load


 



(resent because I forgot the list on my original reply)

On 02/01/2012 03:33 PM, Gregory Farnum wrote:
On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:
Hi,

FWIW, I've been trying to understand op delays under very heavy write
load, and have been working a little with the policy throttler in hopes of
using throttling delays to help track down which ops were backing up.
Without much success, unfortunately.

When I saw the wip-osd-op-tracking branch, I wondered if any of this
stuff might be helpful.  Here it is, just in case.

In general these patches are dumping information to the logs, and part
of the wip-osd-op-tracking branch is actually keeping track of most of
the message queueing wait times as part of the message itself
(although not the information about number of waiters and sleep/wake
seqs). I'm inclined to prefer that approach to log dumping.

I agree - I've just been using log dumping because I can extract
any relationships I can write a perl script to find :)  So far,
not too helpful.
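
For reference, here's a minimal sketch (hypothetical, not the actual
wip-osd-op-tracking code) of the kind of per-message tracking described
above: stamp the message when it starts waiting in the throttler and
again when it is admitted, so the wait interval travels with the op
instead of having to be reassembled from logs.

#include <chrono>
#include <iostream>

// Hypothetical per-message timing record; field and method names are
// illustrative only.
struct MessageTiming {
  std::chrono::steady_clock::time_point throttle_start;
  std::chrono::steady_clock::time_point throttle_end;

  void start_wait() { throttle_start = std::chrono::steady_clock::now(); }
  void end_wait()   { throttle_end   = std::chrono::steady_clock::now(); }

  // Time spent blocked in the throttler, in milliseconds.
  double throttle_wait_ms() const {
    return std::chrono::duration<double, std::milli>(
        throttle_end - throttle_start).count();
  }
};

int main() {
  MessageTiming t;
  t.start_wait();
  // ... message would block in the policy throttler here ...
  t.end_wait();
  std::cout << "throttle wait: " << t.throttle_wait_ms() << " ms\n";
}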

Are there any patches you recommend for merging? I'm a little curious
about the ordered wakeup one — do you have data about when that's a
problem?

I've been trying to push the client:osd ratio, and in my testbed
I can run up to 166 Linux clients. Right now I'm running them
against 48 OSDs.  The clients are on 1 Gb/s ethernet, and the OSDs
have a 10 Gb/s ethernet link for clients and another for the cluster.

During sustained write loads I see a factor-of-10 oscillation
in aggregate throughput, and during that time I see clients
stuck in the policy throttler for hundreds of seconds, with
the number of waiters equal to
  (number of clients) - (throttler limit) / (msg size)
If I do a histogram of throttler wait times, without the
ordered wakeup I see a handful of messages that wait an
extra couple hundred seconds.
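
For example, with purely illustrative numbers: 166 clients, a
hypothetical 100 MB throttler limit, and 4 MB writes would let
100/4 = 25 messages through the throttler at a time, leaving
roughly 166 - 25 = 141 clients waiting.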

I'm not sure what this will look like if my throughput
variations can be fixed.  But for our HPC loads I expect
we'll often see periods where the offered load is much higher
than the aggregate bandwidth of any system we can afford to
build, so ordered wakeup may be useful in such cases for
client fairness.
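
To illustrate the idea, here's a minimal standalone sketch (not
the actual Ceph Throttle code or the patch itself) of a byte
throttler that admits waiters strictly in arrival order, so a
late arrival can never jump ahead of an older waiter; it assumes
each request fits under the limit on its own.

#include <condition_variable>
#include <cstdint>
#include <list>
#include <mutex>

class OrderedThrottle {
public:
  explicit OrderedThrottle(uint64_t max) : max_(max) {}

  // Block until 'count' bytes fit under the limit; waiters are admitted FIFO.
  void get(uint64_t count) {
    std::unique_lock<std::mutex> l(lock_);
    Waiter w{count};
    waiters_.push_back(&w);
    w.cond.wait(l, [&] {
      // Only the head of the queue may be admitted, which is what
      // gives the ordered (fair) wakeup.
      return waiters_.front() == &w && current_ + count <= max_;
    });
    waiters_.pop_front();
    current_ += count;
    // The next waiter in line may also fit; let it re-check.
    if (!waiters_.empty())
      waiters_.front()->cond.notify_one();
  }

  // Release 'count' bytes and wake the oldest waiter, if any.
  void put(uint64_t count) {
    std::lock_guard<std::mutex> l(lock_);
    current_ -= count;
    if (!waiters_.empty())
      waiters_.front()->cond.notify_one();
  }

private:
  struct Waiter {
    uint64_t count;                 // bytes this waiter asked for
    std::condition_variable cond;   // woken only when it reaches the head
  };

  std::mutex lock_;
  uint64_t max_;
  uint64_t current_ = 0;
  std::list<Waiter*> waiters_;      // waiters in arrival order
};

The per-waiter condition variable plus FIFO queue is just one way
to get the ordering; the property that matters is that a release
always wakes the longest-waiting client first, which keeps the tail
of the wait-time histogram from growing.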

So I'd recommend the ordered wakeup patch if you don't
see any downsides.

Sorry for the noise on the others - mostly I just wanted
to share the sort of things I've been looking at.  I'll
be learning to use your new stuff soon...

-- Jim

-Greg





