I *think* that would work. Like I said though, most of the primary-side recovery work
still occurs in its own threadpool and does not use the prioritization scheme at all.
-Sam

On Thu, Oct 30, 2014 at 1:22 PM, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
> Hi Sam,
>
> October 30 2014 8:30 PM, "Samuel Just" <sam.just@xxxxxxxxxxx> wrote:
>> 1. Recovery is trickier; we probably aren't marking recovery ops with a sufficiently
>> high cost. Also, a bunch of the recovery cost (particularly primary-side backfill scans
>> and pushes) happens in the recovery_tp (something that this design would fix) rather
>> than in the OpWQ.
>>
>> 2. The OpWQ does have a separate queue for each priority level. For priorities above
>> 63, the queues are strict -- we always process higher queues until empty. For queues
>> 1-63, we try to weight by priority. I could add a "background" queue (<0?) concept
>> which only runs when the above queues are empty, but I worry about deferring scrub and
>> snap trimming for too long.
>
> Is there something preventing me from setting osd_client_op_priority to 64 -- for a
> test? That would more or less simulate the existence of a background queue, right?
> (I mean, if I could make client ops use enqueue_strict, that might help with recovery
> transparency...)
>
>> 3. The whole-PG lock is necessary basically because ops are ordered on a per-PG basis.
>>
>> 4. For a non-saturated cluster, the client IO queue (63) will tend to have the max
>> number of tokens when an IO comes in, and that IO will tend to be processed
>> immediately.
>
> Meaning Ceph will dispatch it immediately -- sure. I'm more worried about the IOs
> already ongoing or queued in the kernel.
>
>> I was mentioning that as a worst-case scenario. Scrub already won't even start on a PG
>> unless the OSD is relatively unloaded.
>
> In our case, scrub always waits until the max interval expires. So there is always
> load, yet always enough IOPS left to get the scrub done transparently.
>
> Actually, in case it wasn't obvious: my whole argument is based on experience with OSDs
> having a colocated journal and FileStore -- no SSD. With a dedicated (or at least
> separate) journal device, I imagine that most of the impact of scrubbing/trimming on
> write latency would drop to zero. Maybe it's not worth optimising Ceph for RBD clusters
> that didn't spend the money on fast journals.
>
> Cheers, Dan
>
>
>> -Sam
>>
>> On Thu, Oct 30, 2014 at 11:25 AM, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
>>> Hi Sam,
>>> A few comments.
>>>
>>> 1. My understanding is that your new approach would treat the scrub/trim ops similarly
>>> to (or even exactly like?) how we treat recovery ops today. Is that right? Currently,
>>> even with recovery op priority=1 and client op priority=63, recoveries are not even
>>> close to being transparent. It's anecdotal, but in our cluster we regularly have 30
>>> OSDs scrubbing (out of ~900) and it is latency-transparent. But if we have 10 OSDs
>>> backfilling, that increases our 4kB write latency from ~40ms to ~60-80ms.
>>>
>>> 2. I get the impression that you're worried that the idle IO priority class leaves us
>>> at risk of starving the disk thread completely. Except in the extreme situation of an
>>> OSD that is 100% saturated with client IO for a very long time, that shouldn't happen.
>>> Suppose the client IOs account for a 30% duty cycle on a disk; then scrubbing can use
>>> the other 70%. Regardless of which IO priority or queuing we use, the scrubber will
>>> get that 70% of the disk's time.
>>> But the important thing is that the client IOs need to be handled as close to real
>>> time as possible, whereas the scrubs can happen at any time. I don't believe
>>> Ceph-level op queuing (with a single queue!) is enough to ensure this -- we also need
>>> to tell the kernel the priority of those (concurrent) IOs so it can preempt the
>>> unimportant scrub reads with the urgent client IOs. My main point here is that
>>> (outside of the client IO saturation case) bytes scrubbed per second is more or less
>>> independent of IO priority!!!
>>>
>>> 3. Re: locks -- OK, I can't comment there. Perhaps those locks are the reason that
>>> scrubs are ever so slightly noticeable even when the IO priority of the disk thread is
>>> idle. But I contend that using separate threads -- or at least separate queues -- for
>>> the scrubs vs client ops is still a good idea. We can learn from how cfq prioritizes
>>> IOs, for example -- each of real time, best effort, and idle is implemented as a
>>> separate queue, and the be/idle queues are only processed if the rt/be queues are
>>> empty. (In testing I noticed that putting scrubs in be/7 (with client IOs left in
>>> be/4) is not nearly as effective as putting scrubs in the idle class -- what I
>>> conclude is that using a single queue for both scrub and client IOs is not effective
>>> at reducing latency.)
>>>
>>> BTW, is the current whole-PG lock a necessary result of separating the client and disk
>>> queues/threads? Perhaps that can be improved another way...
>>>
>>> 4. Lastly, are you designing mainly for the 24/7 saturation scenario? I'm not sure
>>> that's a good idea -- IMHO long-term saturation is a sign of a poorly dimensioned
>>> cluster. If OTOH a cluster is saturated for only 12 hours a day, I honestly don't want
>>> scrubs during those 12 hours; I'd rather they happen at night or whatever. I guess
>>> that is debatable, so you'd better have a configurable priority (which you have now!).
>>> For reference, btrfs scrub is idle by default [1], and zfs [2] operates similarly.
>>> (I can't confirm that md raid scrubs with idle priority, but based on experience it is
>>> transparent.) They all have knobs to increase the priority for admins with saturated
>>> servers. So I don't see why the Ceph default should not be idle (and I worry that
>>> you'd even remove the idle scrub capability).
>>>
>>> In any case, I just wanted to raise these issues so that you might consider them in
>>> your implementation. If I can be of any help at all in testing or giving feedback,
>>> please don't hesitate to let me know.
>>>
>>> Best Regards,
>>> Dan
>>>
>>> [1] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-scrub
>>> [2] http://serverfault.com/questions/499739/tuning-zfs-scrubbing-141kb-s-running-for-15-days
>>>
>>> October 30 2014 5:57 PM, "Samuel Just" <sam.just@xxxxxxxxxxx> wrote:
>>>> I think my main concern with the thread IO priority approach is that we hold locks
>>>> while performing those operations. Slowing them down will block any client operation
>>>> on the same PG until the operation completes -- probably not quite what we want. The
>>>> number of scrub ops in the queue should not have an impact; the intention is that we
>>>> do 63 "cost" worth of items out of the priority-63 queue for every 1 "cost" we do out
>>>> of the priority-1 queue. It's probably the case that 1-63 isn't enough range; it
>>>> might make sense to make the priority range finer (x10 or something).
>>>> You seem to be arguing for a priority of 0, but that would not guarantee progress for
>>>> snap removal or scrub, which would, I think, not be acceptable. We do want snap trims
>>>> and scrub to slow down client IO a little (when the cluster is actually saturated).
>>>> -Sam
>>>>
>>>> On Thu, Oct 30, 2014 at 3:59 AM, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
>>>>> Hi Sam,
>>>>> Sorry I missed the discussion last night about putting the trim/scrub operations in
>>>>> a priority opq alongside client ops. I had a question about the expected latency
>>>>> impact of this approach.
>>>>>
>>>>> I understand that you've previously validated that your priority queue manages to
>>>>> fairly apportion bandwidth (i.e. time) according to the relative op priorities. But
>>>>> how is the latency of client ops going to be affected when the opq is full of
>>>>> scrub/trim ops? E.g. if we have 10000 scrub ops in the queue with priority 1, how
>>>>> much extra latency do you expect a single incoming client op with priority 63 to
>>>>> have?
>>>>>
>>>>> We really need scrub and trim to be completely transparent (latency- and
>>>>> bandwidth-wise). I agree that your proposal sounds like a cleaner approach, but the
>>>>> current implementation is actually working transparently as far as I can tell.
>>>>>
>>>>> It's just not obvious to me that the current out-of-band (and backgrounded with idle
>>>>> IO priority) scrubber/trimmer is a less worthy approach than putting those ops
>>>>> in-band with the client IOs. With your proposed change, at best, I'd expect every
>>>>> client op to have to wait for at least one ongoing scrub op to complete. That could
>>>>> be tens of ms on an RBD cluster... bad news. So I think, at least, that we'll need
>>>>> to continue ionicing the scrub/trim ops so that the kernel will service the client
>>>>> IOs immediately instead of waiting.
>>>>>
>>>>> Your overall goal here seems to be to put a more fine-grained knob on the scrub/trim
>>>>> ops. But in practice we just want those to be invisible.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Cheers, Dan
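
A toy model may make the 63:1 behaviour Sam describes concrete. To be clear, this is not
the Ceph OpWQ/PrioritizedQueue source -- the class, names, and the credit-style refill
below are invented for illustration. It only assumes what the thread states: priorities
above 63 are strict, priorities 1-63 share in proportion to their priority, and each
dequeue charges an item's "cost".

// opq_sketch.cc -- toy model only; not the Ceph OpWQ/PrioritizedQueue code.
// Build: g++ -std=c++11 -o opq_sketch opq_sketch.cc && ./opq_sketch
#include <deque>
#include <iostream>
#include <iterator>
#include <map>
#include <string>

struct Op {
  std::string name;
  unsigned cost;                       // relative cost of servicing this item
};

class ToyOpQueue {
  // priority > 63: strict -- always drained first, highest priority wins
  std::map<unsigned, std::deque<Op> > strict_;
  // priority 1..63: each sub-queue earns credit in proportion to its priority,
  // so 63 gets ~63x the serviced cost of 1 without starving it completely
  struct SubQueue { std::deque<Op> q; long credit = 0; };
  std::map<unsigned, SubQueue> weighted_;

public:
  void enqueue_strict(unsigned prio, const Op &op) { strict_[prio].push_back(op); }
  void enqueue(unsigned prio, const Op &op)        { weighted_[prio].q.push_back(op); }
  bool empty() const { return strict_.empty() && weighted_.empty(); }

  Op dequeue() {                       // precondition: !empty()
    if (!strict_.empty()) {
      auto it = std::prev(strict_.end());          // highest strict priority
      Op op = it->second.front();
      it->second.pop_front();
      if (it->second.empty()) strict_.erase(it);
      return op;
    }
    for (;;) {
      // highest weighted priority that can afford its head item goes first
      for (auto it = weighted_.rbegin(); it != weighted_.rend(); ++it) {
        SubQueue &sq = it->second;
        if (sq.credit >= (long)sq.q.front().cost) {
          sq.credit -= sq.q.front().cost;
          Op op = sq.q.front();
          sq.q.pop_front();
          if (sq.q.empty()) weighted_.erase(std::next(it).base());
          return op;
        }
      }
      // nobody can afford their head item: grant another round of credit,
      // proportional to priority (the "weight by priority" step)
      for (auto &kv : weighted_) kv.second.credit += kv.first;
    }
  }
};

int main() {
  ToyOpQueue q;
  for (int i = 0; i < 1000; ++i) q.enqueue(1,  Op{"snap-trim", 1});  // background
  for (int i = 0; i < 1000; ++i) q.enqueue(63, Op{"client",    1});  // client IO
  int client = 0, trim = 0;
  for (int i = 0; i < 128; ++i)                    // service the first 128 items
    (q.dequeue().name == "client" ? client : trim)++;
  // expect roughly 63 client ops per snap-trim op while both queues are busy
  std::cout << "client=" << client << " snap-trim=" << trim << std::endl;
  return 0;
}

Note what this bounds and what it doesn't: over a window, the priority-63 queue gets
roughly 63x the serviced cost of the priority-1 queue, and widening the priority range
(Sam's x10 suggestion) would let that ratio be pushed further. But a client op that
arrives while a low-priority item is already being serviced still waits for that item to
finish, which is the per-op latency Dan is worried about and why he argues that
kernel-level preemption still matters.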
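
For the kernel-side knob Dan keeps coming back to, the mechanism is the ioprio_set(2)
syscall: putting the scrub/trim (disk) thread into cfq's idle class, which is what
ionice -c3 does from the shell. Below is a minimal, Linux-only sketch; glibc has no
wrapper for the syscall, so the IOPRIO_* constants are spelled out from the kernel ABI,
and the helper names are made up (this is not how the OSD itself exposes it).

// ioprio_idle.cc -- Linux-only sketch: put the calling thread's IO into the
// cfq "idle" class (equivalent to ionice -c3), so its reads are only
// dispatched when the device has no real-time/best-effort IO waiting.
// Build: g++ -std=c++11 -o ioprio_idle ioprio_idle.cc
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Constants from the kernel ABI (linux/ioprio.h); glibc ships no wrapper.
static const int IOPRIO_WHO_PROCESS = 1;   // target is a pid/tid; 0 = calling thread
static const int IOPRIO_CLASS_IDLE  = 3;   // classes: 1=rt, 2=best-effort, 3=idle
static const int IOPRIO_CLASS_SHIFT = 13;

static int ioprio_value(int io_class, int data) {
  return (io_class << IOPRIO_CLASS_SHIFT) | data;
}

// Call something like this at the top of the scrub/snap-trim worker thread.
static bool set_self_io_idle() {
  if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
              ioprio_value(IOPRIO_CLASS_IDLE, 0)) == -1) {
    std::fprintf(stderr, "ioprio_set failed: %s\n", std::strerror(errno));
    return false;
  }
  return true;
}

int main() {
  if (!set_self_io_idle())
    return 1;
  std::printf("this thread's IO is now in the idle class\n");
  // ... scrub reads issued from here on are idle-class ...
  return 0;
}

The caveat from Sam's mails still applies, though: if the idle-class thread is holding a
PG lock while it waits for its turn at the disk, client ops on that PG wait with it, so
the ionice approach only helps where scrub/trim IO stays off the client path.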