Also, I think you probably have to set that on the client side.
-Sam

On Thu, Oct 30, 2014 at 1:37 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
> I *think* that would work. Like I said though, most of the primary-side
> recovery work still occurs in its own threadpool and does not use the
> prioritization scheme at all.
> -Sam
>
> On Thu, Oct 30, 2014 at 1:22 PM, Dan van der Ster
> <daniel.vanderster@xxxxxxx> wrote:
>> Hi Sam,
>>
>> October 30 2014 8:30 PM, "Samuel Just" <sam.just@xxxxxxxxxxx> wrote:
>>> 1. Recovery is trickier; we probably aren't marking those ops with a
>>> sufficiently high cost. Also, a bunch of the recovery cost
>>> (particularly primary-side backfill scans and pushes) happens in the
>>> recovery_tp (something that this design would fix) rather than in the
>>> OpWQ.
>>>
>>> 2. The OpWQ does have a separate queue for each priority level. For
>>> priorities above 63, the queues are strict -- we always process
>>> higher queues until empty. For queues 1-63, we try to weight by
>>> priority. I could add a "background" queue (<0?) concept which only
>>> runs when the queues above it are empty, but I worry about deferring
>>> scrub and snap trimming for too long.
>>
>> Is there something preventing me from setting osd_client_op_priority
>> to 64 -- for a test? That would more or less simulate the existence of
>> a background queue, right? (I mean, if I could make client ops use
>> enqueue_strict, that might help with recovery transparency...)
>>
>>> 3. The whole-pg lock is necessary basically because ops are ordered
>>> on a pg basis.
>>>
>>> 4. For a non-saturated cluster, the client IO queue (63) will tend to
>>> have the max number of tokens when an IO comes in, and that IO will
>>> tend to be processed immediately.
>>
>> Meaning Ceph will dispatch it immediately -- sure. I'm more worried
>> about IOs ongoing or queued in the kernel.
>>
>>> I was mentioning that as a worst-case scenario. Scrub already won't
>>> even start on a pg unless the OSD is relatively unloaded.
>>
>> In our case, scrub always waits until the max interval expires. So
>> there is always load, yet always enough IOPS left to get the scrub
>> done transparently.
>>
>> Actually, in case it wasn't obvious: my whole argument is based on
>> experience with OSDs having a colocated journal and FileStore -- no
>> SSD. With a dedicated (or at least separate) journal device, I imagine
>> that most of the impact of scrubbing/trimming on write latency would
>> drop to zero. Maybe it's not worth optimising Ceph for RBD clusters
>> that didn't spend the money on fast journals.
>>
>> Cheers, Dan
>>
>>> -Sam
>>>
>>> On Thu, Oct 30, 2014 at 11:25 AM, Dan van der Ster
>>> <daniel.vanderster@xxxxxxx> wrote:
>>>> Hi Sam,
>>>> A few comments.
>>>>
>>>> 1. My understanding is that your new approach would treat the
>>>> scrub/trim ops similarly to (or even exactly like?) how we treat
>>>> recovery ops today. Is that right? Currently, even with recovery op
>>>> priority=1 and client op priority=63, recoveries are not even close
>>>> to being transparent. It's anecdotal, but in our cluster we regularly
>>>> have 30 OSDs scrubbing (out of ~900) and it is latency transparent.
>>>> But if we have 10 OSDs backfilling, that increases our 4kB write
>>>> latency from ~40ms to ~60-80ms.
>>>>
>>>> 2. I get the impression that you're worried that the idle IO priority
>>>> class leaves us at risk of starving the disk thread completely.
>>>> Except in extreme situations of an OSD that is 100% saturated with
>>>> client IO for a very long time, that shouldn't happen.
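>>>>
>>>> (To be concrete, by "the idle IO priority class" I mean cfq's idle
>>>> class, which is roughly what we run the disk thread in today --
>>>> something along the lines below. I'm quoting the option names from
>>>> memory, so treat this as a sketch and double-check them against your
>>>> build; failing that, one can ionice the disk threads by hand.)
>>>>
>>>>   # ceph.conf on the OSD hosts; only takes effect when the data
>>>>   # disks use the cfq elevator.
>>>>   # (option names from memory -- double-check against your version)
>>>>   [osd]
>>>>     osd disk thread ioprio class = idle
>>>>     osd disk thread ioprio priority = 7
>>>>
>>>>   # roughly the manual equivalent for a running OSD's disk thread:
>>>>   #   ionice -c 3 -p <tid of the ceph-osd disk thread>
>>>>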
>>>> Suppose the client IOs account for a 30% duty cycle of a disk; then
>>>> scrubbing can use the other 70%. Regardless of which IO priority or
>>>> queuing we do, the scrubber will get 70% of the time on the disk.
>>>> But the important thing is that the client IOs need to be handled as
>>>> close to real time as possible, whereas the scrubs can happen at any
>>>> time. I don't believe ceph-level op queuing (with a single queue!)
>>>> is enough to ensure this -- we also need to tell the kernel the
>>>> priority of those (concurrent) IOs so it can preempt the unimportant
>>>> scrub reads with the urgent client IOs. My main point here is that
>>>> (outside of the client IO saturation case), bytes scrubbed per
>>>> second is more or less independent of IO priority!!!
>>>>
>>>> 3. Re: locks -- OK, I can't comment there. Perhaps those locks are
>>>> the reason that scrubs are ever so slightly noticeable even when the
>>>> IO priority of the disk thread is idle. But I contend that using
>>>> separate threads -- or at least separate queues -- for the scrubs vs
>>>> client ops is still a good idea. We can learn from how cfq
>>>> prioritizes IOs, for example -- each of real time, best effort, and
>>>> idle is implemented as a separate queue, and the be/idle queues are
>>>> only processed if the rt/be queues are empty. (In testing I noticed
>>>> that putting scrubs in be/7 (with client IOs left in be/4) is not
>>>> nearly as effective as putting scrubs in the idle class -- what I
>>>> conclude is that using a single queue for both scrub and client IOs
>>>> is not effective at reducing latency.)
>>>>
>>>> BTW, is the current whole-PG lock a necessary result of separating
>>>> the client and disk queues/threads? Perhaps that can be improved
>>>> another way...
>>>>
>>>> 4. Lastly, are you designing mainly for the 24/7 saturation
>>>> scenario? I'm not sure that's a good idea -- IMHO long-term
>>>> saturation is a sign of a poorly dimensioned cluster. If OTOH a
>>>> cluster is saturated for only 12 hours a day, I honestly don't want
>>>> scrubs during those 12 hours; I'd rather they happen at night or
>>>> whatever. I guess that is debatable, so you'd better have a
>>>> configurable priority (which you have now!). For reference, btrfs
>>>> scrub is idle by default [1], and zfs [2] operates similarly. (I
>>>> can't confirm that md raid scrubs with idle priority, but based on
>>>> experience it is transparent.) They all have knobs to increase the
>>>> priority for admins with saturated servers. So I don't see why the
>>>> Ceph default should not be idle (and I worry that you'd even remove
>>>> the idle scrub capability).
>>>>
>>>> In any case, I just wanted to raise these issues so that you might
>>>> consider them in your implementation. If I can be of any help at all
>>>> in testing or giving feedback, please don't hesitate to let me know.
>>>>
>>>> Best Regards,
>>>> Dan
>>>>
>>>> [1] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-scrub
>>>> [2] http://serverfault.com/questions/499739/tuning-zfs-scrubbing-141kb-s-running-for-15-days
>>>>
>>>> October 30 2014 5:57 PM, "Samuel Just" <sam.just@xxxxxxxxxxx> wrote:
>>>>> I think my main concern with the thread io priority approach is
>>>>> that we hold locks while performing those operations. Slowing them
>>>>> down will block any client operation on the same pg until the
>>>>> operation completes -- probably not quite what we want.
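>>>>>
>>>>> On the latency question: the op queue weights its dequeues by
>>>>> priority, roughly along these lines (a deliberately simplified
>>>>> sketch of the idea, not the actual PrioritizedQueue code -- names
>>>>> and details here are illustrative only):
>>>>>
>>>>>   // Sketch: one FIFO per priority level. Levels above 63 are
>>>>>   // strict (always drained first, highest level wins). Levels
>>>>>   // 1..63 are weighted: each level earns tokens in proportion to
>>>>>   // its priority, so a priority-63 queue gets ~63x the "cost" of
>>>>>   // a priority-1 queue over time.
>>>>>   #include <iterator>
>>>>>   #include <map>
>>>>>   #include <queue>
>>>>>   #include <utility>
>>>>>
>>>>>   template <typename T>
>>>>>   struct SimplePrioQueue {
>>>>>     struct Level {
>>>>>       std::queue<std::pair<unsigned, T>> q;  // (cost, item)
>>>>>       int tokens = 0;
>>>>>     };
>>>>>     std::map<unsigned, Level> strict;    // priority > 63
>>>>>     std::map<unsigned, Level> weighted;  // priority 1..63
>>>>>
>>>>>     void enqueue(unsigned prio, unsigned cost, T item) {
>>>>>       (prio > 63 ? strict : weighted)[prio].q.push({cost, std::move(item)});
>>>>>     }
>>>>>
>>>>>     // Caller must ensure the queue is not empty.
>>>>>     T dequeue() {
>>>>>       if (!strict.empty()) {               // strict: highest level first
>>>>>         auto it = std::prev(strict.end());
>>>>>         T item = std::move(it->second.q.front().second);
>>>>>         it->second.q.pop();
>>>>>         if (it->second.q.empty()) strict.erase(it);
>>>>>         return item;
>>>>>       }
>>>>>       for (;;) {
>>>>>         // Serve the highest weighted level whose saved-up tokens
>>>>>         // cover the cost of the item at its head.
>>>>>         for (auto it = weighted.rbegin(); it != weighted.rend(); ++it) {
>>>>>           Level& l = it->second;
>>>>>           unsigned cost = l.q.front().first;
>>>>>           if (l.tokens >= (int)cost) {
>>>>>             l.tokens -= cost;
>>>>>             T item = std::move(l.q.front().second);
>>>>>             l.q.pop();
>>>>>             if (l.q.empty()) weighted.erase(std::next(it).base());
>>>>>             return item;
>>>>>           }
>>>>>         }
>>>>>         // Nobody could afford their head item: refill tokens in
>>>>>         // proportion to priority and try again.
>>>>>         for (auto& p : weighted)
>>>>>           p.second.tokens += p.first;
>>>>>       }
>>>>>     }
>>>>>   };
>>>>>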
>>>>> The number of scrub ops in the queue should not have an impact;
>>>>> the intention is that we do 63 "cost" of items out of the 63 queue
>>>>> for every 1 "cost" we do out of the priority-1 queue. It's probably
>>>>> the case that 1-63 isn't enough range; it might make sense to make
>>>>> the priority range finer (x10 or something). You seem to be arguing
>>>>> for a priority of 0, but that would not guarantee progress for snap
>>>>> removal or scrub, which would, I think, not be acceptable. We do
>>>>> want snap trims and scrubs to slow down client IO (when the cluster
>>>>> is actually saturated) a little.
>>>>> -Sam
>>>>>
>>>>> On Thu, Oct 30, 2014 at 3:59 AM, Dan van der Ster
>>>>> <daniel.vanderster@xxxxxxx> wrote:
>>>>>> Hi Sam,
>>>>>> Sorry I missed the discussion last night about putting the
>>>>>> trim/scrub operations in a priority opq alongside client ops. I
>>>>>> had a question about the expected latency impact of this approach.
>>>>>>
>>>>>> I understand that you've previously validated that your priority
>>>>>> queue manages to fairly apportion bandwidth (i.e. time) according
>>>>>> to the relative op priorities. But how is the latency of client
>>>>>> ops going to be affected when the opq is full of scrub/trim ops?
>>>>>> E.g. if we have 10000 scrub ops in the queue with priority 1, how
>>>>>> much extra latency do you expect a single incoming client op with
>>>>>> priority 63 to have?
>>>>>>
>>>>>> We really need scrub and trim to be completely transparent
>>>>>> (latency- and bandwidth-wise). I agree that your proposal sounds
>>>>>> like a cleaner approach, but the current implementation is
>>>>>> actually working transparently as far as I can tell.
>>>>>>
>>>>>> It's just not obvious to me that the current out-of-band (and
>>>>>> backgrounded with idle io priority) scrubber/trimmer is a less
>>>>>> worthy approach than putting those ops in-band with the client
>>>>>> IOs. With your proposed change, at best, I'd expect that every
>>>>>> client op is going to have to wait for at least one ongoing scrub
>>>>>> op to complete. That could be tens of ms on an RBD cluster... bad
>>>>>> news. So I think, at least, that we'll need to continue ionicing
>>>>>> the scrub/trim ops so that the kernel will service the client IOs
>>>>>> immediately instead of waiting.
>>>>>>
>>>>>> Your overall goal here seems to be to put a more fine-grained knob
>>>>>> on the scrub/trim ops. But in practice we just want those to be
>>>>>> invisible.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> Cheers, Dan
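>>>>>>
>>>>>> P.S. In case it's useful for comparing the two approaches: an easy
>>>>>> way to see the effect is to watch small-write latency while forcing
>>>>>> a scrub on one of the pgs hosted by the OSD under test -- something
>>>>>> along these lines (pool and pg ids are placeholders, adjust to
>>>>>> taste):
>>>>>>
>>>>>>   # steady stream of 4k writes, one at a time, reporting latency
>>>>>>   rados -p <pool> bench 120 write -b 4096 -t 1
>>>>>>
>>>>>>   # meanwhile, kick off a deep scrub on a pg owned by that OSD
>>>>>>   ceph pg deep-scrub <pgid>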