I *think* that would work. Like I said though, most of the primary-side recovery work
still occurs in its own threadpool and does not use the prioritization scheme at all.
-Sam

On Thu, Oct 30, 2014 at 1:22 PM, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
> Hi Sam,
>
> October 30 2014 8:30 PM, "Samuel Just" <sam.just@xxxxxxxxxxx> wrote:
>> 1. Recovery is trickier; we probably aren't marking recovery ops with a sufficiently
>> high cost. Also, a bunch of the recovery cost (particularly primary-side backfill scans
>> and pushes) happens in the recovery_tp (something that this design would fix) rather
>> than in the OpWQ.
>>
>> 2. The OpWQ does have a separate queue for each priority level. For priorities above
>> 63, the queues are strict -- we always process higher queues until empty. For queues
>> 1-63, we try to weight by priority. I could add a "background" queue (<0?) concept
>> which only runs when the above queues are empty, but I worry about deferring scrub and
>> snap trimming for too long.
>
> Is there something preventing me from setting osd_client_op_priority to 64 -- for a
> test? That would more or less simulate the existence of a background queue, right?
> (I mean, if I could make client ops use enqueue_strict, that might help with recovery
> transparency...)
>
>> 3. The whole-PG lock is necessary basically because ops are ordered on a per-PG basis.
>>
>> 4. For a non-saturated cluster, the client IO queue (63) will tend to have the max
>> number of tokens when an IO comes in, and that IO will tend to be processed
>> immediately.
>
> Meaning Ceph will dispatch it immediately -- sure. I'm more worried about the IOs
> already ongoing or queued in the kernel.
>
>> I was mentioning that as a worst-case scenario. Scrub already won't even start on a PG
>> unless the OSD is relatively unloaded.
>
> In our case, scrub always waits until the max interval expires. So there is always
> load, yet always enough IOPS left to get the scrub done transparently.
>
> Actually, in case it wasn't obvious: my whole argument is based on experience with OSDs
> having a colocated journal and FileStore -- no SSD. With a dedicated (or at least
> separate) journal device, I imagine that most of the impact of scrubbing/trimming on
> write latency would drop to zero. Maybe it's not worth optimising Ceph for RBD clusters
> that didn't spend the money on fast journals.
>
> Cheers, Dan
>
>
>> -Sam
>>
>> On Thu, Oct 30, 2014 at 11:25 AM, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
>>> Hi Sam,
>>> A few comments.
>>>
>>> 1. My understanding is that your new approach would treat the scrub/trim ops similarly
>>> to (or even exactly like?) how we treat recovery ops today. Is that right? Currently,
>>> even with recovery op priority=1 and client op priority=63, recoveries are not even
>>> close to being transparent. It's anecdotal, but in our cluster we regularly have 30
>>> OSDs scrubbing (out of ~900) and it is latency-transparent. But if we have 10 OSDs
>>> backfilling, that increases our 4kB write latency from ~40ms to ~60-80ms.
>>>
>>> 2. I get the impression that you're worried that the idle IO priority class leaves us
>>> at risk of starving the disk thread completely. Except in the extreme situation of an
>>> OSD that is 100% saturated with client IO for a very long time, that shouldn't happen.
>>> Suppose the client IOs account for a 30% duty cycle on a disk; then scrubbing can use
>>> the other 70%. Regardless of which IO priority or queuing we use, the scrubber will
>>> get that 70% of the disk's time.
>>> But the important thing is that the client IOs need to be handled as close to real
>>> time as possible, whereas the scrubs can happen at any time. I don't believe
>>> Ceph-level op queuing (with a single queue!) is enough to ensure this -- we also need
>>> to tell the kernel the priority of those (concurrent) IOs so it can preempt the
>>> unimportant scrub reads with the urgent client IOs. My main point here is that
>>> (outside of the client IO saturation case) bytes scrubbed per second is more or less
>>> independent of IO priority!!!
>>>
>>> 3. Re: locks -- OK, I can't comment there. Perhaps those locks are the reason that
>>> scrubs are ever so slightly noticeable even when the IO priority of the disk thread is
>>> idle. But I contend that using separate threads -- or at least separate queues -- for
>>> the scrubs vs client ops is still a good idea. We can learn from how cfq prioritizes
>>> IOs, for example -- each of real time, best effort, and idle is implemented as a
>>> separate queue, and the be/idle queues are only processed if the rt/be queues are
>>> empty. (In testing I noticed that putting scrubs in be/7 (with client IOs left in
>>> be/4) is not nearly as effective as putting scrubs in the idle class -- what I
>>> conclude is that using a single queue for both scrub and client IOs is not effective
>>> at reducing latency.)
>>>
>>> BTW, is the current whole-PG lock a necessary result of separating the client and disk
>>> queues/threads? Perhaps that can be improved another way...
>>>
>>> 4. Lastly, are you designing mainly for the 24/7 saturation scenario? I'm not sure
>>> that's a good idea -- IMHO long-term saturation is a sign of a poorly dimensioned
>>> cluster. If OTOH a cluster is saturated for only 12 hours a day, I honestly don't want
>>> scrubs during those 12 hours; I'd rather they happen at night or whatever. I guess
>>> that is debatable, so you'd better have a configurable priority (which you have now!).
>>> For reference, btrfs scrub is idle by default [1], and zfs [2] operates similarly.
>>> (I can't confirm that md raid scrubs with idle priority, but based on experience it is
>>> transparent.) They all have knobs to increase the priority for admins with saturated
>>> servers. So I don't see why the Ceph default should not be idle (and I worry that
>>> you'd even remove the idle scrub capability).
>>>
>>> In any case, I just wanted to raise these issues so that you might consider them in
>>> your implementation. If I can be of any help at all in testing or giving feedback,
>>> please don't hesitate to let me know.
>>>
>>> Best Regards,
>>> Dan
>>>
>>> [1] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-scrub
>>> [2] http://serverfault.com/questions/499739/tuning-zfs-scrubbing-141kb-s-running-for-15-days
>>>
>>> October 30 2014 5:57 PM, "Samuel Just" <sam.just@xxxxxxxxxxx> wrote:
>>>> I think my main concern with the thread IO priority approach is that we hold locks
>>>> while performing those operations. Slowing them down will block any client operation
>>>> on the same PG until the operation completes -- probably not quite what we want. The
>>>> number of scrub ops in the queue should not have an impact; the intention is that we
>>>> do 63 "cost" worth of items out of the priority-63 queue for every 1 "cost" we do out
>>>> of the priority-1 queue. It's probably the case that 1-63 isn't enough range; it
>>>> might make sense to make the priority range finer (x10 or something).
>>>> You seem to be arguing for a priority of 0, but that would not guarantee progress for
>>>> snap removal or scrub, which would, I think, not be acceptable. We do want snap trims
>>>> and scrub to slow down client IO a little (when the cluster is actually saturated).
>>>> -Sam
>>>>
>>>> On Thu, Oct 30, 2014 at 3:59 AM, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
>>>>> Hi Sam,
>>>>> Sorry I missed the discussion last night about putting the trim/scrub operations in
>>>>> a priority opq alongside client ops. I had a question about the expected latency
>>>>> impact of this approach.
>>>>>
>>>>> I understand that you've previously validated that your priority queue manages to
>>>>> fairly apportion bandwidth (i.e. time) according to the relative op priorities. But
>>>>> how is the latency of client ops going to be affected when the opq is full of
>>>>> scrub/trim ops? E.g. if we have 10000 scrub ops in the queue with priority 1, how
>>>>> much extra latency do you expect a single incoming client op with priority 63 to
>>>>> have?
>>>>>
>>>>> We really need scrub and trim to be completely transparent (latency- and
>>>>> bandwidth-wise). I agree that your proposal sounds like a cleaner approach, but the
>>>>> current implementation is actually working transparently as far as I can tell.
>>>>>
>>>>> It's just not obvious to me that the current out-of-band (and backgrounded with idle
>>>>> IO priority) scrubber/trimmer is a less worthy approach than putting those ops
>>>>> in-band with the client IOs. With your proposed change, at best, I'd expect every
>>>>> client op to have to wait for at least one ongoing scrub op to complete. That could
>>>>> be tens of ms on an RBD cluster... bad news. So I think, at least, that we'll need
>>>>> to continue ionicing the scrub/trim ops so that the kernel will service the client
>>>>> IOs immediately instead of waiting.
>>>>>
>>>>> Your overall goal here seems to be to put a more fine-grained knob on the scrub/trim
>>>>> ops. But in practice we just want those to be invisible.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Cheers, Dan
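
A toy model may make the 63:1 behaviour Sam describes concrete. To be clear, this is not
the Ceph OpWQ/PrioritizedQueue source -- the class, names, and the credit-style refill
below are invented for illustration. It only assumes what the thread states: priorities
above 63 are strict, priorities 1-63 share in proportion to their priority, and each
dequeue charges an item's "cost".

// opq_sketch.cc -- toy model only; not the Ceph OpWQ/PrioritizedQueue code.
// Build: g++ -std=c++11 -o opq_sketch opq_sketch.cc && ./opq_sketch
#include <deque>
#include <iostream>
#include <iterator>
#include <map>
#include <string>

struct Op {
  std::string name;
  unsigned cost;                       // relative cost of servicing this item
};

class ToyOpQueue {
  // priority > 63: strict -- always drained first, highest priority wins
  std::map<unsigned, std::deque<Op> > strict_;
  // priority 1..63: each sub-queue earns credit in proportion to its priority,
  // so 63 gets ~63x the serviced cost of 1 without starving it completely
  struct SubQueue { std::deque<Op> q; long credit = 0; };
  std::map<unsigned, SubQueue> weighted_;

public:
  void enqueue_strict(unsigned prio, const Op &op) { strict_[prio].push_back(op); }
  void enqueue(unsigned prio, const Op &op)        { weighted_[prio].q.push_back(op); }
  bool empty() const { return strict_.empty() && weighted_.empty(); }

  Op dequeue() {                       // precondition: !empty()
    if (!strict_.empty()) {
      auto it = std::prev(strict_.end());          // highest strict priority
      Op op = it->second.front();
      it->second.pop_front();
      if (it->second.empty()) strict_.erase(it);
      return op;
    }
    for (;;) {
      // highest weighted priority that can afford its head item goes first
      for (auto it = weighted_.rbegin(); it != weighted_.rend(); ++it) {
        SubQueue &sq = it->second;
        if (sq.credit >= (long)sq.q.front().cost) {
          sq.credit -= sq.q.front().cost;
          Op op = sq.q.front();
          sq.q.pop_front();
          if (sq.q.empty()) weighted_.erase(std::next(it).base());
          return op;
        }
      }
      // nobody can afford their head item: grant another round of credit,
      // proportional to priority (the "weight by priority" step)
      for (auto &kv : weighted_) kv.second.credit += kv.first;
    }
  }
};

int main() {
  ToyOpQueue q;
  for (int i = 0; i < 1000; ++i) q.enqueue(1,  Op{"snap-trim", 1});  // background
  for (int i = 0; i < 1000; ++i) q.enqueue(63, Op{"client",    1});  // client IO
  int client = 0, trim = 0;
  for (int i = 0; i < 128; ++i)                    // service the first 128 items
    (q.dequeue().name == "client" ? client : trim)++;
  // expect roughly 63 client ops per snap-trim op while both queues are busy
  std::cout << "client=" << client << " snap-trim=" << trim << std::endl;
  return 0;
}

Note what this bounds and what it doesn't: over a window, the priority-63 queue gets
roughly 63x the serviced cost of the priority-1 queue, and widening the priority range
(Sam's x10 suggestion) would let that ratio be pushed further. But a client op that
arrives while a low-priority item is already being serviced still waits for that item to
finish, which is the per-op latency Dan is worried about and why he argues that
kernel-level preemption still matters.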
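
For the kernel-side knob Dan keeps coming back to, the mechanism is the ioprio_set(2)
syscall: putting the scrub/trim (disk) thread into cfq's idle class, which is what
ionice -c3 does from the shell. Below is a minimal, Linux-only sketch; glibc has no
wrapper for the syscall, so the IOPRIO_* constants are spelled out from the kernel ABI,
and the helper names are made up (this is not how the OSD itself exposes it).

// ioprio_idle.cc -- Linux-only sketch: put the calling thread's IO into the
// cfq "idle" class (equivalent to ionice -c3), so its reads are only
// dispatched when the device has no real-time/best-effort IO waiting.
// Build: g++ -std=c++11 -o ioprio_idle ioprio_idle.cc
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Constants from the kernel ABI (linux/ioprio.h); glibc ships no wrapper.
static const int IOPRIO_WHO_PROCESS = 1;   // target is a pid/tid; 0 = calling thread
static const int IOPRIO_CLASS_IDLE  = 3;   // classes: 1=rt, 2=best-effort, 3=idle
static const int IOPRIO_CLASS_SHIFT = 13;

static int ioprio_value(int io_class, int data) {
  return (io_class << IOPRIO_CLASS_SHIFT) | data;
}

// Call something like this at the top of the scrub/snap-trim worker thread.
static bool set_self_io_idle() {
  if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
              ioprio_value(IOPRIO_CLASS_IDLE, 0)) == -1) {
    std::fprintf(stderr, "ioprio_set failed: %s\n", std::strerror(errno));
    return false;
  }
  return true;
}

int main() {
  if (!set_self_io_idle())
    return 1;
  std::printf("this thread's IO is now in the idle class\n");
  // ... scrub reads issued from here on are idle-class ...
  return 0;
}

The caveat from Sam's mails still applies, though: if the idle-class thread is holding a
PG lock while it waits for its turn at the disk, client ops on that PG wait with it, so
the ionice approach only helps where scrub/trim IO stays off the client path.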