On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>> So this affects just writes. Then I'm really not following the
>> reasoning behind the current behavior. Why would you want to wait for
>> the recovery of an object that you're about to clobber anyway? Naïvely
>> thinking an object like that would look like a candidate for
>> *eviction* from the recovery queue, not promotion to a higher
>> priority. Is this because the write could be a partial write, whereas
>> recovery would need to cover the full object?
>
> Generally most writes are partial writes - for RBD that's almost always
> the case - often writes are 512b or 4kb. It's also true for e.g. RGW
> bucket index updates (adding an omap key/value pair).

Sure, makes sense.

>> This is all under the disclaimer that I have no detailed
>> knowledge of the internals so this is all handwaving, but would a more
>> logical sequence of events not look roughly like this:
>>
>> 1. Are all replicas of the object available? If so, goto 4.
>> 2. Is the write a full object write? If so, goto 4.
>> 3. Read the local copy of the object, splice in the partial write,
>> making it a full object write.
>> 4. Evict the object from the recovery queue.
>> 5. Replicate the write.
>>
>> Forgive the silly use of goto; I'm wary of email clients mangling
>> indentation if I were to write this as a nested if block. :)
>
> This might be a useful optimization in some cases, but it would be
> rather complex to add to the recovery code. It may be worth considering
> at some point - same with deletes or other cases where the previous data
> is not needed.

Uh, yeah, waiting for an object to recover just so you can then delete
it, and blocking the delete I/O in the process, does also seem rather
strange.

I think we do agree that any instance of I/O being blocked upward of
30s in a VM is really, really bad, but the way you describe it, I see
little chance for a Ceph-deploying cloud operator to ever make a
compelling case to their customers that such a thing is unlikely to
happen. And I'm not even sure that a knee-jerk reaction to buy faster
hardware would be a very prudent investment: it's basically all just a
function of (a) how much I/O happens on the cluster during an outage,
and (b) how many nodes/OSDs are affected by that outage. Neither is
very predictable, and only (b) is something you have any influence
over in a cloud environment. Beyond a certain threshold of either (a)
or (b), the probability of *recovery* slowing a significant number of
VMs to a crawl approaches 1.

For an RGW bucket index pool, the amount of data is usually small
enough that you can sprinkle a few fast drives throughout your
cluster, create a ruleset with a separate root (pre-Luminous) or one
that uses device classes (Luminous and later), and then assign that
ruleset to the pool. But for RBD storage, that's usually not an
option, at least not at non-prohibitive cost.

Can you share your suggested workaround / mitigation strategy for
users who are currently being bitten by this behavior? If async
recovery lands in mimic with no chance of a backport, it'll be a while
before LTS users get any benefit out of it.

Cheers,
Florian
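
P.S.: In case it helps anyone following along, the device-class
variant I mentioned above looks roughly like this on Luminous (the
rule and pool names are just examples, adjust for your own zone and
setup):

  # CRUSH rule that only selects ssd-class OSDs, one replica per host
  ceph osd crush rule create-replicated ssd-only default host ssd

  # point the bucket index pool at that rule
  ceph osd pool set default.rgw.buckets.index crush_rule ssd-only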