On Thu, Sep 14, 2017 at 2:47 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 09/13/2017 03:40 AM, Florian Haas wrote:
>>
>> So we have a client that is talking to OSD 30. OSD 30 was never down;
>> OSD 17 was. OSD 30 is also the preferred primary for this PG (via
>> primary affinity). The OSD now says that
>>
>> - it does itself have a copy of the object,
>> - so does OSD 94,
>> - but that the object is "also" missing on OSD 17.
>>
>> So I'd like to ask firstly: what does "also" mean here?
>
> Nothing, it's just included in all the log messages in the loop looking
> at whether objects are missing.

OK, maybe the "also" can be removed to reduce potential confusion?

>> Secondly, if the local copy is current, and we have no fewer than
>> min_size objects, and recovery is meant to be a background operation,
>> then why is the recovery in the I/O path here? Specifically, why is
>> that the case on a write, where the object is being modified anyway,
>> and the modification then needs to be replicated out to OSDs 17 and 94?
>
> Mainly because recovery pre-dated the concept of min_size. We realized
> this was a problem during luminous development, but did not complete
> the fix for it in time for luminous. Nice analysis of the issue though!

Well, I wasn't quite done with the analysis yet; I just wanted to check
whether my initial interpretation was correct. So, here's what this
behavior causes, if I understand things correctly:

- We have a bunch of objects that need to be recovered onto the
  just-returned OSD(s).
- Clients access some of these objects while they are pending recovery.
- When that happens, recovery of those objects gets reprioritized.
  Simplistically speaking, they get to jump the queue.

Did I get that right?
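Just to make sure we're talking about the same thing, here's a toy
sketch of my mental model in Python. Everything in it (RecoveryQueue,
client_write, recover_one) is made up for illustration and is
emphatically not Ceph code:

from collections import deque

class RecoveryQueue:
    """Toy model, not Ceph code: objects missing on a just-returned OSD
    sit in a background recovery queue, but a client op that touches one
    of them makes it jump the queue, and the op blocks until that object
    has been recovered."""

    def __init__(self, missing_objects):
        self.pending = deque(missing_objects)  # background recovery order
        self.blocked_ops = []                  # client ops waiting on recovery

    def client_write(self, obj, op):
        if obj in self.pending:
            # Degraded object: its recovery is reprioritized and the write
            # waits for it, i.e. recovery ends up in the client I/O path.
            self.pending.remove(obj)
            self.pending.appendleft(obj)
            self.blocked_ops.append((obj, op))
        else:
            op()  # fully recovered object: the write proceeds immediately

    def recover_one(self):
        # One step of background recovery: take the next object off the
        # queue (in real life, push it to the peer OSDs) and release any
        # client ops that were blocked on it.
        if not self.pending:
            return
        obj = self.pending.popleft()
        for blocked_obj, op in list(self.blocked_ops):
            if blocked_obj == obj:
                op()
        self.blocked_ops = [b for b in self.blocked_ops if b[0] != obj]

If most of the pending objects see a client_write() before
recover_one() gets around to them, practically everything ends up at
the "front" of the queue, which is where I'm going below.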
If so, let's zoom out a bit now and look at RBD's most frequent use
case, virtualization. While the OSDs were down, the RADOS objects that
were created or modified would have come from whatever virtual machines
were running at that time. When the OSDs return, there's a very good
chance that those same VMs are still running. While they're running,
they of course continue to access the same RBDs, and are quite likely
to access the same *data* as before on those RBDs — data that now needs
to be recovered.

So that means that there is likely a solid majority of to-be-recovered
RADOS objects that needs to be moved to the front of the queue at some
point during the recovery. Which, in the extreme, renders the
prioritization useless: if I have, say, 1,000 objects that need to be
recovered but 998 have been moved to the "front" of the queue, the
queue is rather meaningless.

Again, on the assumption that this correctly describes what Ceph
currently does, do you have suggestions for how to mitigate this? It
seems to me that the only actual remedy for this issue in
Jewel/Luminous would be to not access objects pending recovery, but as
just pointed out, that's a rather unrealistic goal.

> I'm working on the fix (aka async recovery) for mimic. This won't be
> backportable unfortunately.

OK — is there any more information on this that is available and
current? A quick search turned up a Trello card
(https://trello.com/c/jlJL5fPR/199-osd-async-recovery), a mailing list
post (https://www.spinics.net/lists/ceph-users/msg37127.html), a slide
deck (https://www.slideshare.net/jupiturliu/ceph-recovery-improvement-v02),
a stale PR (https://github.com/ceph/ceph/pull/11918), and an inactive
branch (https://github.com/jdurgin/ceph/commits/wip-async-recovery),
but I was hoping for something a little more detailed. Thanks in
advance for any additional insight you can share here!

Cheers,
Florian