On Sat, 23 Dec 2017, Oleg Kolosov wrote:
> Hi,
> When Ceph selects an OSD to act as primary, it is the first shard in
> the PG (shard[0]). When working with the LRC plugin, this constraint
> greatly diminishes LRC's advantage. Ceph's LRC plugin allows grouping
> of OSDs so that recovery occurs between them (local groups). However,
> since all recovery has to go through the primary OSD, the recovery
> isn't really local. The recovery is local only when the primary OSD
> happens to be in the same local group as the failed OSD.
>
> I'm running LRC experiments on Ceph and measuring recovery by
> crashing a specific OSD. To bypass this limitation and make LRC truly
> effective in my experiments, I was wondering whether it is possible
> to manipulate the choice of the primary OSD.
>
> For example, if I have a bucket that contains OSDs 0-10, and I always
> kill one of them, is it possible to force the primary to also be one
> of OSDs 0-10?

Yes. This has been a longstanding todo item for LRC, but we haven't gotten around to doing it.

I think what's needed here is a change to the choose_acting logic in PG.cc that allows the EC plugin to weigh in on which primary it prefers. Some care will be needed to make sure this choice is reevaluated at the appropriate times (e.g., when backfill completes). (Or possibly it won't matter, since generally speaking which shard is primary doesn't matter after recovery.) We'd also need to consult the plugin only after all OSDs in the shard have a feature bit indicating they do the same; otherwise you can get into a loop where two OSDs keep handing primary back to each other.

I think the place to start is to add a method to the EC interface that allows the plugin to suggest a primary (or not, if it has no opinion), and to implement one for LRC that does the right thing. (Some unit tests here would be good!) The change to the choose_acting code would come next--that'll be trickier to get right, but we can help!
Thanks-
sage