Hi,

This problem also occurred in my customer's environment, so I would like to get it fixed. To facilitate the discussion, I will restate the problem and the current solution. (Mykola has already described the idea behind the solution. I apologize if anything here differs from what Mykola intended.)

In master:

Problem: The primary OSD crashes in a situation where it does not need to. (I think this is a bug.)
Solution: Remove the ceph_assert from the code below.

---------------------------------------------------------------
diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 626e8ccefb..12956424bd 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -13079,7 +13079,6 @@ void PrimaryLogPG::_clear_recovery_state()
   last_backfill_started = hobject_t();
   set<hobject_t>::iterator i = backfills_in_flight.begin();
   while (i != backfills_in_flight.end()) {
-    ceph_assert(recovering.count(*i));
     backfills_in_flight.erase(i++);
   }
---------------------------------------------------------------

The reason is as follows.

- The above code assumes that every object in backfills_in_flight is also in recovering.
- However, in the current implementation of on_failed_pull, an unfound object is removed from recovering unconditionally[2], while in the non-primary OSD case it is left in backfills_in_flight[1].

Therefore, the ceph_assert above does not match the current implementation of on_failed_pull. I think this ceph_assert should be removed, but I would like to hear opinions from the community.

[1]: https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/src/osd/PrimaryLogPG.cc#L12453-L12456
[2]: https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/src/osd/PrimaryLogPG.cc#L12439

In nautilus:

Problem: The backfill_unfound state is cleared when the OSD is restarted. (This is also a bug.) This can lead a user to mistakenly think the problem has been solved, and cause unexpected trouble later.
Solution: Leave unfound objects in backfills_in_flight in the non-primary OSD case, as the on_failed_pull described above does.

There is the following commit[3], but since its changes are wide-ranging, I think only the minimum change needed to solve the problem should be committed directly to nautilus.

[3]: https://github.com/ceph/ceph/commit/8a8947d2a32d6390cb17099398e7f2212660c9a1

In addition, once this problem is solved, the primary OSD crash described above will start to occur, so the master commit described above also needs to be backported.

I am planning to send PRs next week, so please let me know before then if you have any opinions.

--
Jin
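
P.S. To make the master-side reasoning easier to follow, here is a minimal standalone C++ sketch of the mismatch. It is only a toy model and not Ceph code: ToyPG, ObjectId, start_backfill, invariant_holds and the from_primary flag are names made up for illustration, std::string stands in for hobject_t, and the behaviour of on_failed_pull is modelled only as far as it is described in this mail.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for hobject_t (the real type is much richer).
using ObjectId = std::string;

// Toy model of the two containers discussed in this mail.
// This is NOT the real PrimaryLogPG; all member names except
// recovering and backfills_in_flight are invented for illustration.
struct ToyPG {
  std::map<ObjectId, int> recovering;      // objects with recovery in progress
  std::set<ObjectId> backfills_in_flight;  // objects currently being backfilled

  // A backfill starts: the object is tracked in both containers.
  void start_backfill(const ObjectId& oid) {
    recovering[oid] = 0;
    backfills_in_flight.insert(oid);
  }

  // Models the behaviour described for on_failed_pull: the object is always
  // removed from recovering, but it is removed from backfills_in_flight only
  // in the primary case (from_primary is an assumed simplification).
  void on_failed_pull(const ObjectId& oid, bool from_primary) {
    recovering.erase(oid);
    if (from_primary) {
      backfills_in_flight.erase(oid);
    }
  }

  // The condition the removed ceph_assert checked: every object in
  // backfills_in_flight must also be in recovering.
  bool invariant_holds() const {
    for (const auto& oid : backfills_in_flight) {
      if (recovering.count(oid) == 0) {
        return false;
      }
    }
    return true;
  }

  // Models the loop in _clear_recovery_state() with the assert removed.
  void clear_recovery_state() {
    auto i = backfills_in_flight.begin();
    while (i != backfills_in_flight.end()) {
      backfills_in_flight.erase(i++);
    }
  }
};
```

In this model, calling start_backfill("obj1") followed by on_failed_pull("obj1", false) leaves obj1 in backfills_in_flight but not in recovering, so invariant_holds() returns false. That is the state in which the original ceph_assert would fire, while clear_recovery_state() without the assert simply empties backfills_in_flight.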