Re: backfill_unfound state reset to clean after osd restart

Hi,

This problem also occurred in my customer's environment, so I would like to get it fixed.
To make the discussion easier, I will restate the problem and the proposed solution.
(Mykola has already described the idea behind the solution. I apologize if anything here differs from what Mykola intended.)

In master:
Problem: A primary OSD crashes in a situation where it should not. (I think this is a bug.)
Solution: Remove the ceph_assert from the code below.
---------------------------------------------------------------
diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 626e8ccefb..12956424bd 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -13079,7 +13079,6 @@ void PrimaryLogPG::_clear_recovery_state()
   last_backfill_started = hobject_t();
   set<hobject_t>::iterator i = backfills_in_flight.begin();
   while (i != backfills_in_flight.end()) {
-    ceph_assert(recovering.count(*i));
     backfills_in_flight.erase(i++);
   }
---------------------------------------------------------------

The reason is as follows.
- The code above assumes that every object in backfills_in_flight is also in recovering.
- However, with the current implementation of on_failed_pull [1], in the non-primary OSD case an unfound object remains only in backfills_in_flight (it is removed from recovering unconditionally [2]).

Therefore, the ceph_assert above does not match the current implementation of on_failed_pull.
I think this ceph_assert should be removed, but I would like to hear opinions from the community.

[1]: https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/src/osd/PrimaryLogPG.cc#L12453-L12456
[2]: https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/src/osd/PrimaryLogPG.cc#L12439
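
The mismatch described above can be illustrated with a small toy model. This is only a sketch and not Ceph code: ToyPG, start_backfill, assert_condition_holds, clear_backfills_in_flight and the primary_case flag are hypothetical names, object ids are plain strings, and only the two containers are modeled. It assumes my reading of on_failed_pull is correct, namely that the object is always erased from recovering but erased from backfills_in_flight only in one branch.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Toy stand-in for the PG state discussed above. NOT Ceph code.
struct ToyPG {
  std::set<std::string> recovering;
  std::set<std::string> backfills_in_flight;

  // A backfill of `oid` starts: it is tracked in both containers.
  void start_backfill(const std::string& oid) {
    recovering.insert(oid);
    backfills_in_flight.insert(oid);
  }

  // Simplified failed pull. `primary_case` is a hypothetical flag for the
  // branch that also drops the object from backfills_in_flight.
  void on_failed_pull(const std::string& oid, bool primary_case) {
    recovering.erase(oid);  // unconditional, as in [2]
    if (primary_case) {
      backfills_in_flight.erase(oid);
    }
  }

  // The condition the removed ceph_assert checked for every element.
  bool assert_condition_holds() const {
    for (const auto& oid : backfills_in_flight) {
      if (recovering.count(oid) == 0) {
        return false;
      }
    }
    return true;
  }

  // Same loop shape as the patched _clear_recovery_state(), without the
  // per-element assert. Returns how many entries were dropped.
  std::size_t clear_backfills_in_flight() {
    std::size_t dropped = 0;
    auto i = backfills_in_flight.begin();
    while (i != backfills_in_flight.end()) {
      backfills_in_flight.erase(i++);  // post-increment keeps `i` valid
      ++dropped;
    }
    assert(backfills_in_flight.empty());  // toy postcondition only
    return dropped;
  }
};
```

In this model, after start_backfill("obj1") followed by on_failed_pull("obj1", false), assert_condition_holds() returns false, which corresponds to the ceph_assert firing. Clearing the set without the per-element check still empties it.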

In nautilus:
Problem: The backfill_unfound state is cleared when the OSD is restarted. (This is also a bug.)
This can lead a user to mistakenly believe the problem has been resolved, which may cause unexpected trouble later.
Solution: Leave unfound objects in backfills_in_flight, as in on_failed_pull, in the non-primary OSD case.
There is an existing commit [3], but because its changes are wide-ranging, I think only the minimum change needed to solve the problem should be committed directly to nautilus.

[3]: https://github.com/ceph/ceph/commit/8a8947d2a32d6390cb17099398e7f2212660c9a1

In addition, once this problem is fixed, the primary OSD crash described above will start to occur, so the master change described above also needs to be backported.
I am planning to send PRs next week, so please let me know before then if you have any opinions.

--
Jin
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


