Re: backfill_unfound state reset to clean after osd restart

Hi,

This problem also occurred in one of my customers' environments, so I would like to get
it fixed.
To facilitate the discussion, I will restate the problem and the currently proposed solution.
(Mykola has already described the idea behind the solution. I apologize if anything here
differs from what Mykola intended.)

In master:
Problem: The primary OSD crashes on an assertion in a situation where a crash is
unnecessary. (I think this is a bug.)
Solution: Remove the ceph_assert shown in the diff below.
---------------------------------------------------------------
diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 626e8ccefb..12956424bd 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -13079,7 +13079,6 @@ void PrimaryLogPG::_clear_recovery_state()
   last_backfill_started = hobject_t();
   set<hobject_t>::iterator i = backfills_in_flight.begin();
   while (i != backfills_in_flight.end()) {
-    ceph_assert(recovering.count(*i));
     backfills_in_flight.erase(i++);
   }
---------------------------------------------------------------

The reasoning is as follows.
- The code above assumes that every object in backfills_in_flight is also in recovering.
- However, in the current implementation of on_failed_pull[1], in the non-primary OSD
case, an unfound object is removed unconditionally from recovering[2] but remains in
backfills_in_flight.

Therefore, this ceph_assert does not match the current implementation of on_failed_pull.
I think this ceph_assert should be removed, but I would like to hear opinions from the
community.
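
To make the mismatch concrete, here is a minimal toy model of the two sets. It is only a
sketch under stated assumptions, not the actual Ceph code: ToyPG, start_backfill and the
non_primary flag are hypothetical names, std::string stands in for hobject_t, and the
behavior of on_failed_pull is modeled only as far as it is described above.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Toy stand-in for the PG bookkeeping discussed above (hypothetical type).
struct ToyPG {
  std::set<std::string> recovering;
  std::set<std::string> backfills_in_flight;

  // Hypothetical helper: a backfill in progress is tracked in both sets.
  void start_backfill(const std::string& oid) {
    recovering.insert(oid);
    backfills_in_flight.insert(oid);
  }

  // Modeled on the description above: the object always leaves `recovering`,
  // but in the non-primary case it stays in `backfills_in_flight`.
  // `non_primary` is an illustrative flag, not a real Ceph parameter.
  void on_failed_pull(const std::string& oid, bool non_primary) {
    recovering.erase(oid);
    if (!non_primary) {
      backfills_in_flight.erase(oid);
    }
  }

  // Same loop shape as _clear_recovery_state(), but instead of asserting it
  // counts the objects for which recovering.count(*i) would have been 0.
  std::size_t clear_recovery_state() {
    std::size_t would_have_asserted = 0;
    auto i = backfills_in_flight.begin();
    while (i != backfills_in_flight.end()) {
      if (recovering.count(*i) == 0) {
        ++would_have_asserted;
      }
      i = backfills_in_flight.erase(i);
    }
    return would_have_asserted;
  }
};
```

In this model, starting a backfill for an object, calling on_failed_pull(oid, true) and
then clear_recovery_state() reports one object that is in backfills_in_flight but not in
recovering, which is exactly the case where the ceph_assert would fire.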

[1]:
https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/…;
[2]:
https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/…;

In nautilus:
Problem: The backfill_unfound state is reset to clean when the OSD is restarted. (This is
also a bug.)
This can lead users to believe, mistakenly, that the problem has been resolved, and cause
unexpected trouble later.
Solution: In the non-primary OSD case, keep unfound objects in backfills_in_flight, as
on_failed_pull does in master.
Commit [3] addresses this, but its changes are wide-ranging, so I think only the minimum
change needed to fix the problem should be committed directly to nautilus.

[3]: https://github.com/ceph/ceph/commit/8a8947d2a32d6390cb17099398e7f2212660c9a1

In addition, once this problem is fixed, the primary OSD crash described above will occur
in nautilus as well, so the master change described above also needs to be backported.
I am planning to send PRs next week, so please let me know before then if you have any
comments.

--
Jin