Re: some thoughts on BlueFS space gift redesign

On 10/12/2018 7:59 PM, Sage Weil wrote:
> On Fri, 12 Oct 2018, Igor Fedotov wrote:
>>> My only concern with an ondisk compat change like this is we break
>>> downgrade (e.g., from 12.2.10 to 12.2.9 or whatever).  I think repeating
>>> the reconciliation on every startup is a small price to pay to avoid that
>>> concern.  Or, maybe we only repeat the reconciliation on mimic and
>>> luminous but not on nautilus?  Regardless, I think it is cheap: we've
>>> already loaded all the freelist state into memory.  It might not be
>>> worth the effort to skip it.
>> I'm afraid it wouldn't help - reconciliation is able to recover from DB to
>> BlueFS only. I.e. it assumes the DB replica is always valid while BlueFS
>> might be incomplete.
>> That's not the case for us here.
> The "reconciliation" I'm referring to would be the other way around:
> BlueFS is always authoritative, and on BlueStore startup, we compare what
> bluefs reports as its extents to the bluefs_extents in bluestore and make
> sure they match, and also make sure the freelist correctly shows those
> extents as in-use.
>
> So, if bluefs claimed some extra space, then crashed before bluestore
> committed that fact into rocksdb, then on the next startup we notice and
> mark those extents as in-use and update bluefs_extents.
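
For illustration only, the startup reconciliation described above could be sketched roughly as follows. This is a toy model, not BlueStore code: the function names, the set-based representation, and the 64 KiB allocation unit are all assumptions made for the example. Only the idea (BlueFS is authoritative; bluefs_extents and the freelist are repaired to match it) comes from the thread.

```python
# Hypothetical model: space is tracked in fixed-size allocation units and
# every set below holds unit indices. None of these names are Ceph APIs.

def extents_to_units(extents, unit=0x10000):
    """Expand (offset, length) extents into a set of allocation-unit indices."""
    units = set()
    for offset, length in extents:
        first = offset // unit
        last = (offset + length + unit - 1) // unit  # exclusive, rounded up
        units.update(range(first, last))
    return units


def reconcile(bluefs_owned, bluefs_extents, free_units):
    """Treat BlueFS as authoritative and repair BlueStore's view on startup.

    bluefs_owned   -- units BlueFS itself reports as owning (authoritative)
    bluefs_extents -- units BlueStore has recorded as given to BlueFS
    free_units     -- units the freelist currently shows as free

    Returns (repaired bluefs_extents, repaired free units, units fixed).
    Only the direction discussed in the thread is handled: space BlueFS
    owns that BlueStore does not yet know about.
    """
    # Space BlueFS claimed before BlueStore committed the gift to RocksDB.
    missing = bluefs_owned - bluefs_extents
    # Units owned by BlueFS that the freelist still shows as free.
    wrongly_free = bluefs_owned & free_units

    repaired_extents = bluefs_extents | missing
    repaired_free = free_units - wrongly_free
    return repaired_extents, repaired_free, missing | wrongly_free


if __name__ == "__main__":
    # BlueFS grabbed one extra extent at 0x40000, then the OSD crashed
    # before BlueStore recorded it.
    owned = extents_to_units([(0x0, 0x20000), (0x40000, 0x10000)])
    recorded = extents_to_units([(0x0, 0x20000)])
    free = {4, 5, 6}
    print(reconcile(owned, recorded, free))
```

Since everything here is an in-memory set operation over state that is already loaded, the sketch is consistent with the "it is cheap" argument above, but it says nothing about the cost or ordering of persisting the repaired state.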
Then this reconciliation would rather have to be executed on shutdown, and a proper shutdown would be a mandatory precondition for downgrade.

And in fact there is still a chance that the FM (freelist manager) ends up out of sync: our last operation is the FM update, which might itself trigger some ops (DB compaction?) that need a BlueFS allocation. Admittedly, the chances of hitting this are pretty low, especially if we try to keep some spare BlueFS space.

Igor.




sage