On Fri, 12 Oct 2018, Igor Fedotov wrote:
> > My only concern with an ondisk compat change like this is we break
> > downgrade (e.g., from 12.2.10 to 12.2.9 or whatever). I think repeating
> > the reconciliation on every startup is a small price to pay to avoid that
> > concern. Or, maybe we only repeat the reconciliation on mimic and
> > luminous but not on nautilus? Regardless, I think it is cheap: we've
> > already loaded all the freelist state into memory. It might not be
> > worth the effort to skip it.
> I'm afraid it wouldn't help - reconciliation is able to recover from DB to
> BlueFS only. I.e. it assumes DB replica is always valid while BlueFS might be
> incomplete.
> That's not the case for us here.

The "reconciliation" I'm referring to would be the other way around:
BlueFS is always authoritative, and on BlueStore startup, we compare what
bluefs reports as its extents to the bluefs_extents in bluestore and make
sure they match, and also make sure the freelist correctly shows those
extents as in-use.

So, if bluefs claimed some extra space, then crashed before bluestore
committed that fact into rocksdb, then on the next startup we notice and
mark those extents as in-use and update bluefs_extents.

sage
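
For illustration only, the startup check described above could be sketched
roughly as follows. This is a toy model in Python, not BlueStore code: the
names (reconcile, units, AU), the 64 KiB allocation unit, and the
representation of the freelist as a set of free unit offsets are all
assumptions made for the example.

```python
# Toy model of the startup reconciliation: BlueFS is treated as authoritative.
# All names here are hypothetical and do not correspond to real BlueStore APIs.
AU = 0x10000  # assumed allocation unit size (64 KiB)


def units(extents):
    """Expand (offset, length) extents into a set of allocation-unit offsets."""
    out = set()
    for off, length in extents:
        out.update(range(off, off + length, AU))
    return out


def reconcile(bluefs_reported, bluefs_extents, free_units):
    """Return (new_bluefs_extents, new_free_units, fixed).

    bluefs_reported: extents BlueFS says it owns (authoritative)
    bluefs_extents:  extents recorded on the BlueStore side
    free_units:      allocation units the freelist currently shows as free
    """
    owned = units(bluefs_reported)
    recorded = units(bluefs_extents)
    # Space BlueFS claimed but that was never committed on the BlueStore side.
    missing = owned - recorded
    # Space BlueFS owns that the freelist still (wrongly) shows as free.
    still_free = owned & free_units
    fixed = bool(missing or still_free)
    # Adopt BlueFS's view and mark everything it owns as in-use.
    # What to do with extents that are recorded but no longer reported by
    # BlueFS is deliberately left out of this sketch.
    return sorted(bluefs_reported), free_units - owned, fixed
```

In the crash case, bluefs_reported contains an extent that is absent from
bluefs_extents and still present in the freelist; the sketch then returns the
BlueFS view as the new bluefs_extents, removes those units from the free set,
and reports that a fix was applied.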