Re: some thoughts on BlueFS space gift redesign

On Fri, 12 Oct 2018, Igor Fedotov wrote:
> On 10/12/2018 4:48 PM, Sage Weil wrote:
> > Hi Igor,
> > 
> > On Fri, 12 Oct 2018, Igor Fedotov wrote:
> > > Hi Sage, et al.
> > > 
> > > Let me share some ideas on the subj.
> > > 
> > > The major rationale is to be able to perform on-demand allocation from
> > > the main device space when BlueFS lacks available space, instead of
> > > relying on the background rebalance procedure. This applies to the gift
> > > part only; the reclaiming procedure can be executed as usual.
> > > 
> > > The (major?) issue with that IMO is the circular dependency that appears
> > > if we request allocation from within BlueFS. The latter might need to
> > > allocate additional space in response to some RocksDB activity (e.g.
> > > compaction), which will invoke the BlueStore allocator followed by a sync
> > > write to the DB to save FreelistManager (FM) data for the newly allocated
> > > space. And I doubt RocksDB will handle this sequence properly.
> > > 
> > > Hence, if we go this way, we should break the circle. IMO we can do that
> > > by eliminating DB usage for BlueFS space tracking. In fact we have a full
> > > replica of the BlueFS extent list within BlueFS itself, available via the
> > > BlueFS::get_block_extents() call. So let's start using it for BlueStore
> > > FM/Allocator init (with respect to BlueFS extents only, of course) rather
> > > than reading it from the DB. Hence there is no need to update the FM/DB
> > > on a BlueFS gift allocation.
> > > 
> > > There is an additional means (BlueStore::_reconcile_bluefs_freespace())
> > > to sync the BlueFS extent lists tracked by both FM and BlueFS. This is to
> > > handle a potential unexpected termination after the BlueStore allocation
> > > and before the BlueFS log commit. But I think it's probably overkill and
> > > we can rely exclusively on the BlueFS log. Indeed, on recovery we can
> > > simply behave as if no corresponding BlueStore allocations happened if
> > > the BlueFS log wasn't committed. It looks like there should be no valid
> > > BlueFS data in this area, and BlueStore can treat it as free.
> > > 
> > > Once this is resolved, the remaining redesign sub-tasks seem to be more
> > > or less straightforward: on an internal BlueFS allocation failure, call
> > > the BlueStore allocator, update the log, and flush it in the regular
> > > manner (it looks like there is not even a need for an immediate flush
> > > after such an allocation).
> > > 
> > > What do you think? Have I missed something crucial?
> > Yes!  I like this.
> > 
> > We could almost simplify this a bit further, in fact, and eliminate the
> > bluefs allocator entirely.  The problem with that is that it makes it
> > impossible to bring up bluefs without also bringing up all of bluestore,
> > and having bluefs quasi-independent is useful for using ceph-kvstore-tool
> > and such.
> Yeah, I like this BlueFS independence as well.
> 
> Another (maybe somewhat theoretical) concern about single-allocator usage is
> potential fragmentation. Currently, in the steady cluster state, we have the
> BlueStore allocator, which might be pretty fragmented, and the BlueFS one,
> which is not (as it operates on 1M chunks only). And unless we do a rebalance
> (a seldom operation in the steady state) we don't care much about the lack of
> contiguous 1M extents (well, at least on non-spinners). But with a single
> allocator these 1M chunks are allocated and freed within the global scope and
> hence are much more likely to be split. So we might end up lacking contiguous
> extents much faster, which would prevent BlueFS from functioning at all.
> Well, that's yet another redesign topic: using smaller extents in BlueFS.....
> 
> > The problem with keeping them separate is that the (bluestore) freelist
> > state will get out of sync for the ranges in use by bluefs.  Well,
> > depending.  If we make an effort to make the freelist mark the bluefs
> > ranges in use, there is a two-phase process that may not always happen if
> > things shut down at an awkward time, which then requires a cleanup step on
> > startup to get things back in sync.  I think we can do that by enumerating
> > the ranges claimed by bluefs and ensuring they are also marked used by
> > the freelist, making corrections as needed.
> IMO we can handle that in a simpler manner while still using separate
> allocators. BlueFS extents are always left unused in the FM, as you suggested
> below. The BlueStore allocator is initialized on startup with both the FM
> "allocated" extents and the BlueFS in-use extents. Hence it has the full view
> and can allocate properly for both regular BlueStore ops and "gift" requests
> from BlueFS. This procedure works fine for newly deployed clusters.
> 
> For existing clusters the only thing we need to do on the first run is to
> "free" the FM slots corresponding to BlueFS in-use extents. We can inspect
> for the presence of the "bluefs_extents" record to detect whether we need to
> do such a cleanup, then remove that record as part of the same FM cleanup
> transaction. Have I missed something?
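
To make sure I follow, the init path you're describing would be roughly the
following (a toy sketch with made-up names, not the actual BlueStore/BlueFS
interfaces):

  // Toy sketch; none of these names are the real BlueStore/BlueFS classes.
  #include <cstdint>
  #include <map>
  #include <vector>

  struct Extent { uint64_t offset = 0, length = 0; };

  // Stand-in for the in-memory allocator: free space as offset -> length.
  struct ToyAllocator {
    std::map<uint64_t, uint64_t> free_map;

    void init_add_free(uint64_t off, uint64_t len) { free_map[off] = len; }

    // Remove [off, off+len); the toy version assumes the range is fully
    // contained in a single free extent.
    void init_rm_free(uint64_t off, uint64_t len) {
      auto it = free_map.upper_bound(off);
      if (it == free_map.begin()) return;
      --it;
      uint64_t fo = it->first, fl = it->second;
      if (off + len > fo + fl) return;   // not contained; toy code ignores
      free_map.erase(it);
      if (off > fo) free_map[fo] = off - fo;                    // left piece
      if (fo + fl > off + len)
        free_map[off + len] = fo + fl - (off + len);            // right piece
    }
  };

  // Startup: seed the allocator with the FM's free ranges, then subtract
  // whatever BlueFS reports in use (via get_block_extents()), so the
  // allocator has the full view and the FM never tracks BlueFS space.
  void init_allocator(ToyAllocator& alloc,
                      const std::vector<Extent>& fm_free,
                      const std::vector<Extent>& bluefs_in_use) {
    for (const auto& e : fm_free)
      alloc.init_add_free(e.offset, e.length);
    for (const auto& e : bluefs_in_use)
      alloc.init_rm_free(e.offset, e.length);
  }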

My only concern with an ondisk compat change like this is that we break 
downgrade (e.g., from 12.2.10 to 12.2.9 or whatever).  I think repeating 
the reconciliation on every startup is a small price to pay to avoid that 
concern.  Or, maybe we only repeat the reconciliation on mimic and 
luminous but not on nautilus?  Regardless, I think it is cheap: we've 
already loaded all the freelist state into memory.  It might not be 
worth the effort to skip it.
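
Roughly, the per-startup reconciliation would just be the following
(illustrative only; the real FreelistManager/BlueFS interfaces differ):

  // Illustrative only; real interfaces differ.
  #include <cstdint>
  #include <map>
  #include <vector>

  struct Extent { uint64_t offset = 0, length = 0; };

  // Freelist state already loaded into memory: free ranges, offset -> length.
  using ToyFreelist = std::map<uint64_t, uint64_t>;

  // Does [off, off+len) overlap anything the freelist still thinks is free?
  bool overlaps_free(const ToyFreelist& fl, uint64_t off, uint64_t len) {
    auto it = fl.upper_bound(off);
    if (it != fl.end() && it->first < off + len)
      return true;                           // a free range starts inside it
    if (it != fl.begin()) {
      --it;
      if (it->first + it->second > off)
        return true;                         // a free range extends into it
    }
    return false;
  }

  // Walk the ranges BlueFS claims and collect the ones that still need a
  // "mark used" correction in the freelist.  Cheap: everything is in memory.
  std::vector<Extent> reconcile(const ToyFreelist& fl,
                                const std::vector<Extent>& bluefs_extents) {
    std::vector<Extent> corrections;
    for (const auto& e : bluefs_extents)
      if (overlaps_free(fl, e.offset, e.length))
        corrections.push_back(e);            // becomes an FM "allocate" txn
    return corrections;
  }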

> > That seems like the minimal change, and a safe one to backport to M and L.
> > 
> > If we were doing this all over again, I wonder if we would consolidate on
> > a single allocator for both bluefs and bluestore, and leave the bluefs
> > ranges unused (no freelist updates needed; just trust bluefs's in-use
> > map).  I'm not sure if it's worth making that change for nautilus... what
> > do you think?
> Given all the above I think we can live with separate allocators for now.
> Maybe we should come back to this merge topic one day, but currently I don't
> see much benefit in it.

Yeah, I agree... especially with the fragmentation and min_alloc_size 
mismatch issue you raised!
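
For completeness, the gift path itself (as sketched upthread) would then look
something like this -- again toy names only, not actual code:

  // Toy sketch of the gift path; names are invented, not actual code.
  #include <cstdint>
  #include <optional>
  #include <vector>

  struct Extent { uint64_t offset = 0, length = 0; };

  // Stand-in for the shared (BlueStore-side) allocator.
  struct ToySharedAllocator {
    uint64_t next = 0, end = 0;              // one contiguous free region
    std::optional<Extent> allocate(uint64_t want) {
      if (next + want > end)
        return std::nullopt;                 // main device out of space too
      Extent e{next, want};
      next += want;
      return e;
    }
  };

  // Stand-in for BlueFS: its in-use extents plus its own log.
  struct ToyBlueFS {
    std::vector<Extent> block_extents;       // what get_block_extents() reports
    std::vector<Extent> log_records;         // flushed with regular log commits

    // On internal allocation failure: pull a gift from the shared allocator
    // and record it only in the BlueFS log -- no FM/DB update.  If we crash
    // before the log commits, BlueStore simply still sees the range as free.
    bool gift_from_main_device(ToySharedAllocator& alloc, uint64_t want) {
      auto got = alloc.allocate(want);
      if (!got)
        return false;
      block_extents.push_back(*got);
      log_records.push_back(*got);           // no immediate flush needed
      return true;
    }
  };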

sage


