Hi Igor,

On Fri, 12 Oct 2018, Igor Fedotov wrote:
> Hi Sage, et al.
>
> Let me share some ideas on the subject.
>
> The major rationale is to be able to perform on-demand allocation from
> the main device space when BlueFS lacks available space, instead of
> relying on the background rebalance procedure. This applies to the gift
> part only; the reclaiming procedure can be executed as usual.
>
> The (major?) issue with that, IMO, is the circular dependency that
> appears if we request allocation from within BlueFS. The latter might
> need to allocate additional space in response to some RocksDB activity
> (e.g. compaction), which will invoke the BlueStore allocator followed by
> a sync write to the DB to save FreelistManager (FM) data for the newly
> allocated space. I doubt RocksDB will handle this sequence properly.
>
> Hence, if we go this way, we should break the circle. IMO one can do
> that by eliminating DB usage for BlueFS space tracking. In fact we have
> a full replica of the BlueFS extent list within BlueFS itself, available
> via the BlueFS::get_block_extents() call. So let's start using it for
> BlueStore FM/Allocator init (with respect to BlueFS extents only) rather
> than reading from the DB. Then there is no need to update the FM/DB on a
> BlueFS gifting allocation.
>
> There is an additional means (BlueStore::_reconcile_bluefs_freespace())
> to sync the BlueFS extent lists tracked by both the FM and BlueFS. This
> is to handle potential unexpected termination after the BlueStore
> allocation and before the BlueFS log commit. But I think it's probably
> overkill and we can rely exclusively on the BlueFS log. Indeed, on
> recovery we can simply behave as if no corresponding BlueStore
> allocations have happened if the BlueFS log wasn't committed. It looks
> like there should be no valid BlueFS data in this area, and BlueStore
> can treat it as free.
>
> Once this is resolved, the following redesign sub-tasks seem to be more
> or less straightforward: on internal BlueFS allocation failure, call the
> BlueStore allocator, update the log, and flush it in the regular manner
> (it looks like there is not even a need to do an immediate flush after
> such an allocation).
>
> What do you think? Have I missed something crucial?

Yes!  I like this.

We could almost simplify this a bit further, in fact, and eliminate the
bluefs allocator entirely.  The problem with that is that it makes it
impossible to bring up bluefs without also bringing up all of bluestore,
and having bluefs quasi-independent is useful for using ceph-kvstore-tool
and such.

The problem with keeping them separate is that the (bluestore) freelist
state will get out of sync for the ranges in use by bluefs.  Well, it
depends: if we make an effort to have the freelist mark the bluefs ranges
in use, there is a two-phase process that may not always complete if
things shut down at an awkward time, which then requires a cleanup step
on startup to get things back in sync.

I think we can do that by enumerating the ranges claimed by bluefs and
making sure they are also marked used by the freelist, making corrections
as needed.  That seems like the minimal change, and a safe one to
backport to M and L.

If we were doing this all over again, I wonder if we would consolidate on
a single allocator for both bluefs and bluestore, and leave the bluefs
ranges unused in the freelist (no freelist updates needed; just trust
bluefs's in-use map).  I'm not sure it's worth making that change for
nautilus... what do you think?

sage
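As a rough illustration of the startup reconciliation step discussed above (enumerate the extents BlueFS claims, treating its log as the source of truth, and correct the freelist wherever it still shows those ranges as free), here is a minimal standalone C++ sketch. The ToyFreelist type, the extent vector, and all helper names are illustrative stand-ins, not the real FreelistManager or BlueFS interfaces.

```cpp
// Minimal standalone sketch (not Ceph code) of the reconciliation pass:
// make sure every extent BlueFS claims is also marked used in the freelist.
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <vector>

struct Extent { uint64_t offset, length; };

// Toy freelist: maps offset -> length of each *free* range.
struct ToyFreelist {
  std::map<uint64_t, uint64_t> free;  // offset -> length

  // Mark [off, off+len) as used, splitting any overlapping free ranges.
  void mark_used(uint64_t off, uint64_t len) {
    uint64_t end = off + len;
    auto it = free.lower_bound(off);
    if (it != free.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second > off) it = prev;
    }
    while (it != free.end() && it->first < end) {
      uint64_t f_off = it->first, f_end = f_off + it->second;
      it = free.erase(it);
      if (f_off < off) free[f_off] = off - f_off;                   // left remainder
      if (f_end > end) it = free.emplace(end, f_end - end).first;   // right remainder
    }
  }

  // True if any byte of [off, off+len) is still recorded as free.
  bool intersects_free(uint64_t off, uint64_t len) const {
    uint64_t end = off + len;
    for (auto& [f_off, f_len] : free)
      if (f_off < end && f_off + f_len > off) return true;
    return false;
  }
};

// Reconciliation: trust BlueFS's own extent list and patch up the freelist.
void reconcile_bluefs_extents(const std::vector<Extent>& bluefs_extents,
                              ToyFreelist& fm) {
  for (auto& e : bluefs_extents) {
    if (fm.intersects_free(e.offset, e.length)) {
      std::cout << "correcting freelist: 0x" << std::hex << e.offset
                << "~0x" << e.length << std::dec << "\n";
      fm.mark_used(e.offset, e.length);
    }
  }
}

int main() {
  ToyFreelist fm;
  fm.free[0x0] = 0x100000;        // whole (toy) device starts free
  fm.mark_used(0x0, 0x10000);     // superblock etc.

  // Extents BlueFS believes it owns (as if enumerated from its own log),
  // including a gifted range the freelist never learned about.
  std::vector<Extent> bluefs_extents = {{0x20000, 0x10000}, {0x40000, 0x8000}};
  reconcile_bluefs_extents(bluefs_extents, fm);
}
```

In the real code the freelist side would of course be persisted back through the FM/DB once corrected; the sketch only shows the detection-and-correction logic of the cleanup pass.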