Re: some thoughts on BlueFS space gift redesign

On 10/12/2018 4:48 PM, Sage Weil wrote:
Hi Igor,

On Fri, 12 Oct 2018, Igor Fedotov wrote:
Hi Sage, et al.

Let me share some ideas on the subj.

The major rationale is to be able to perform on-demand allocation from the
main device space when BlueFS runs out of available space, instead of relying on
the background rebalance procedure. This applies to the gift part only; the reclaim
procedure can be executed as usual.

The (major?) issue with that, IMO, is the circular dependency that appears if we
request allocation from within BlueFS. BlueFS might need to allocate
additional space in response to some RocksDB activity (e.g. compaction), which
would invoke the BlueStore allocator followed by a sync write to the DB to save
the FreelistManager (FM) data for the newly allocated space. I doubt RocksDB will
handle this sequence properly.

Hence, if we go this way we should break the circle. IMO we can do that by
eliminating DB usage for BlueFS space tracking. In fact we already have a full
replica of the BlueFS extent list within BlueFS itself, available via the
BlueFS::get_block_extents() call. So let's use it for BlueStore
FM/Allocator init (with respect to BlueFS extents only, of course) rather than
reading from the DB. Then there is no need to update FM/DB on a BlueFS gift allocation.

There is an additional mechanism (BlueStore::_reconcile_bluefs_freespace()) to sync
the BlueFS extent lists tracked by both FM and BlueFS. It handles potential
unexpected termination after the BlueStore allocation but before the BlueFS log
commit. I think it's probably overkill, though, and we can rely exclusively on the
BlueFS log. Indeed, on recovery we can simply behave as if no corresponding
BlueStore allocations happened if the BlueFS log wasn't committed. It looks like
there should be no valid BlueFS data in that area, so BlueStore can treat it
as free.

Once this is resolved, the remaining redesign sub-tasks seem more or less
straightforward: on an internal BlueFS allocation failure, call the BlueStore
allocator, update the BlueFS log and flush it in the regular manner (it looks like
there is not even a need for an immediate flush after such an allocation).
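Roughly like this (a sketch only; the helper name, the shared_alloc handle and the
exact Allocator::allocate() parameters are assumptions):

    // Hypothetical fallback inside BlueFS when its own allocator runs dry:
    // take a gift from the BlueStore allocator, record it in the BlueFS log
    // and let the normal log flush persist it.  No FM/DB update involved.
    int BlueFS::_expand_from_bluestore(unsigned id, uint64_t need)
    {
      const uint64_t alloc_unit = 1ull << 20;   // BlueFS works in 1M chunks
      PExtentVector extents;
      // assumed: allocate(want, alloc_unit, max_alloc_size, hint, out)
      int64_t got = shared_alloc->allocate(need, alloc_unit, 0, 0, &extents);
      if (got <= 0)
        return -ENOSPC;                         // (ignoring partial results for brevity)
      for (auto& e : extents) {
        block_all[id].insert(e.offset, e.length);      // now owned by BlueFS
        alloc[id]->init_add_free(e.offset, e.length);  // usable right away
        log_t.op_alloc_add(id, e.offset, e.length);    // picked up by the next log flush
      }
      return 0;
    }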

What do you think? Haven't I missed something crucial?
Yes!  I like this.

We could almost simplify this a bit further, in fact, and eliminate the
bluefs allocator entirely.  The problem with that is that it makes it
impossible to bring up bluefs without also bringing up all of bluestore,
and having bluefs quasi-independent is useful for using ceph-kvstore-tool
and such.
Yeah, I like this BlueFS independence as well.

Another (maybe somewhat theoretical) concern about single-allocator usage is potential fragmentation. Currently, in a steady cluster state, we have the BlueStore allocator, which might be pretty fragmented, and the BlueFS one, which is not (as it operates on 1M chunks only). And unless we do a rebalance (a rare operation in steady state) we don't care much about the lack of contiguous 1M extents (well, at least for non-spinners). But with a single allocator these 1M chunks are allocated and freed within the global scope and hence are much more likely to be split. So we might run out of contiguous extents much faster, which would prevent BlueFS from functioning at all. Well, that's yet another redesign topic: using smaller extents in BlueFS...

The problem with keeping them separate is that the (bluestore) freelist
state will get out of sync for the ranges in use by bluefs.  Well,
depending.  If we make an effort to have the freelist mark the bluefs
ranges in use, there is a two-phase process that may not always happen if
things shut down at an awkward time, which then requires a cleanup step on
startup to get things back in sync.  I think we can do that by enumerating
the ranges claimed by bluefs and making sure they are also marked used by
the freelist, making corrections as needed.
IMO we can handle that in a simpler manner while still using separate allocators.
BlueFS extents are always left unused in FM, as you suggest below.
The BlueStore allocator is initialized on startup from both the FM "allocated" state and the BlueFS in-use extents, so it has the full view and can allocate properly for both regular BlueStore ops and "gift" requests from BlueFS. This procedure works fine for newly deployed clusters.
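In sketch form (startup ordering only; assume FM enumeration works as it does today
and ignore error handling):

    // Hypothetical startup path: the allocator gets FM's view, where BlueFS
    // space simply looks free, and then the BlueFS-owned extents are removed,
    // so regular ops and gift requests share one consistent view.
    void BlueStore::_open_alloc_shared()
    {
      uint64_t offset, length;
      fm->enumerate_reset();
      while (fm->enumerate_next(&offset, &length))
        alloc->init_add_free(offset, length);            // FM: BlueFS space shows as free

      interval_set<uint64_t> bfs;
      bluefs->get_block_extents(bluefs_shared_bdev, &bfs);
      for (auto p = bfs.begin(); p != bfs.end(); ++p)
        alloc->init_rm_free(p.get_start(), p.get_len()); // but BlueFS really owns it
    }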

For existing clusters, the only thing we need to do on the first run is to "free" the FM slots corresponding to BlueFS in-use extents. We can check for the presence of the "bluefs_extents" record to detect whether such a cleanup is needed, and then remove that record within the same FM cleanup transaction. Have I missed something?
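Something like (a sketch; the helper is made up, prefixes/records as in the current
superblock layout):

    // Hypothetical one-shot migration for existing stores: if the legacy
    // "bluefs_extents" record is present, release those ranges in FM and
    // drop the record within the same transaction.
    void BlueStore::_maybe_migrate_bluefs_extents()
    {
      bufferlist bl;
      if (db->get(PREFIX_SUPER, "bluefs_extents", &bl) != 0)
        return;                                    // already migrated
      interval_set<uint64_t> legacy;
      auto p = bl.begin();
      decode(legacy, p);

      KeyValueDB::Transaction t = db->get_transaction();
      for (auto e = legacy.begin(); e != legacy.end(); ++e)
        fm->release(e.get_start(), e.get_len(), t);  // FM stops tracking BlueFS space
      t->rmkey(PREFIX_SUPER, "bluefs_extents");
      db->submit_transaction_sync(t);
    }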


That seems like the minimal change, and a safe one to backport to M and L.

If we were doing this all over again, I wonder if we would consolidate on
a single allocator for both bluefs and bluestore, and leave the bluefs
ranges unused (no freelist updates needed; just trust bluefs's in-use
map).  I'm not sure if it's worth making that change for nautilus... what
do you think?
Given all the above, I think we can live with separate allocators for now. Maybe we should come back to this merge topic one day, but currently I don't see much benefit in it.

sage



