some thoughts on BlueFS space gift redesign

Igor Fedotov <ifedotov@xxxxxxx> · Fri, 12 Oct 2018 16:21:33 +0300

Hi Sage, et al.

Let me share some ideas on the subject.

The main rationale is to allow on-demand allocation from the main
device when BlueFS runs out of available space, instead of relying on
the background rebalance procedure. This applies to the gift path only;
the reclaim procedure can run as usual.

The (major?) issue with that, IMO, is the circular dependency that
appears if we request the allocation from within BlueFS. BlueFS might
need to allocate additional space in response to some RocksDB activity
(e.g. compaction). That would invoke the BlueStore allocator, followed
by a sync write to the DB to save FreelistManager (FM) data for the
newly allocated space. And I doubt RocksDB will handle this sequence
properly.

Hence, if we go this way, we should break the circle. IMO one can do
that by eliminating DB usage for BlueFS space tracking. In fact we have
a full replica of the BlueFS extent list within BlueFS itself, available
via the BlueFS::get_block_extents() call. So let's start using it for
BlueStore FM/Allocator init (for BlueFS extents only, of course) rather
than reading them from the DB. Then there is no need to update FM/DB
when gifting space to BlueFS.
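
To make the init idea concrete, here is a minimal sketch under stated
assumptions. ToyAllocator, Extent and init_allocator() are hypothetical
stand-ins, not the real Ceph classes; the method names only loosely
mirror an allocator's init-time interface. It assumes FM no longer
records BlueFS ownership, so gifted space looks free in FM and has to be
carved out using the extent list that BlueFS itself reports.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy types for illustration only; not the real Ceph classes.
struct Extent {
  uint64_t offset;  // in allocation units
  uint64_t length;  // in allocation units
};

class ToyAllocator {
  std::vector<bool> free_;  // one flag per allocation unit

  void mark(const Extent& e, bool is_free) {
    assert(e.offset + e.length <= free_.size());
    for (uint64_t i = e.offset; i < e.offset + e.length; ++i)
      free_[i] = is_free;
  }

public:
  explicit ToyAllocator(uint64_t units) : free_(units, false) {}
  void init_add_free(const Extent& e) { mark(e, true); }
  void init_rm_free(const Extent& e) { mark(e, false); }
  bool is_free(uint64_t unit) const { return free_[unit]; }
  uint64_t free_units() const {
    uint64_t n = 0;
    for (bool f : free_) n += f ? 1 : 0;
    return n;
  }
};

// Hypothetical init path: fm_free is what the freelist (DB) reports as
// free; bluefs_owned is the extent list reported by BlueFS itself.
// BlueFS-owned space is removed last, so BlueFS's own record wins.
ToyAllocator init_allocator(uint64_t device_units,
                            const std::vector<Extent>& fm_free,
                            const std::vector<Extent>& bluefs_owned) {
  ToyAllocator alloc(device_units);
  for (const Extent& e : fm_free) alloc.init_add_free(e);
  for (const Extent& e : bluefs_owned) alloc.init_rm_free(e);
  return alloc;
}
```

In this model a gift never touches the DB: the only durable record of
BlueFS ownership is the one BlueFS keeps, and the allocator is rebuilt
from it at startup.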

There is an additional mechanism
(BlueStore::_reconcile_bluefs_freespace()) to sync the BlueFS extent
lists tracked by both FM and BlueFS. It handles a potential unexpected
termination after the BlueStore allocation but before the BlueFS log
commit. But I think it's probably overkill, and we can rely exclusively
on the BlueFS log. Indeed, on recovery we can simply behave as if the
corresponding BlueStore allocations never happened when the BlueFS log
wasn't committed. It looks like there can be no valid BlueFS data in
such an area, so BlueStore can treat it as free.
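
The recovery argument can be illustrated with a small sketch. LogRecord,
replay_owned() and the committed_seq parameter are hypothetical names
introduced for this example and only model the idea; they do not reflect
the actual BlueFS log format. Only gift records at or below the last
committed sequence number count as BlueFS-owned, and anything later is
simply never seen on replay.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy types for illustration only; not the real BlueFS log format.
struct Extent {
  uint64_t offset;
  uint64_t length;
};

struct LogRecord {
  uint64_t seq;   // log sequence number, strictly increasing, >= 1
  Extent gifted;  // space handed to BlueFS by this record
};

// Rebuild the set of BlueFS-owned extents after a crash. Records past
// committed_seq were never made durable, so the space they describe is
// treated as if it had never been gifted.
std::vector<Extent> replay_owned(const std::vector<LogRecord>& log,
                                 uint64_t committed_seq) {
  std::vector<Extent> owned;
  uint64_t prev_seq = 0;
  for (const LogRecord& r : log) {
    assert(r.seq > prev_seq);  // records must arrive in log order
    prev_seq = r.seq;
    if (r.seq <= committed_seq)
      owned.push_back(r.gifted);
  }
  return owned;
}
```

Feeding the result into allocator init means an uncommitted gift falls
back to BlueStore's free space automatically, without a separate
reconcile step.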

Once this is resolved, the remaining redesign sub-tasks seem to be more
or less straightforward: on internal BlueFS allocation failure, call the
BlueStore allocator, update the log, and flush it in the regular manner
(it looks like there is not even a need for an immediate flush after
such an allocation).
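
The fallback path described above could look roughly like the following
sketch. ToyBlueFS, GiftFn and all member names are hypothetical and
chosen for illustration; the callback stands in for the BlueStore
allocator, and the pending list stands in for gift records that ride
along with the next regular log flush.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct Extent {
  uint64_t offset;
  uint64_t length;
};

// Hypothetical callback standing in for the BlueStore allocator:
// returns true and fills *out when it can hand over `want` units.
using GiftFn = std::function<bool(uint64_t want, Extent* out)>;

class ToyBlueFS {
  uint64_t free_units_ = 0;          // space BlueFS owns but has not used
  std::vector<Extent> owned_;        // all extents gifted so far
  std::vector<Extent> pending_log_;  // gift records not yet flushed
  GiftFn gift_;

public:
  explicit ToyBlueFS(GiftFn gift) : gift_(std::move(gift)) {
    assert(gift_);  // a usable callback is required
  }

  // Try the internal pool first; on shortage ask for a gift on demand.
  bool allocate(uint64_t want) {
    if (free_units_ < want) {
      Extent e{0, 0};
      if (!gift_(want - free_units_, &e)) return false;
      owned_.push_back(e);
      pending_log_.push_back(e);  // logged now, flushed later as usual
      free_units_ += e.length;
      if (free_units_ < want) return false;  // gift came back short
    }
    free_units_ -= want;
    return true;
  }

  void flush_log() { pending_log_.clear(); }  // regular flush point
  std::size_t pending_records() const { return pending_log_.size(); }
  uint64_t owned_units() const {
    uint64_t n = 0;
    for (const Extent& e : owned_) n += e.length;
    return n;
  }
};
```

Note that allocate() does not flush: the gift record just waits for the
next regular flush, which matches the expectation that no immediate
flush is needed.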

What do you think? Have I missed something crucial?

Thanks,

Igor