Re: some thoughts on BlueFS space gift redesign

On 10/12/2018 4:48 PM, Sage Weil wrote:
Hi Igor,

On Fri, 12 Oct 2018, Igor Fedotov wrote:
Hi Sage, et al.

Let me share some ideas on the subj.

The major rationale is to be able to perform on-demand allocation from the
main device space when BlueFS runs out of available space, instead of relying on
the background rebalance procedure. This applies to the gift part only; the reclaim
procedure can be executed as usual.

The (major?) issue with that, IMO, is the circular dependency that appears if we
request allocation from within BlueFS. BlueFS might need to allocate
additional space in response to some RocksDB activity (e.g. compaction), which
would invoke the BlueStore allocator followed by a sync write to the DB to save
the FreelistManager (FM) data for the newly allocated space. I doubt RocksDB will
handle this sequence properly.

Hence, if we go this way we should break the circle. IMO we can do that by
eliminating DB usage for BlueFS space tracking. In fact we already have a full
replica of the BlueFS extent list within BlueFS itself, available via the
BlueFS::get_block_extents() call. So let's use it for BlueStore
FM/Allocator init (with respect to BlueFS extents only, of course) rather than
reading from the DB. Then there is no need to update FM/DB on a BlueFS gift allocation.

There is an additional mechanism (BlueStore::_reconcile_bluefs_freespace()) to sync
the BlueFS extent lists tracked by both FM and BlueFS. It handles potential
unexpected termination after the BlueStore allocation but before the BlueFS log
commit. I think it's probably overkill, though, and we can rely exclusively on the
BlueFS log. Indeed, on recovery we can simply behave as if no corresponding
BlueStore allocations happened if the BlueFS log wasn't committed. It looks like
there should be no valid BlueFS data in that area, so BlueStore can treat it
as free.

Once this is resolved, the remaining redesign sub-tasks seem more or less
straightforward: on an internal BlueFS allocation failure, call the BlueStore
allocator, update the BlueFS log and flush it in the regular manner (it looks like
there is not even a need for an immediate flush after such an allocation).
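Roughly like this (a sketch only; the helper name, the shared_alloc handle and the
exact Allocator::allocate() parameters are assumptions):

    // Hypothetical fallback inside BlueFS when its own allocator runs dry:
    // take a gift from the BlueStore allocator, record it in the BlueFS log
    // and let the normal log flush persist it.  No FM/DB update involved.
    int BlueFS::_expand_from_bluestore(unsigned id, uint64_t need)
    {
      const uint64_t alloc_unit = 1ull << 20;   // BlueFS works in 1M chunks
      PExtentVector extents;
      // assumed: allocate(want, alloc_unit, max_alloc_size, hint, out)
      int64_t got = shared_alloc->allocate(need, alloc_unit, 0, 0, &extents);
      if (got <= 0)
        return -ENOSPC;                         // (ignoring partial results for brevity)
      for (auto& e : extents) {
        block_all[id].insert(e.offset, e.length);      // now owned by BlueFS
        alloc[id]->init_add_free(e.offset, e.length);  // usable right away
        log_t.op_alloc_add(id, e.offset, e.length);    // picked up by the next log flush
      }
      return 0;
    }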

What do you think? Haven't I missed something crucial?
Yes!  I like this.

We could almost simplify this a bit further, in fact, and eliminate the
bluefs allocator entirely.  The problem with that is that it makes it
impossible to bring up bluefs without also bringing up all of bluestore,
and having bluefs quasi-independent is useful for using ceph-kvstore-tool
and such.
Yeah, I like this BlueFS independence as well.

Another (maybe somewhat theoretical) concern about single-allocator usage is potential fragmentation. Currently, in a steady cluster state, we have the BlueStore allocator, which might be pretty fragmented, and the BlueFS one, which is not (as it operates on 1M chunks only). And unless we do a rebalance (a rare operation in steady state) we don't care much about the lack of contiguous 1M extents (well, at least for non-spinners). But with a single allocator these 1M chunks are allocated and freed within the global scope and hence are much more likely to be split. So we might run out of contiguous extents much faster, which would prevent BlueFS from functioning at all. Well, that's yet another redesign topic: using smaller extents in BlueFS...

The problem with keeping them separate is that the (bluestore) freelist
state will get out of sync for the ranges in use by bluefs.  Well,
depending.  If we make an effort to have the freelist mark the bluefs
ranges in use, there is a two-phase process that may not always happen if
things shut down at an awkward time, which then requires a cleanup step on
startup to get things back in sync.  I think we can do that by enumerating
the ranges claimed by bluefs and making sure they are also marked used by
the freelist, making corrections as needed.
IMO we can handle that in a simpler manner while still using separate allocators.
BlueFS extents are always left unused in FM, as you suggest below.
The BlueStore allocator is initialized on startup from both the FM "allocated" state and the BlueFS in-use extents, so it has the full view and can allocate properly for both regular BlueStore ops and "gift" requests from BlueFS. This procedure works fine for newly deployed clusters.
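In sketch form (startup ordering only; assume FM enumeration works as it does today
and ignore error handling):

    // Hypothetical startup path: the allocator gets FM's view, where BlueFS
    // space simply looks free, and then the BlueFS-owned extents are removed,
    // so regular ops and gift requests share one consistent view.
    void BlueStore::_open_alloc_shared()
    {
      uint64_t offset, length;
      fm->enumerate_reset();
      while (fm->enumerate_next(&offset, &length))
        alloc->init_add_free(offset, length);            // FM: BlueFS space shows as free

      interval_set<uint64_t> bfs;
      bluefs->get_block_extents(bluefs_shared_bdev, &bfs);
      for (auto p = bfs.begin(); p != bfs.end(); ++p)
        alloc->init_rm_free(p.get_start(), p.get_len()); // but BlueFS really owns it
    }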

For existing clusters, the only thing we need to do on the first run is to "free" the FM slots corresponding to BlueFS in-use extents. We can check for the presence of the "bluefs_extents" record to detect whether such a cleanup is needed, and then remove that record within the same FM cleanup transaction. Have I missed something?
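Something like (a sketch; the helper is made up, prefixes/records as in the current
superblock layout):

    // Hypothetical one-shot migration for existing stores: if the legacy
    // "bluefs_extents" record is present, release those ranges in FM and
    // drop the record within the same transaction.
    void BlueStore::_maybe_migrate_bluefs_extents()
    {
      bufferlist bl;
      if (db->get(PREFIX_SUPER, "bluefs_extents", &bl) != 0)
        return;                                    // already migrated
      interval_set<uint64_t> legacy;
      auto p = bl.begin();
      decode(legacy, p);

      KeyValueDB::Transaction t = db->get_transaction();
      for (auto e = legacy.begin(); e != legacy.end(); ++e)
        fm->release(e.get_start(), e.get_len(), t);  // FM stops tracking BlueFS space
      t->rmkey(PREFIX_SUPER, "bluefs_extents");
      db->submit_transaction_sync(t);
    }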


That seems like the minimal change, and a safe one to backport to M and L.

If we were doing this all over again, I wonder if we would consolidate on
a single allocator for both bluefs and bluestore, and leave the bluefs
ranges unused (no freelist updates needed; just trust bluefs's in-use
map).  I'm not sure if it's worth making that change for nautilus... what
do you think?
Given all the above, I think we can live with separate allocators for now. Maybe we should come back to this merge topic one day, but currently I don't see much benefit in it.

sage



