On Fri, 12 Oct 2018, Igor Fedotov wrote:
> On 10/12/2018 4:48 PM, Sage Weil wrote:
> > Hi Igor,
> >
> > On Fri, 12 Oct 2018, Igor Fedotov wrote:
> > > Hi Sage, et al.
> > >
> > > Let me share some ideas on the subject.
> > >
> > > The major rationale is to be able to perform on-demand allocation from
> > > the main device space when BlueFS lacks available space, instead of
> > > relying on the background rebalance procedure. This applies to the gift
> > > direction only; the reclaim procedure can be executed as usual.
> > >
> > > The (major?) issue with that, IMO, is the circular dependency that
> > > appears if we request allocation from within BlueFS. The latter might
> > > need to allocate additional space in response to some RocksDB activity
> > > (e.g. compaction), which would invoke the BlueStore allocator followed
> > > by a sync write to the DB to save FreelistManager (FM) data for the
> > > newly allocated space. And I doubt RocksDB will handle this sequence
> > > properly.
> > >
> > > Hence if we go this way we should break the circle. IMO one can do that
> > > by eliminating DB usage for BlueFS space tracking. In fact we have a
> > > full replica of the BlueFS extent list within BlueFS itself, available
> > > via the BlueFS::get_block_extents() call. So let's start using it for
> > > BlueStore FM/Allocator init (with respect to BlueFS extents only)
> > > rather than reading from the DB. Hence there is no need to update the
> > > FM/DB on a BlueFS gift allocation.
> > >
> > > There is an additional mechanism
> > > (BlueStore::_reconcile_bluefs_freespace()) to sync the BlueFS extent
> > > lists tracked by both the FM and BlueFS. It handles a potential
> > > unexpected termination after the BlueStore allocation but before the
> > > BlueFS log commit. But I think it's probably overkill and we can rely
> > > exclusively on the BlueFS log. Indeed, on recovery we can simply behave
> > > as if no corresponding BlueStore allocations had happened if the BlueFS
> > > log wasn't committed. It looks like there should be no valid BlueFS
> > > data in that area, so BlueStore can treat it as free.
> > >
> > > Once this is resolved, the remaining redesign sub-tasks seem more or
> > > less straightforward: on an internal BlueFS allocation failure, call
> > > the BlueStore allocator, update the log and flush it in the regular
> > > manner (it looks like there isn't even a need to do an immediate flush
> > > after such an allocation).
> > >
> > > What do you think? Have I missed something crucial?
> >
> > Yes! I like this.
> >
> > We could almost simplify this a bit further, in fact, and eliminate the
> > bluefs allocator entirely. The problem with that is that it makes it
> > impossible to bring up bluefs without also bringing up all of bluestore,
> > and having bluefs quasi-independent is useful for using ceph-kvstore-tool
> > and such.
>
> Yeah, I like this BlueFS independence as well.
>
> Another (maybe a bit hypothetical, though) concern about single-allocator
> usage is potential fragmentation. Currently, in the steady cluster state,
> we have the BlueStore allocator, which might be pretty fragmented, and the
> BlueFS one, which is not (as it operates on 1M chunks only). And unless we
> do a rebalance (a rare operation in the steady state) we don't care much
> about a lack of contiguous 1M extents (well, at least for non-spinners).
> But with a single allocator these 1M chunks are allocated and freed within
> the global scope and hence are much more likely to be split. So we might
> end up lacking contiguous extents much faster, which would prevent BlueFS
> from functioning at all. Well, that's yet another redesign topic: using
> smaller extents in BlueFS...
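
For illustration, here is a minimal, self-contained toy sketch of the gift
path described above; ToyAllocator, ToyBlueFSLog and gift_from_bluestore are
hypothetical stand-ins, not the real BlueStore/BlueFS classes. The point is
only that on an internal BlueFS allocation failure the extent comes from the
BlueStore allocator and is recorded solely in the BlueFS log, so no
FreelistManager/DB write (and hence no RocksDB re-entry) happens on this path.

  // Rough sketch only: toy types standing in for the real classes.
  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <optional>
  #include <vector>

  struct Extent { uint64_t offset, length; };

  // Toy stand-in for the BlueStore-side allocator: free space keyed by offset.
  struct ToyAllocator {
    std::map<uint64_t, uint64_t> free;  // offset -> length
    std::optional<Extent> allocate(uint64_t want) {
      for (auto it = free.begin(); it != free.end(); ++it) {
        if (it->second < want)
          continue;
        Extent e{it->first, want};
        uint64_t rest_off = it->first + want;
        uint64_t rest_len = it->second - want;
        free.erase(it);
        if (rest_len)
          free[rest_off] = rest_len;   // keep the unused tail of the chunk
        return e;
      }
      return std::nullopt;             // ENOSPC
    }
  };

  // Toy stand-in for the BlueFS log: under the proposed scheme it is the only
  // persistent record of the extents BlueFS owns (no FM/DB update needed).
  struct ToyBlueFSLog {
    std::vector<Extent> owned;                                  // committed extents
    void append_alloc(const Extent& e) { owned.push_back(e); }  // flushed lazily
  };

  // Proposed gift path: BlueFS ran out of its own space, so pull an extent
  // from the BlueStore allocator and record it only in the BlueFS log.
  std::optional<Extent> gift_from_bluestore(ToyAllocator& alloc,
                                            ToyBlueFSLog& log,
                                            uint64_t want) {
    auto e = alloc.allocate(want);
    if (e)
      log.append_alloc(*e);
    return e;
  }

  int main() {
    ToyAllocator alloc;
    alloc.free[0] = 8ull << 20;  // 8 MiB of free main-device space
    ToyBlueFSLog log;
    if (auto e = gift_from_bluestore(alloc, log, 1ull << 20))  // one 1 MiB chunk
      std::cout << "gifted 0x" << std::hex << e->offset
                << "~0x" << e->length << " to BlueFS\n";
    return 0;
  }
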
> > The problem with keeping them separate is that the (bluestore) freelist
> > state will get out of sync for the ranges in use by bluefs. Well,
> > depending. If we make an effort to make the freelist mark the bluefs
> > ranges in use, there is a two-phase process that may not always happen if
> > things shut down at an awkward time, which then requires a cleanup step
> > on startup to get things back in sync. I think we can do that by
> > enumerating the ranges claimed by bluefs and making sure they are also
> > marked used by the freelist, making corrections as needed.
>
> IMO we can handle that in a simpler manner while still using separate
> allocators. BlueFS extents are always left marked unused in the FM, as you
> suggested below. The BlueStore allocator is initialized on startup with
> both the FM-"allocated" extents and the BlueFS in-use extents. Hence it has
> the full view and can allocate properly for both regular BlueStore ops and
> "gift" requests from BlueFS. This procedure works fine for newly deployed
> clusters.
>
> For an existing cluster, the only thing we need to do on the first run is
> to "free" the FM slots corresponding to the BlueFS in-use extents. We can
> check for the presence of the "bluefs_extents" record to detect whether we
> need to do such a cleanup, then remove that record along with the FM
> cleanup transaction. Have I missed something?

My only concern with an on-disk compat change like this is that we break
downgrade (e.g., from 12.2.10 to 12.2.9 or whatever). I think repeating the
reconciliation on every startup is a small price to pay to avoid that
concern. Or maybe we only repeat the reconciliation on mimic and luminous
but not on nautilus? Regardless, I think it is cheap: we've already loaded
all the freelist state into memory. It might not be worth the effort to
skip it.

> > That seems like the minimal change, and a safe one to backport to M and L.
> >
> > If we were doing this all over again, I wonder if we would consolidate on
> > a single allocator for both bluefs and bluestore, and leave the bluefs
> > ranges unused (no freelist updates needed; just trust bluefs's in-use
> > map). I'm not sure if it's worth making that change for nautilus... what
> > do you think?
>
> Given all the above, I think we can live with separate allocators for now.
> Maybe we should come back to this merge topic one day, but currently I
> don't see much benefit in it.

Yeah, I agree... especially with the fragmentation and min_alloc_size
mismatch issue you raised!

sage
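
As a companion sketch of the startup side of the same scheme (again with
hypothetical stand-ins: FreeMap, mark_used and init_allocator_free_view are
not the real FreelistManager/Allocator interfaces): seed the allocator's free
view from the FM, which never tracks BlueFS space here, then carve out
whatever BlueFS reports owning via its own extent-list replica (cf.
BlueFS::get_block_extents()). The per-startup reconciliation discussed above
would live in the same loop.

  // Rough, self-contained sketch; not the actual Ceph API.
  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <vector>

  using FreeMap = std::map<uint64_t, uint64_t>;  // offset -> length of free space
  struct Extent { uint64_t offset, length; };

  // Carve [e.offset, e.offset + e.length) out of the free view. Simplified:
  // assumes each BlueFS-owned extent either lies entirely within one free
  // chunk or is already absent from the free view (already marked used), in
  // which case there is nothing to correct.
  static void mark_used(FreeMap& free, const Extent& e) {
    auto it = free.upper_bound(e.offset);
    if (it == free.begin())
      return;                               // no free chunk at or before this offset
    --it;
    uint64_t f_off = it->first, f_len = it->second;
    if (e.offset >= f_off + f_len)
      return;                               // already marked used, no correction
    free.erase(it);
    if (e.offset > f_off)
      free[f_off] = e.offset - f_off;       // keep the piece on the left
    uint64_t e_end = e.offset + e.length, f_end = f_off + f_len;
    if (f_end > e_end)
      free[e_end] = f_end - e_end;          // keep the piece on the right
  }

  // Startup: seed the allocator with the FM's free extents (the FM never
  // tracks BlueFS space in this scheme), then remove whatever BlueFS says it
  // owns according to its own extent-list replica.
  FreeMap init_allocator_free_view(const FreeMap& fm_free,
                                   const std::vector<Extent>& bluefs_owned) {
    FreeMap free = fm_free;
    for (const auto& e : bluefs_owned)
      mark_used(free, e);
    return free;
  }

  int main() {
    FreeMap fm_free = {{0, 16ull << 20}};            // FM view: 16 MiB, all free
    std::vector<Extent> bluefs_owned = {
      {0, 1ull << 20},                               // BlueFS owns two 1 MiB chunks
      {4ull << 20, 1ull << 20},
    };
    FreeMap free = init_allocator_free_view(fm_free, bluefs_owned);
    for (const auto& [off, len] : free)
      std::cout << "free 0x" << std::hex << off << "~0x" << len << "\n";
    return 0;
  }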