RE: Bluestore different allocator performance Vs FileStore

Is there a simple way to detect whether you're in initialization or not? If so, you could augment the debug_asserts to skip the is_allocated() check during initialization but re-enable it during normal operation.
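
A minimal sketch of that idea, assuming a simple in-memory bitmap. ToyBitAllocator, finish_init() and the initializing_ flag are hypothetical names for illustration only; this is not the actual BitAllocator code, and the real class may need a different mechanism to know when initialization has ended.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy allocator, NOT Ceph's BitAllocator.
// Every block starts out "allocated"; init then frees the usable ranges.
class ToyBitAllocator {
public:
  explicit ToyBitAllocator(std::size_t num_blocks)
    : allocated_(num_blocks, true), initializing_(true) {}

  // Assumed hook: call once the initial free ranges have been loaded.
  void finish_init() { initializing_ = false; }

  // O(count) scan; this is the kind of check that is slow on a huge range.
  bool is_allocated(std::size_t start, std::size_t count) const {
    for (std::size_t i = start; i < start + count; ++i) {
      if (!allocated_[i]) return false;
    }
    return true;
  }

  void allocate_blocks(std::size_t start, std::size_t count) {
    set_range(start, count, true);
  }

  void free_blocks(std::size_t start, std::size_t count) {
    // Short-circuit: the scan is skipped while initializing_ is true,
    // and enforced again during normal operation.
    assert(initializing_ || is_allocated(start, count));
    set_range(start, count, false);
  }

private:
  void set_range(std::size_t start, std::size_t count, bool value) {
    for (std::size_t i = start; i < start + count; ++i) allocated_[i] = value;
  }

  std::vector<bool> allocated_;
  bool initializing_;
};
```

With something like this, the one very large free during mkfs/mount skips the scan, while a double free in the IO path would still trip the assert in debug builds.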

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, August 11, 2016 8:10 PM
> To: Sage Weil <sage@xxxxxxxxxxxx>; Allen Samuels
> <Allen.Samuels@xxxxxxxxxxx>
> Cc: Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx>; ceph-devel
> <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> Sage,
> I tried your PR but it is not helping much. As the log below shows, each
> insert_free() call takes ~40 sec to complete, and there are two such calls.
> 
> 2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free
> instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
> 2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free instance
> 140128595341440 off 0x2000 len 0x6ab7d14f000
> 2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist  no more clear bits in
> 0x6ab7d100000
> 
> 2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free
> instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
> 2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free instance
> 140127837929472 off 0x2000 len 0x6ab7d14f000
> 2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist  no more clear bits in
> 0x6ab7d100000
> 
> I have also tried the following settings, and they did not help either:
> 
>        bluestore_bluefs_min_ratio = .01
>        bluestore_freelist_blocks_per_key = 512
> 
> 
> I did some debugging to find out which call inside this function is
> taking the time, and I found this within BitAllocator::free_blocks:
> 
>   debug_assert(is_allocated(start_block, num_blocks));
> 
>   free_blocks_int(start_block, num_blocks);
> 
> I skipped this debug_assert and the total time dropped from ~80 sec to
> ~49 sec, which is a significant improvement.
> 
> Next, I found that debug_assert(is_allocated()) is called from
> free_blocks_int as well. I blindly commented out every
> debug_assert(is_allocated()), and performance became similar to
> stupid/filestore.
> I didn't look into is_allocated() any further, as my guess is that we can
> safely skip it during mkfs()?
> Still, it would be good to optimize it, as it may add latency in the IO
> path (?).
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> Sent: Thursday, August 11, 2016 2:20 PM
> To: Allen Samuels
> Cc: Ramesh Chander; Somnath Roy; ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> > > Sent: Thursday, August 11, 2016 1:24 PM
> > > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> > > Cc: Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx>; Somnath Roy
> > > > <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > clarify as needed.
> > > >
> > > > I thought that the authoritative indication of space used by
> > > > BlueFS was contained in the snapshot/journal of BlueFS itself, NOT
> > > > in the KV store itself. This requires that upon startup, we replay
> > > > the BlueFS snapshot/journal into the FreeListManager so that it
> > > > properly records the consumption of BlueFS space (since that
> > > > > allocation MAY NOT be accurate within the FreeListManager itself).
> > > > > But that this playback need not generate any KVStore operations
> > > > (since those are duplicates of the BlueFS).
> > > >
> > > > So in the code you cite:
> > > >
> > > > fm->allocate(0, reserved, t);
> > > >
> > > > There's no need to commit 't', and in fact, in the general case,
> > > > you don't want to commit 't'.
> > > >
> > > > > That suggests to me that a version of allocate that doesn't take a
> > > > > transaction could be easily created and would have the speed we're
> > > > > looking for (and independence from the BitMapAllocator to KVStore
> > > > > chunking).
> > >
> > > Oh, I see.  Yeah, you're right--this step isn't really necessary, as
> > > > long as we ensure that the auxiliary representation of what bluefs
> > > owns (bluefs_extents in the superblock) is still passed into the
> > > Allocator during initialization.  Having the freelist reflect the
> > > allocator that this space was "in use" (by bluefs) and thus off
> > > limits to bluestore is simple but not strictly necessary.
> > >
> > > I'll work on a PR that avoids this...
> 
> https://github.com/ceph/ceph/pull/10698
> 
> Ramesh, can you give it a try?
> 
> > > > I suspect that we also have long startup times because we're doing
> > > > the same underlying bitmap operations except they come from the
> > > > BlueFS replay code instead of the BlueFS initialization code, but
> > > > same problem with likely the same fix.
> > >
> > > BlueFS doesn't touch the FreelistManager (or explicitly persist the
> > > freelist at all)... we initialize the in-memory Allocator state from
> > > the metadata in the bluefs log.  I think we should be fine on this end.
> >
> > Likely that code suffers from the same problem -- a false need to
> > update the KV Store (During the playback, BlueFS extents are converted
> > to bitmap runs, it's essentially the same lower level code as the case
> > we're seeing now, but instead of being driven by an artificial "big
> > run", it'll be driven from the BlueFS Journal replay code). But
> > that's just a guess, I don't have time to track down the actual code
> > right now.
> 
> BlueFS can't touch the freelist (or kv store, ever) since it ultimately backs the
> kv store and that would be problematic.  We do initialize the bluefs
> Allocator's in-memory state, but that's it.
> 
> The PR above changes BlueStore::_init_alloc() so that BlueStore's
> Allocator state is initialized with both the freelist state (from kv store)
> *and* the bluefs_extents list (from the bluestore superblock).  (From this
> Allocator's perspective, all of bluefs's space is allocated and can't be used.
> BlueFS has its own separate instance to do its internal
> allocations.)
> 
> sage
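
For readers following the thread, the initialization described above (start from the freelist's free extents, then treat everything listed in bluefs_extents as allocated) can be sketched roughly as follows. This is a toy illustration under assumptions, not the code in the PR: Extent, init_alloc_state() and the per-block bool map are hypothetical, and extents are given in blocks rather than bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical extent type: (offset, length), both in blocks.
using Extent = std::pair<std::uint64_t, std::uint64_t>;

// Returns a per-block map where true means "free for BlueStore".
// freelist_free stands in for the free extents read from the kv store;
// bluefs_extents stands in for the list kept in the bluestore superblock.
std::vector<bool> init_alloc_state(std::size_t num_blocks,
                                   const std::vector<Extent>& freelist_free,
                                   const std::vector<Extent>& bluefs_extents) {
  std::vector<bool> free_map(num_blocks, false);

  // Step 1: everything the freelist reports as free becomes free.
  for (const Extent& e : freelist_free) {
    for (std::uint64_t i = e.first; i < e.first + e.second && i < num_blocks; ++i) {
      free_map[i] = true;
    }
  }

  // Step 2: space owned by bluefs is off limits, whatever the freelist said.
  for (const Extent& e : bluefs_extents) {
    for (std::uint64_t i = e.first; i < e.first + e.second && i < num_blocks; ++i) {
      free_map[i] = false;
    }
  }
  return free_map;
}
```

The point of the ordering is that neither step writes to the kv store: the bluefs-owned space is masked out in memory only, so no transaction is needed.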