One more finding, Ramesh, while debugging this: I found that BitAllocator.cc includes /usr/include/assert.h. This collides with dout() (which I was trying to introduce) and gives a compilation error. Eventually, I had to comment out <assert.h> and use the ceph assert instead.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, August 11, 2016 8:10 PM
To: 'Sage Weil'; Allen Samuels
Cc: Ramesh Chander; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

Sage,
I tried your PR but it is not helping much. See this: each insert_free() call is taking ~40sec to complete, and we have 2 calls that are taking time.

2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free instance 140128595341440 off 0x2000 len 0x6ab7d14f000
2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist no more clear bits in 0x6ab7d100000
2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free instance 140127837929472 off 0x2000 len 0x6ab7d14f000
2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist no more clear bits in 0x6ab7d100000

I have also tried the following settings and they are not helping either:

bluestore_bluefs_min_ratio = .01
bluestore_freelist_blocks_per_key = 512

I did some debugging to find out which call inside this function is taking time, and I found this within BitAllocator::free_blocks:

debug_assert(is_allocated(start_block, num_blocks));
free_blocks_int(start_block, num_blocks);

I skipped this debug_assert and the total time dropped from ~80sec to ~49sec, so that's a significant improvement. Next, I found out that debug_assert(is_allocated()) is called from free_blocks_int as well.
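
To illustrate why a verification scan in front of a free can matter for a range this large (the length 0x6ab7d14f000 in the log is roughly 6.7 TiB), here is a minimal sketch. It is a hypothetical toy bitmap, not Ceph's BitAllocator: the class ToyBitmap, the verify flag, and the bit_ops counter are all assumptions made for illustration. It only shows that checking is_allocated() over the whole range before clearing it walks the same bits a second time.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical toy allocator (NOT Ceph's BitAllocator): one bit per block,
// set = allocated, clear = free. bit_ops() counts how many bits were touched.
class ToyBitmap {
 public:
  ToyBitmap(uint64_t num_blocks, bool allocated)
      : words_((num_blocks + 63) / 64,
               allocated ? ~uint64_t{0} : uint64_t{0}) {}

  // Linear scan: returns true only if every block in the range is allocated.
  bool is_allocated(uint64_t start, uint64_t count) const {
    for (uint64_t b = start; b < start + count; ++b) {
      ++bit_ops_;
      if (!(words_[b / 64] & (uint64_t{1} << (b % 64)))) return false;
    }
    return true;
  }

  // Clears the bits for [start, start + count). With verify == true the
  // range is scanned first, so every bit is visited twice.
  void free_blocks(uint64_t start, uint64_t count, bool verify) {
    if (verify && !is_allocated(start, count)) std::abort();
    for (uint64_t b = start; b < start + count; ++b) {
      ++bit_ops_;
      words_[b / 64] &= ~(uint64_t{1} << (b % 64));
    }
  }

  uint64_t bit_ops() const { return bit_ops_; }

 private:
  std::vector<uint64_t> words_;
  mutable uint64_t bit_ops_ = 0;  // instrumentation only
};
```

Freeing the same range with and without verify and comparing bit_ops() shows the doubling. Whether the real is_allocated() is linear in the same way is not established here; it is only consistent with the timings above.
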
I blindly commented out all debug_assert(is_allocated()) calls and performance became similar to stupid/filestore. I didn't look into is_allocated() any further, as my guess is that we can safely skip this check during mkfs() time? But it would be good to optimize it, as it may induce latency in the IO path (?).

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
Sent: Thursday, August 11, 2016 2:20 PM
To: Allen Samuels
Cc: Ramesh Chander; Somnath Roy; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> > Sent: Thursday, August 11, 2016 1:24 PM
> > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> > Cc: Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx>; Somnath Roy
> > <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > clarify as needed.
> > >
> > > I thought that the authoritative indication of space used by
> > > BlueFS was contained in the snapshot/journal of BlueFS itself, NOT
> > > in the KV store itself. This requires that upon startup, we replay
> > > the BlueFS snapshot/journal into the FreeListManager so that it
> > > properly records the consumption of BlueFS space (since that
> > > allocation MAY NOT be accurate within the FreeListmanager itself).
> > > But that this playback need not generate any KVStore operations
> > > (since those are duplicates of the BlueFS).
> > >
> > > So in the code you cite:
> > >
> > > fm->allocate(0, reserved, t);
> > >
> > > There's no need to commit 't', and in fact, in the general case,
> > > you don't want to commit 't'.
> > >
> > > That suggests to me that a version of allocate that doesn't have a
> > > transaction could be easily created and would have the speed we're
> > > looking for (and independence from the BitMapAllocator to KVStore chunking).
> >
> > Oh, I see. Yeah, you're right--this step isn't really necessary, as
> > long as we ensure that the auxiliary representation of what bluefs
> > owns (bluefs_extents in the superblock) is still passed into the
> > Allocator during initialization. Having the freelist reflect the
> > allocator that this space was "in use" (by bluefs) and thus off
> > limits to bluestore is simple but not strictly necessary.
> >
> > I'll work on a PR that avoids this...

https://github.com/ceph/ceph/pull/10698

Ramesh, can you give it a try?

> > > I suspect that we also have long startup times because we're doing
> > > the same underlying bitmap operations except they come from the
> > > BlueFS replay code instead of the BlueFS initialization code, but
> > > same problem with likely the same fix.
> >
> > BlueFS doesn't touch the FreelistManager (or explicitly persist the
> > freelist at all)... we initialize the in-memory Allocator state from
> > the metadata in the bluefs log. I think we should be fine on this end.
>
> Likely that code suffers from the same problem -- a false need to
> update the KV Store (During the playback, BlueFS extents are converted
> to bitmap runs, it's essentially the same lower level code as the case
> we're seeing now, but instead of being driven by an artificial "big
> run", it'll be driven from the BlueFS Journal replay code). But
> that's just a guess, I don't have time to track down the actual code right now.

BlueFS can't touch the freelist (or kv store, ever) since it ultimately backs the kv store and that would be problematic. We do initialize the bluefs Allocator's in-memory state, but that's it.

The PR above changes BlueStore::_init_alloc() so that BlueStore's Allocator state is initialized with both the freelist state (from the kv store) *and* the bluefs_extents list (from the bluestore superblock). (From this Allocator's perspective, all of bluefs's space is allocated and can't be used. BlueFS has its own separate instance to do its internal allocations.)

sage
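
The initialization described above can be pictured with a minimal sketch. This is not the code from the PR: ToyAllocator, init_alloc(), and the Extent alias are hypothetical names, and init_rm_free() is modelled here as a toy method whose real counterpart and signature are an assumption. The only idea it shows is seeding the in-memory allocator from the freelist's free extents and then marking everything listed in bluefs_extents as unavailable, with no kv store transaction involved.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// (start block, block count); a hypothetical representation for this sketch.
using Extent = std::pair<uint64_t, uint64_t>;

// Hypothetical in-memory allocator: one flag per block, true = free.
class ToyAllocator {
 public:
  explicit ToyAllocator(uint64_t num_blocks) : free_(num_blocks, false) {}

  void init_add_free(uint64_t start, uint64_t count) { mark(start, count, true); }
  void init_rm_free(uint64_t start, uint64_t count) { mark(start, count, false); }

  bool is_free(uint64_t block) const { return free_[block]; }
  uint64_t num_free() const {
    return static_cast<uint64_t>(std::count(free_.begin(), free_.end(), true));
  }

 private:
  void mark(uint64_t start, uint64_t count, bool value) {
    for (uint64_t b = start; b < start + count; ++b) free_[b] = value;
  }
  std::vector<bool> free_;
};

// Build allocator state from two sources: the persisted freelist (what the
// kv store says is free) and the extents owned by bluefs (from the
// superblock). Nothing is written back to the kv store.
ToyAllocator init_alloc(uint64_t num_blocks,
                        const std::vector<Extent>& freelist_free,
                        const std::vector<Extent>& bluefs_extents) {
  ToyAllocator alloc(num_blocks);
  for (const Extent& e : freelist_free) alloc.init_add_free(e.first, e.second);
  // From this allocator's point of view, bluefs space is simply in use.
  for (const Extent& e : bluefs_extents) alloc.init_rm_free(e.first, e.second);
  return alloc;
}
```

With a 100-block device whose freelist reports everything free and a bluefs extent covering blocks 10 through 29, the resulting allocator has 80 free blocks and refuses to hand out the bluefs range.
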