FYI, with latest master (optimized is_allocated()) the osd uptime and mkfs() time is *almost* similar to the stupid allocator (~2.5X slower now, compared to 16X before).
As we discussed today, removing debug_assert(is_allocated()) altogether from the mkfs() part should resolve this gap as well...

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Friday, August 12, 2016 8:44 AM
To: 'Sage Weil'; Ramesh Chander
Cc: Allen Samuels; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
Sent: Friday, August 12, 2016 8:26 AM
To: Ramesh Chander
Cc: Allen Samuels; Somnath Roy; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

On Thu, 11 Aug 2016, Sage Weil wrote:
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> > > Sent: Thursday, August 11, 2016 1:24 PM
> > > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> > > Cc: Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx>; Somnath Roy
> > > <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > > Subject: RE: Bluestore different allocator performance Vs
> > > FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > clarify as needed.
> > > >
> > > > I thought that the authoritative indication of space used by
> > > > BlueFS was contained in the snapshot/journal of BlueFS itself,
> > > > NOT in the KV store itself. This requires that upon startup, we
> > > > replay the BlueFS snapshot/journal into the FreeListManager so
> > > > that it properly records the consumption of BlueFS space (since
> > > > that allocation MAY NOT be accurate within the FreeListManager
> > > > itself). But that this playback need not generate any KVStore
> > > > operations (since those are duplicates of the BlueFS).
> > > >
> > > > So in the code you cite:
> > > >
> > > > fm->allocate(0, reserved, t);
> > > >
> > > > There's no need to commit 't', and in fact, in the general case,
> > > > you don't want to commit 't'.
> > > >
> > > > That suggests to me that a version of allocate that doesn't have
> > > > a transaction could be easily created and would have the speed
> > > > we're looking for (and independence from the BitMapAllocator to
> > > > KVStore chunking).
> > >
> > > Oh, I see. Yeah, you're right--this step isn't really necessary,
> > > as long as we ensure that the auxiliary representation of what
> > > bluefs owns (bluefs_extents in the superblock) is still passed
> > > into the Allocator during initialization. Having the freelist
> > > reflect that this space was "in use" (by bluefs) and thus off
> > > limits to bluestore is simple but not strictly necessary.
> > >
> > > I'll work on a PR that avoids this...
>
> https://github.com/ceph/ceph/pull/10698
>
> Ramesh, can you give it a try?
>
> > > > I suspect that we also have long startup times because we're
> > > > doing the same underlying bitmap operations except they come
> > > > from the BlueFS replay code instead of the BlueFS initialization
> > > > code, but same problem with likely the same fix.
> > >
> > > BlueFS doesn't touch the FreelistManager (or explicitly persist
> > > the freelist at all)... we initialize the in-memory Allocator
> > > state from the metadata in the bluefs log. I think we should be
> > > fine on this end.
> >
> > Likely that code suffers from the same problem -- a false need to
> > update the KV Store (During the playback, BlueFS extents are
> > converted to bitmap runs, it's essentially the same lower level code
> > as the case we're seeing now, but instead of being driven by an
> > artificial "big run", it'll be driven from the BlueFS Journal
> > replay code). But that's just a guess, I don't have time to track
> > down the actual code right now.
>
> BlueFS can't touch the freelist (or kv store, ever) since it
> ultimately backs the kv store and that would be problematic. We do
> initialize the bluefs Allocator's in-memory state, but that's it.
>
> The PR above changes BlueStore::_init_alloc() so that BlueStore's
> Allocator state is initialized with both the freelist state (from
> the kv store) *and* the bluefs_extents list (from the bluestore
> superblock). (From this Allocator's perspective, all of bluefs's
> space is allocated and can't be used. BlueFS has its own separate
> instance to do its internal allocations.)

Ah, okay, so after our conversation in standup I went and looked at the code some more and realized I've been thinking about the BitmapFreelistManager and not the BitMapAllocator. The ~40s is all CPU time spent updating in-memory bits, and has nothing to do with pushing updates through rocksdb. Sorry for the confusing conversation.

So... I think there is one thing we can do: change the initialization of the allocator state from the freelist so that the assumption is that space is free and we tell it what is allocated (currently we assume everything is allocated and tell it what is free). I'm not sure it's worth it, though: we'll just make things slower to start up on a full OSD instead of slower on an empty OSD. And it seems like the CPU time really won't be significant anyway once the debugging stuff is taken out.

I think this PR

https://github.com/ceph/ceph/pull/10698

is still a good idea, though, since it avoids useless freelist kv work during mkfs.

Does that sound right? Or am I still missing something?

Thanks for your patience!

[Somnath] Yes, I think it is a good idea; it seems it will also reduce some kv operations in the IO path, because _balance_bluefs_freespace is in the IO path (?)

sage
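
A minimal sketch of the allocator-initialization trade-off discussed in the thread above: start with everything allocated and replay the free extents (the current behaviour as described), or start with everything free and replay the allocated extents. ToyBitmap, Extent, init_from_free, init_from_allocated and init_with_bluefs are hypothetical names made up for illustration; this is not Ceph's BitMapAllocator and not the code in the PR. It only shows that the per-bit work follows whichever set of extents is replayed, and how a bluefs_extents list could be layered on top of the freelist state.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy model (NOT Ceph's BitMapAllocator): one bit per block,
// plus a counter of the per-bit writes done by mark() calls.
struct ToyBitmap {
  std::vector<bool> allocated;  // true = block is in use
  std::size_t bit_writes = 0;   // work done after construction

  ToyBitmap(std::size_t nblocks, bool initially_allocated)
    : allocated(nblocks, initially_allocated) {}

  // Set blocks [start, start + len) to the given state, one bit at a time.
  void mark(std::size_t start, std::size_t len, bool is_allocated) {
    assert(start + len <= allocated.size());
    for (std::size_t i = start; i < start + len; ++i) {
      allocated[i] = is_allocated;
      ++bit_writes;
    }
  }
};

using Extent = std::pair<std::size_t, std::size_t>;  // (start, length) in blocks

// Assume everything is allocated, then replay the free extents.
ToyBitmap init_from_free(std::size_t nblocks,
                         const std::vector<Extent>& free_extents) {
  ToyBitmap bm(nblocks, true);
  for (const auto& e : free_extents) bm.mark(e.first, e.second, false);
  return bm;
}

// Assume everything is free, then replay the allocated extents.
ToyBitmap init_from_allocated(std::size_t nblocks,
                              const std::vector<Extent>& allocated_extents) {
  ToyBitmap bm(nblocks, false);
  for (const auto& e : allocated_extents) bm.mark(e.first, e.second, true);
  return bm;
}

// Freelist state first, then mark everything bluefs owns as allocated so
// that this allocator never hands that space out.
ToyBitmap init_with_bluefs(std::size_t nblocks,
                           const std::vector<Extent>& free_extents,
                           const std::vector<Extent>& bluefs_extents) {
  ToyBitmap bm = init_from_free(nblocks, free_extents);
  for (const auto& e : bluefs_extents) bm.mark(e.first, e.second, true);
  return bm;
}
```

For a mostly empty 1000-block device with 10 blocks in use, init_from_free does 990 per-bit writes while init_from_allocated does 10, and both end in the same state. On a nearly full device the ratio flips, which is the concern raised above: the change would move the slow case from an empty OSD to a full one.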