We always knew that startup time for the bitmap allocator would be somewhat longer. Still, the existing implementation can be sped up significantly. The code in BitMapZone::set_blocks_used isn't very optimized; converting it to use memset for everything but the first and last partial bytes should speed it up considerably. (A rough sketch of that split is at the bottom of this mail.)

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Wednesday, August 10, 2016 3:44 PM
> To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > << inline with [Somnath]
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> > Sent: Wednesday, August 10, 2016 2:31 PM
> > To: Somnath Roy
> > Cc: ceph-devel
> > Subject: Re: Bluestore different allocator performance Vs FileStore
> >
> > On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > Hi, I spent some time evaluating the different Bluestore allocator
> > > and freelist implementations. I also tried to gauge the performance
> > > difference between Bluestore and filestore on a similar setup.
> > >
> > > Setup:
> > > --------
> > >
> > > 16 OSDs (8TB Flash) across 2 OSD nodes.
> > >
> > > Single pool and single rbd image of 4TB. 2X replication.
> > >
> > > Disabled the exclusive lock feature so that I could run multiple write jobs in parallel.
> > > rbd_cache is disabled on the client side.
> > > Each test ran for 15 min.
> > >
> > > Result:
> > > ---------
> > >
> > > Here is the detailed report:
> > >
> > > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a250cb05986/Bluestore_allocator_comp.xlsx
> > >
> > > I named each profile <allocator>-<freelist>, so in the graphs, for example, "stupid-extent" means stupid allocator with extent freelist.
> > >
> > > I ran the tests for each profile in the following order, after creating a fresh rbd image for every Bluestore test:
> > >
> > > 1. 4K RW for 15 min with 16QD and 10 jobs.
> > >
> > > 2. 16K RW for 15 min with 16QD and 10 jobs.
> > >
> > > 3. 64K RW for 15 min with 16QD and 10 jobs.
> > >
> > > 4. 256K RW for 15 min with 16QD and 10 jobs.
> > >
> > > The above are the non-preconditioned cases, i.e. run before filling up the entire image. I don't see any reason to fill up the rbd image first, unlike the filestore case, which only gives stable performance once the image has been filled up (filling up the rbd image with filestore creates the files in the filesystem).
> > >
> > > 5. Next, I preconditioned the 4TB image with 1M seq writes. This is primarily because I want to load BlueStore with more data.
> > >
> > > 6. Ran the 4K RW test again for 15 min (this is called "preconditioned" in the profile).
> > >
> > > 7. Ran a 4K Seq test at a similar QD for 15 min.
> > >
> > > 8. Ran the 16K RW test again for 15 min.
> > >
> > > For the filestore tests, I preconditioned the entire image first.
> > >
> > > Each sheet in the xlsx has the results for a different block size; I often miss navigating through the sheets, so I thought of mentioning it here :-)
> > >
> > > I have also captured the mkfs time, OSD startup time and the memory usage after the entire run.
> > >
> > > Observation:
> > > ---------------
> > >
> > > 1. First of all, with the bitmap allocator, mkfs time (and thus cluster creation time for 16 OSDs) is ~16X slower than with the stupid allocator or filestore. Each OSD creation is taking ~2 min or so, and I nailed it down to the insert_free() call (marked ****) in the bitmap allocator:
> > >
> > > 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next start
> > > 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
> > > 2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913322803328 offset 0x4663d00000 length 0x69959451000
> > > ****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free instance 139913322803328 off 0x4663d00000 len 0x69959451000****
> > > ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next end****
> > > 2016-08-05 16:13:20.748978 7f4024d258c0 10 bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1 extents
> > >
> > > 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read buffered 0x4a14eb~265 of ^A:5242880+5242880
> > > 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
> > > 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
> > > 2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913306273920 offset 0x4663d00000 length 0x69959451000
> > > *****2016-08-05 16:13:23.438666 7f4024d258c0 20 bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len 0x69959451000*****
> > > *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist enumerate_next end
> >
> > I'm not sure there's any easy fix for this. We can amortize it by feeding space to bluefs slowly (so that we don't have to do all the inserts at once), but I'm not sure that's really better.
> >
> > [Somnath] I don't know that part of the code, so this may be a dumb question: this is during mkfs(), so can't we just tell bluefs that the entire space is free? I can understand that for OSD mount and all the other cases we need to feed in the free space every time.
> > IMO this is critical to fix, as cluster creation time will otherwise be (number of OSDs * 2 min). For me, creating a 16-OSD cluster takes ~32 min compared to ~2 min with the stupid allocator/filestore.
> > BTW, my drive's data partition is ~6.9TB, the db partition is ~100G and the WAL is ~1G. I guess the time taken depends on the data partition size as well(?)
>
> Well, we're fundamentally limited by the fact that it's a bitmap, and a big chunk of space is "allocated" to bluefs and needs to have 1's set.
>
> sage
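
Back-of-envelope on the numbers in the quoted log, treating a 4 KiB bitmap block size as an assumption (the allocator's configured block size may differ):

    extent length   0x69959451000 bytes  ~= 6757 GiB      (matches "loaded 6757 G")
    bits to set     0x69959451000 / 4096 ~= 1.77 billion
    insert_free     16:12:40.97 -> 16:13:20.74 ~= 40 s    => ~45 million bits/s

That rate is roughly what you'd expect from a loop touching one bit at a time; setting whole bytes or 64-bit words in bulk should shave one to two orders of magnitude off it.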
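
Here is a minimal sketch of the first-byte / memset / last-byte split I have in mind for set_blocks_used. It is illustrative only: the real BitMapZone keeps its bits in (atomic) words and maintains free-block counters, so this simplified byte-array class (SimpleBitmapZone, a made-up name) is not a drop-in patch, just the shape of the change.

// Sketch only: mark a run of blocks used by handling the partial bytes at
// the ends bit-by-bit and memset'ing the full bytes in between.
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

class SimpleBitmapZone {
  std::vector<uint8_t> bits_;                       // 1 bit per block, 1 == used

public:
  explicit SimpleBitmapZone(size_t nblocks) : bits_((nblocks + 7) / 8, 0) {}

  void set_blocks_used(size_t start, size_t len) {
    assert(len > 0);
    size_t end = start + len;                       // one past the last block
    assert((end + 7) / 8 <= bits_.size());

    size_t first_aligned = (start + 7) & ~size_t(7);  // first byte-aligned block >= start
    size_t last_aligned  = end & ~size_t(7);          // last byte boundary <= end

    if (first_aligned >= end) {
      // Whole run fits inside one byte: set bits individually.
      for (size_t b = start; b < end; ++b)
        bits_[b >> 3] |= uint8_t(1u << (b & 7));
      return;
    }

    // Leading partial byte, one bit at a time.
    for (size_t b = start; b < first_aligned; ++b)
      bits_[b >> 3] |= uint8_t(1u << (b & 7));

    // Full bytes in the middle: one memset instead of 8 bit ops per byte.
    if (last_aligned > first_aligned)
      std::memset(&bits_[first_aligned >> 3], 0xff,
                  (last_aligned - first_aligned) >> 3);

    // Trailing partial byte.
    for (size_t b = last_aligned; b < end; ++b)
      bits_[b >> 3] |= uint8_t(1u << (b & 7));
  }

  bool is_used(size_t block) const {
    return bits_[block >> 3] & (1u << (block & 7));
  }
};

The same idea works word-at-a-time if the zone stores 64-bit entries; the point is that the bulk of a large extent gets set with wide stores rather than a per-bit loop, which is exactly the case mkfs hits when it hands nearly the whole device to the allocator at once.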