On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> > Sent: Thursday, August 11, 2016 9:38 AM
> > To: Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx>
> > Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > I think the freelist does not initialize all keys at mkfs time; it only sets keys that have some allocations.
> > >
> > > The rest of the keys are assumed to be all 0's if the key does not exist.
> >
> > Right.. it's the region "allocated" to bluefs that is consuming the time.
> >
> > > The bitmap allocator insert_free is done on groups of free bits together (maybe more than one bitmap freelist key at a time).
> >
> > I think Allen is asking whether we are doing lots of inserts within a single rocksdb transaction, or lots of separate transactions.
> >
> > FWIW, my guess is that increasing the size of the value (i.e., increasing
> >
> > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> >
> > ) will probably speed this up.
>
> If your assumption (> Right.. it's the region "allocated" to bluefs that is consuming the time) is correct, then I don't understand why this parameter has any effect on the problem.
>
> Aren't we reading BlueFS extents and setting them in the BitMapAllocator? That doesn't care about the chunking of bitmap bits into KV keys.

I think this is something different.  During mkfs we take ~2% (or something like that) of the block device, mark it 'allocated' (from the bluestore freelist's perspective) and give it to bluefs.  On a large device that's a lot of bits to set.  Larger keys should speed that up.

The amount of space we start with comes from _open_db():

    uint64_t initial = bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
                                           g_conf->bluestore_bluefs_gift_ratio);
    initial = MAX(initial, g_conf->bluestore_bluefs_min);

Simply lowering min_ratio might also be fine.  The current value of 2% is meant to be enough for most stores, and to avoid giving over lots of little extents later (and making the bluefs_extents list too big).  That list can overflow the superblock, which is another annoying thing we need to address (though not a big deal to fix).

Anyway, adjusting bluestore_bluefs_min_ratio to .01 should ~halve the time spent on this.. that is probably another useful test to confirm this is what is going on.

sage

> I would be cautious about just changing this option to affect this problem (though as an experiment, we can change the value and see if it has ANY effect on this problem -- which I don't think it will). The value of this option really needs to be dictated by its effect on the more mainstream read/write operations, not on the initialization problem.
>
> > sage
> >
> > > -Ramesh
> > >
> > > > -----Original Message-----
> > > > From: Allen Samuels
> > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > To: Ramesh Chander
> > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > Subject: Re: Bluestore different allocator performance Vs FileStore
> > > >
> > > > Is the initial creation of the keys for the bitmap one by one or are they batched?
> > > >
> > > > Sent from my iPhone. Please excuse all typos and autocorrects.
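For reference, here is a small standalone sketch of the _open_db() sizing math quoted above (illustration only: the 8 TB device size matches the drives in this test and the 0.02 min_ratio matches the "2%" mentioned above, but the gift_ratio and bluestore_bluefs_min values below are placeholders, not necessarily the shipped defaults):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    int main() {
      // Values assumed for illustration; check your build's actual defaults.
      const uint64_t dev_size   = 8ull << 40;   // an 8 TB block device
      const double   min_ratio  = 0.02;         // bluestore_bluefs_min_ratio (the "2%" above)
      const double   gift_ratio = 0.0;          // bluestore_bluefs_gift_ratio (placeholder)
      const uint64_t bluefs_min = 1ull << 30;   // bluestore_bluefs_min (assumed 1 GB)

      // Same shape as the _open_db() computation quoted above.
      uint64_t initial = uint64_t(dev_size * (min_ratio + gift_ratio));
      initial = std::max(initial, bluefs_min);

      // All of this space gets marked "allocated" in the freelist at mkfs time,
      // so halving min_ratio roughly halves the bits insert_free has to set.
      printf("initial bluefs space: %.1f GB\n", initial / double(1ull << 30));
      return 0;
    }

With those numbers, dropping bluestore_bluefs_min_ratio to .01 roughly halves initial, and with it the number of bits insert_free has to set at mkfs time.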
> > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx> wrote:
> > > > >
> > > > > Somnath,
> > > > >
> > > > > Basically mkfs time has increased from 7.5 seconds (2 min / 16) to 2 minutes (32 min / 16).
> > > > >
> > > > > But is there a reason you should create OSDs serially? I think for multiple OSDs mkfs can happen in parallel?
> > > > >
> > > > > As a fix I am looking at batching multiple insert_free calls for now. If that still does not help, I am thinking of doing insert_free on different parts of the device in parallel.
> > > > >
> > > > > -Ramesh
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Ramesh Chander
> > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > >> Cc: ceph-devel
> > > > >> Subject: RE: Bluestore different allocator performance Vs FileStore
> > > > >>
> > > > >> I think insert_free is limited by the speed of the clear_bits function here.
> > > > >>
> > > > >> set_bits and clear_bits have the same logic, except that one sets and the other clears. Both of them do 64 bits (the bitmap word size) at a time.
> > > > >>
> > > > >> I am not sure if doing a memset will make it faster. But if we can do it for a group of bitmaps, then it might help.
> > > > >>
> > > > >> I am looking into the code to see if we can handle mkfs and OSD mount in a special way to make them faster.
> > > > >>
> > > > >> If I don't find an easy fix, we can go down the path of deferring init to a later stage, as and when required.
> > > > >>
> > > > >> -Ramesh
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Allen Samuels
> > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > >>> To: Sage Weil; Somnath Roy
> > > > >>> Cc: ceph-devel
> > > > >>> Subject: RE: Bluestore different allocator performance Vs FileStore
> > > > >>>
> > > > >>> We always knew that startup time for the bitmap stuff would be somewhat longer. Still, the existing implementation can be sped up significantly. The code in BitMapZone::set_blocks_used isn't very optimized. Converting it to use memset for all but the first/last bytes should significantly speed it up.
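To make the memset suggestion concrete, here is a minimal sketch of the "bit-twiddle the partial first/last bytes, memset everything in between" pattern (set_bit_run and the flat uint8_t array are hypothetical stand-ins for illustration; the real BitMapZone code is organized differently and, as noted above, works a 64-bit word at a time):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Mark the run of bits [start, start+len) as used in a byte-addressed bitmap.
    // Illustration only; not the actual BitMapZone::set_blocks_used.
    void set_bit_run(uint8_t* bitmap, uint64_t start, uint64_t len) {
      if (len == 0) return;
      const uint64_t end        = start + len;      // one past the last bit
      const uint64_t first_full = (start + 7) / 8;  // first wholly covered byte
      const uint64_t last_full  = end / 8;          // one past last wholly covered byte

      if (first_full > last_full) {
        // Run lies inside a single byte: set the bits individually.
        for (uint64_t b = start; b < end; ++b)
          bitmap[b / 8] |= uint8_t(1u << (b % 8));
        return;
      }
      // Leading partial byte, bit by bit.
      for (uint64_t b = start; b < first_full * 8; ++b)
        bitmap[b / 8] |= uint8_t(1u << (b % 8));
      // Middle: whole bytes in one memset instead of a per-bit or per-word loop.
      if (last_full > first_full)
        memset(bitmap + first_full, 0xff, last_full - first_full);
      // Trailing partial byte, bit by bit.
      for (uint64_t b = last_full * 8; b < end; ++b)
        bitmap[b / 8] |= uint8_t(1u << (b % 8));
    }

    int main() {
      uint8_t bm[16] = {0};
      set_bit_run(bm, 5, 70);   // crosses several byte boundaries
      for (int i = 0; i < 16; ++i) printf("%02x ", bm[i]);
      printf("\n");
      return 0;
    }

The same idea applies per 64-bit word if the zone stores words rather than bytes: handle the two partial words at the edges, then fill the whole words in between in one shot.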
> > > > >>>
> > > > >>>
> > > > >>> Allen Samuels
> > > > >>> SanDisk | a Western Digital brand
> > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > >>> T: +1 408 801 7030 | M: +1 408 780 6416
> > > > >>> allen.samuels@xxxxxxxxxxx
> > > > >>>
> > > > >>>
> > > > >>>> -----Original Message-----
> > > > >>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > >>>> To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> > > > >>>> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > > > >>>> Subject: RE: Bluestore different allocator performance Vs FileStore
> > > > >>>>
> > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > >>>>> << inline with [Somnath]
> > > > >>>>>
> > > > >>>>> -----Original Message-----
> > > > >>>>> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > >>>>> To: Somnath Roy
> > > > >>>>> Cc: ceph-devel
> > > > >>>>> Subject: Re: Bluestore different allocator performance Vs FileStore
> > > > >>>>>
> > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > >>>>>> Hi, I spent some time evaluating the performance of the different Bluestore allocators and freelists. I also tried to gauge the performance difference between Bluestore and filestore on a similar setup.
> > > > >>>>>>
> > > > >>>>>> Setup:
> > > > >>>>>> --------
> > > > >>>>>>
> > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > >>>>>>
> > > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > > >>>>>>
> > > > >>>>>> Disabled the exclusive lock feature so that I can run multiple write jobs in parallel.
> > > > >>>>>> rbd_cache is disabled on the client side.
> > > > >>>>>> Each test ran for 15 mins.
> > > > >>>>>>
> > > > >>>>>> Result:
> > > > >>>>>> ---------
> > > > >>>>>>
> > > > >>>>>> Here is the detailed report on this: https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a250cb05986/Bluestore_allocator_comp.xlsx
> > > > >>>>>>
> > > > >>>>>> I named each profile <allocator>-<freelist>, so in the graphs, for example, "stupid-extent" means stupid allocator and extent freelist.
> > > > >>>>>>
> > > > >>>>>> I ran the test for each of the profiles in the following order, after creating a fresh rbd image for all of the Bluestore tests.
> > > > >>>>>>
> > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > >>>>>>
> > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > >>>>>>
> > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > >>>>>>
> > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > >>>>>>
> > > > >>>>>> The above are the non-preconditioned cases, i.e. run before filling up the entire image. The reason is that I don't see any point in filling up the rbd image first, unlike the filestore case, which gives stable performance only if we fill up the rbd images first. Filling up rbd images in the case of filestore will create the files in the filesystem.
> > > > >>>>>>
> > > > >>>>>> 5. Next, I preconditioned the 4TB image with 1M seq writes. This is primarily because I want to load BlueStore with more data.
> > > > >>>>>>
> > > > >>>>>> 6. Ran the 4K RW test again (this is called out as preconditioned in the profile) for 15 min.
> > > > >>>>>>
> > > > >>>>>> 7. Ran a 4K Seq test for a similar QD for 15 min.
> > > > >>>>>>
> > > > >>>>>> 8. Ran the 16K RW test again for 15 min.
> > > > >>>>>>
> > > > >>>>>> For the filestore test, I ran the tests after preconditioning the entire image first.
> > > > >>>>>>
> > > > >>>>>> Each sheet in the xls has the results for a different block size; I often forget to navigate through the xls sheets, so I thought of mentioning it here :-)
> > > > >>>>>>
> > > > >>>>>> I have also captured the mkfs time, OSD startup time and the memory usage after the entire run.
> > > > >>>>>>
> > > > >>>>>> Observation:
> > > > >>>>>> ---------------
> > > > >>>>>>
> > > > >>>>>> 1. First of all, with the bitmap allocator the mkfs time (and thus the cluster creation time for 16 OSDs) is ~16X slower than the stupid allocator and filestore. Each OSD creation is taking ~2 min or so sometimes, and I nailed down that the insert_free() function call (marked ****) in the bitmap allocator is causing it:
> > > > >>>>>>
> > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next start
> > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
> > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913322803328 offset 0x4663d00000 length 0x69959451000
> > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free instance 139913322803328 off 0x4663d00000 len 0x69959451000****
> > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next end****
> > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10 bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1 extents
> > > > >>>>>>
> > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read buffered 0x4a14eb~265 of ^A:5242880+5242880
> > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
> > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
> > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913306273920 offset 0x4663d00000 length 0x69959451000
> > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20 bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len 0x69959451000*****
> > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist enumerate_next end
> > > > >>>>>
> > > > >>>>> I'm not sure there's any easy fix for this.  We can amortize it by feeding space to bluefs slowly (so that we don't have to do all the inserts at once), but I'm not sure that's really better.
> > > > >>>>>
> > > > >>>>> [Somnath] I don't know that part of the code, so this may be a dumb question.
> > > > >>>>> This is during mkfs() time, so can't we tell bluefs that the entire space is free? I can understand that for OSD mount and all the other cases we need to feed in the free space every time.
> > > > >>>>> IMO this is critical to fix, as cluster creation time will otherwise be number of OSDs * 2 min. For me, creating a 16-OSD cluster is taking ~32 min compared to ~2 min for the stupid allocator/filestore.
> > > > >>>>> BTW, my drive data partition is ~6.9TB, the db partition is ~100G and the WAL is ~1G. I guess the time taken is dependent on the data partition size as well (?)
> > > > >>>>
> > > > >>>> Well, we're fundamentally limited by the fact that it's a bitmap, and a big chunk of space is "allocated" to bluefs and needs to have 1's set.
> > > > >>>>
> > > > >>>> sage
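For a sense of scale on that last point, a rough back-of-envelope (the ~6.9 TB data partition and the ~2% gift come from the numbers above; the 4 KB allocation unit is an assumption for illustration, the real min_alloc_size may differ):

    #include <cstdint>
    #include <cstdio>

    int main() {
      // Values assumed for illustration.
      const double   data_bytes = 6.9 * (1ull << 40);  // ~6.9 TB data partition
      const double   gift_bytes = data_bytes * 0.02;   // ~2% handed to bluefs at mkfs
      const uint64_t alloc_unit = 4096;                // assumed allocation unit

      const double bits = gift_bytes / alloc_unit;     // bits flipped to "allocated"
      printf("gift ~%.0f GB -> ~%.0f million bits (~%.1f MB of bitmap)\n",
             gift_bytes / (1ull << 30), bits / 1e6, bits / 8 / (1u << 20));
      return 0;
    }

Under those assumptions only a few MB of bitmap are actually touched, so the ~40 s per OSD seen in the log above presumably goes to how the bits are set (per-bit/per-word loops plus the per-key kv updates) rather than to the raw amount of data, which is consistent with the memset and batching ideas earlier in the thread.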