Re: Bluestore different allocator performance Vs FileStore


 



On Wed, Aug 10, 2016 at 6:44 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 10 Aug 2016, Somnath Roy wrote:
>> << inline with [Somnath]
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
>> Sent: Wednesday, August 10, 2016 2:31 PM
>> To: Somnath Roy
>> Cc: ceph-devel
>> Subject: Re: Bluestore different allocator performance Vs FileStore
>>
>> On Wed, 10 Aug 2016, Somnath Roy wrote:
>> > Hi, I spent some time evaluating the performance of different Bluestore
>> > allocators and freelists. I also tried to gauge the performance difference
>> > between Bluestore and filestore on a similar setup.
>> >
>> > Setup:
>> > --------
>> >
>> > 16 OSDs (8TB Flash) across 2 OSD nodes
>> >
>> > Single pool and single rbd image of 4TB. 2X replication.
>> >
>> > Disabled the exclusive lock feature so that I can run multiple write jobs in parallel.
>> > rbd_cache is disabled on the client side.
>> > Each test ran for 15 mins.
>> >
>> > Result :
>> > ---------
>> >
>> > Here is the detailed report on this.
>> >
>> > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a25
>> > 0cb05986/Bluestore_allocator_comp.xlsx
>> >
>> > I named each profile <allocator>-<freelist>, so in the graph, for example, "stupid-extent" means stupid allocator and extent freelist.
>> >
>> > I ran the tests for each profile in the following order, after creating a fresh rbd image for every Bluestore test.
>> >
>> > 1. 4K RW for 15 min with 16QD and 10 jobs.
>> >
>> > 2. 16K RW for 15 min with 16QD and 10 jobs.
>> >
>> > 3. 64K RW for 15 min with 16QD and 10 jobs.
>> >
>> > 4. 256K RW for 15 min with 16QD and 10 jobs.
>> >
>> > The above are the non-preconditioned cases, i.e. they ran before filling up the entire image. I don't see any reason to fill up the rbd image first here, unlike the filestore case, where filling up the rbd images first creates the files in the filesystem and so gives stable performance.
>> >
>> > 5. Next, I did precondition the 4TB image with 1M seq write. This is primarily because I want to load BlueStore with more data.
>> >
>> > 6. Ran 4K RW test again (this is called out preconditioned in the
>> > profile) for 15 min
>> >
>> > 7. Ran 4K Seq test for similar QD for 15 min
>> >
>> > 8. Ran 16K RW test again for 15min
>> >
>> > For filestore test, I ran tests after preconditioning the entire image first.
>> >
>> > Each sheet in the xls has the results for a different block size. I often
>> > forget to navigate through the xls sheets, so I thought of mentioning it here
>> > :-)
>> >
>> > I have also captured the mkfs time , OSD startup time and the memory usage after the entire run.
>> >
>> > Observation:
>> > ---------------
>> >
>> > 1. First of all, with the bitmap allocator, mkfs time (and thus cluster creation time for 16 OSDs) is ~16X slower than with the stupid allocator and filestore. Each OSD creation sometimes takes ~2 min or so, and I tracked it down to the insert_free() function call (marked ****) in the Bitmap allocator.
>> >
>> > 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next
>> > start
>> > 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next
>> > 0x4663d00000~69959451000
>> > 2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free
>> > instance 139913322803328 offset 0x4663d00000 length 0x69959451000
>> > ****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free
>> > instance 139913322803328 off 0x4663d00000 len 0x69959451000****
>> > ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next
>> > end****
>> > 2016-08-05 16:13:20.748978 7f4024d258c0 10
>> > bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1
>> > extents
>> >
>> > 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read
>> > buffered 0x4a14eb~265 of ^A:5242880+5242880
>> > 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
>> > 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next
>> > 0x4663d00000~69959451000
>> > 2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free
>> > instance 139913306273920 offset 0x4663d00000 length 0x69959451000
>> > *****2016-08-05 16:13:23.438666 7f4024d258c0 20
>> > bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len
>> > 0x69959451000*****
>> > *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
>> > enumerate_next end
>>
>> I'm not sure there's any easy fix for this. We can amortize it by feeding space to bluefs slowly (so that we don't have to do all the inserts at once), but I'm not sure that's really better.
>>
>> [Somnath] I don't know that part of the code, so this may be a dumb question. This is during mkfs() time, so can't we tell bluefs that the entire space is free? I can understand that for OSD mount and all other cases we need to feed the free space every time.
>> IMO this is critical to fix, as cluster creation time will otherwise be number of OSDs * 2 min. For me, creating a 16-OSD cluster takes ~32 min compared to ~2 min for stupid allocator/filestore.
>> BTW, my drive data partition is ~6.9TB, db partition is ~100G and WAL is ~1G. I guess the time taken depends on the data partition size as well (?)
>
> Well, we're fundamentally limited by the fact that it's a bitmap, and a
> big chunk of space is "allocated" to bluefs and needs to have 1's set.
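
To put rough numbers on that, here is a small Python sketch that takes the
timestamps and extent length from the first insert_free() log excerpt quoted
above and works out how long the call took and how many bits a flat bitmap
would need for that extent. The 4 KiB block size is an assumption for
illustration only; the allocation unit is not stated in this thread.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two timestamps in the OSD log format."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Copied from the first log excerpt: insert_free start, enumerate_next end.
INSERT_START = "2016-08-05 16:12:40.975557"
INSERT_END = "2016-08-05 16:13:20.748934"
EXTENT_LEN = 0x69959451000  # bytes, from "len 0x69959451000"

BLOCK_SIZE = 4096  # ASSUMPTION: allocation unit, not given in the thread

secs = elapsed_seconds(INSERT_START, INSERT_END)
blocks = EXTENT_LEN // BLOCK_SIZE      # one bit per block in a flat bitmap
bitmap_bytes = (blocks + 7) // 8       # size of that flat bitmap in bytes

print(f"insert_free took {secs:.3f} s")
print(f"extent covers {EXTENT_LEN / 2**30:.0f} GiB = {blocks} blocks")
print(f"flat bitmap for the extent: {bitmap_bytes / 2**20:.0f} MiB")
```

Under that assumption the single extent is about 1.77 billion bits, and the
log shows just under 40 seconds spent in each insert_free() pass.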

There's been a lot of research into compressed bitmaps (disk and
memory) in the last 10 years stemming from database index research.
Some of them can be decompressed at near memcpy speeds.

The current "best" compressed bitmap method when you require editing
is Roaring bitmaps. Link: http://roaringbitmap.org/ and links to the
research: http://arxiv.org/pdf/1603.06549.pdf ,
http://arxiv.org/pdf/1402.6407.pdf .

This could be useful not only for creation of the partition but also
for minimizing memory usage at runtime. And in the default case, you
can reserve as much space as is needed for the worst-case bitmap (no
compression), but in most cases end up using only a fraction of it.
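
As a rough illustration of why run-style compression helps here, below is a
toy Python sketch of a run-length bitmap. It is not Roaring's actual format
or API and not anything in the Ceph tree; the RunBitmap class and its methods
are hypothetical names. It only shows that marking one huge contiguous range
costs a single run update instead of one write per bit. The block count
assumes a 4 KiB block size for the 0x69959451000 byte extent in the log.

```python
import bisect


class RunBitmap:
    """Toy bitmap storing set bits as sorted, disjoint [start, end) runs."""

    def __init__(self):
        self.starts = []  # run start offsets, ascending
        self.ends = []    # matching exclusive run ends, ascending

    def set_range(self, start, end):
        """Set bits in [start, end), merging overlapping or adjacent runs."""
        if start >= end:
            return
        lo = bisect.bisect_left(self.ends, start)    # first run with end >= start
        hi = bisect.bisect_right(self.starts, end)   # first run with start > end
        if lo < hi:
            # Absorb every touched run into the new one.
            start = min(start, self.starts[lo])
            end = max(end, self.ends[hi - 1])
        self.starts[lo:hi] = [start]
        self.ends[lo:hi] = [end]

    def test(self, bit):
        """Return True if the given bit is set."""
        i = bisect.bisect_right(self.starts, bit) - 1
        return i >= 0 and bit < self.ends[i]

    def run_count(self):
        return len(self.starts)


# ASSUMPTION: 4 KiB blocks, so the extent from the log is this many bits.
NBLOCKS = 0x69959451000 // 4096

bm = RunBitmap()
bm.set_range(0, NBLOCKS)  # one run, regardless of how many bits it covers
print(bm.run_count(), bm.test(0), bm.test(NBLOCKS - 1), bm.test(NBLOCKS))
```

The trade-off is that a heavily fragmented device degrades toward one run per
free extent, which is where reserving the worst-case (uncompressed) size
comes in.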

>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx


