Re: Bluestore different allocator performance Vs FileStore

Ben England added parallel OSD creation to CBT a while back, which greatly sped up cluster creation time (not just for the bitmap allocator). I'm not sure if ceph-ansible creates OSDs in parallel, but if not he might have some insights into how easy it would be to improve it.

Mark

On 08/11/2016 02:11 AM, Somnath Roy wrote:
Yes, we can create OSDs in parallel, but I am not sure how many people create clusters that way, since on the ceph-deploy end there is no interface for it.
FYI, we have introduced some parallelism in the SanDisk installer wrapper script, which is based on ceph-deploy.
I don't think this problem will go away even with all this parallel OSD creation, but it will certainly be reduced a bit, as we have seen with OSD start time, which is inherently parallel.

Thanks & Regards
Somnath

-----Original Message-----
From: Ramesh Chander
Sent: Wednesday, August 10, 2016 11:07 PM
To: Allen Samuels; Sage Weil; Somnath Roy
Cc: ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

Somnath,

Basically, per-OSD mkfs time has increased from ~7.5 seconds (2 min / 16 OSDs) to ~2 minutes (32 min / 16 OSDs).

But is there a reason you have to create OSDs serially? I think mkfs for multiple OSDs can happen in parallel.

As a fix, I am looking at batching multiple insert_free calls for now. If that still does not help, I am thinking of doing insert_free on different parts of the device in parallel; a rough sketch of that direction follows below.
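A minimal sketch of the parallel direction, assuming a hypothetical parallel_insert_free() helper and a stand-in Allocator; the real BitmapAllocator's insert_free() and locking are different, so this is only the idea, not a patch:

    // Hypothetical sketch: split one large free extent into aligned slices
    // and feed each slice to insert_free() on its own thread.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Stand-in for the allocator; the real BitmapAllocator would update
    // per-zone bitmaps here instead of just counting.
    struct Allocator {
      std::mutex lock;
      uint64_t freed = 0;
      void insert_free(uint64_t offset, uint64_t length) {
        std::lock_guard<std::mutex> l(lock);  // real code would need finer-grained locking
        freed += length;
        printf("insert_free off=0x%llx len=0x%llx\n",
               (unsigned long long)offset, (unsigned long long)length);
      }
    };

    void parallel_insert_free(Allocator& alloc, uint64_t offset, uint64_t length,
                              unsigned nthreads, uint64_t align = 64 * 4096) {
      // slice size rounded up to the alignment so slices don't split a bitmap word
      uint64_t chunk = ((length / nthreads) / align + 1) * align;
      std::vector<std::thread> workers;
      for (unsigned i = 0; i < nthreads; ++i) {
        uint64_t off = offset + i * chunk;
        if (off >= offset + length)
          break;
        uint64_t len = std::min(chunk, offset + length - off);
        workers.emplace_back([&alloc, off, len] { alloc.insert_free(off, len); });
      }
      for (auto& t : workers)
        t.join();
    }

With a single global lock this buys nothing, of course; the point would be to partition the work per zone (or per device region) so the threads don't contend.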

-Ramesh

-----Original Message-----
From: Ramesh Chander
Sent: Thursday, August 11, 2016 10:04 AM
To: Allen Samuels; Sage Weil; Somnath Roy
Cc: ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

I think insert_free is limited by the speed of the clear_bits function here.

set_bits and clear_bits have the same logic, except that one sets and the other clears; both process 64 bits (one bitmap word) at a time.

I am not sure whether a memset will make it faster, but if we can do it for a group of bitmap words, it might help (see the sketch below).
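For illustration, a minimal sketch of a group-wise clear, assuming a flat array of 64-bit bitmap words and a hypothetical clear_bit_range() helper; the real BitAllocator is organized into zones, so this is only the shape of the idea:

    // Clear nbits bits starting at start_bit: mask the partial first/last
    // words, memset all full words in between in one call.
    #include <cstdint>
    #include <cstring>

    void clear_bit_range(uint64_t* bitmap, uint64_t start_bit, uint64_t nbits) {
      if (nbits == 0)
        return;
      uint64_t end_bit = start_bit + nbits;
      uint64_t first_word = start_bit / 64;
      uint64_t last_word = (end_bit - 1) / 64;

      if (first_word == last_word) {
        // run fits inside a single word
        uint64_t mask = ((~0ULL) >> (64 - nbits)) << (start_bit % 64);
        bitmap[first_word] &= ~mask;
        return;
      }
      // partial leading word
      bitmap[first_word] &= ~((~0ULL) << (start_bit % 64));
      // all full words in the middle, cleared with one memset
      if (last_word > first_word + 1)
        memset(&bitmap[first_word + 1], 0,
               (last_word - first_word - 1) * sizeof(uint64_t));
      // partial trailing word
      uint64_t tail_bits = end_bit % 64;
      if (tail_bits == 0)
        bitmap[last_word] = 0;  // run ends exactly on a word boundary
      else
        bitmap[last_word] &= ~((~0ULL) >> (64 - tail_bits));
    }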

I am looking into the code to see whether we can handle mkfs and OSD mount in a special way to make them faster.

If I don't find an easy fix, we can go down the path of deferring the init to a later stage, as and when required.

-Ramesh

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
owner@xxxxxxxxxxxxxxx] On Behalf Of Allen Samuels
Sent: Thursday, August 11, 2016 4:28 AM
To: Sage Weil; Somnath Roy
Cc: ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

We always knew that startup time for the bitmap allocator would be somewhat longer. Still, the existing implementation can be sped up significantly. The code in BitMapZone::set_blocks_used isn't very optimized; converting it to use memset for all but the first/last bytes should significantly speed it up. A rough sketch of the idea is below.
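Something along these lines, assuming a flat byte-addressed bitmap and a hypothetical set_block_range() helper rather than the actual BitMapZone internals (which also track used-block counts and locking):

    // Set nbits bits starting at start_bit: bit ops on the partial
    // first/last bytes, one memset(0xff) for every full byte in between.
    #include <cstdint>
    #include <cstring>

    void set_block_range(uint8_t* bitmap, uint64_t start_bit, uint64_t nbits) {
      if (nbits == 0)
        return;
      uint64_t end_bit = start_bit + nbits;
      uint64_t first_byte = start_bit / 8;
      uint64_t last_byte = (end_bit - 1) / 8;

      if (first_byte == last_byte) {
        // whole run lives in one byte
        bitmap[first_byte] |= uint8_t(((1u << nbits) - 1) << (start_bit % 8));
        return;
      }
      // partial first byte
      bitmap[first_byte] |= uint8_t(0xffu << (start_bit % 8));
      // full bytes in the middle
      if (last_byte > first_byte + 1)
        memset(&bitmap[first_byte + 1], 0xff, last_byte - first_byte - 1);
      // partial last byte
      uint8_t tail = end_bit % 8;
      bitmap[last_byte] |= tail ? uint8_t(0xffu >> (8 - tail)) : uint8_t(0xff);
    }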


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
Sent: Wednesday, August 10, 2016 3:44 PM
To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Subject: RE: Bluestore different allocator performance Vs
FileStore

On Wed, 10 Aug 2016, Somnath Roy wrote:
<< inline with [Somnath]

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
Sent: Wednesday, August 10, 2016 2:31 PM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Bluestore different allocator performance Vs
FileStore

On Wed, 10 Aug 2016, Somnath Roy wrote:
Hi, I spent some time evaluating the performance of the different Bluestore allocators and freelists. I also tried to gauge the performance difference between Bluestore and filestore on a similar setup.

Setup:
--------

16 OSDs (8TB Flash) across 2 OSD nodes

Single pool and single rbd image of 4TB. 2X replication.

Disabled the exclusive lock feature so that I can run multiple write jobs in parallel.
rbd_cache is disabled on the client side.
Each test ran for 15 mins.

Result :
---------

Here is the detailed report on this.




https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a250cb05986/Bluestore_allocator_comp.xlsx

I named each profile <allocator>-<freelist>, so in the graph, for example, "stupid-extent" means the stupid allocator with the extent freelist.

I ran the tests for each profile in the following order, creating a fresh rbd image for every Bluestore test.

1. 4K RW for 15 min with 16QD and 10 jobs.

2. 16K RW for 15 min with 16QD and 10 jobs.

3. 64K RW for 15 min with 16QD and 10 jobs.

4. 256K RW for 15 min with 16QD and 10 jobs.

The above are the non-preconditioned cases, i.e., run before filling up the entire image. I don't see any reason to fill up the rbd image first here, unlike the filestore case, where performance only becomes stable after the rbd image is filled, since filling up rbd images on filestore creates the files in the filesystem.

5. Next, I preconditioned the 4TB image with 1M sequential writes, primarily because I wanted to load BlueStore with more data.

6. Ran the 4K RW test again for 15 min (this is called out as preconditioned in the profile).

7. Ran a 4K Seq test with a similar QD for 15 min.

8. Ran the 16K RW test again for 15 min.

For the filestore tests, I ran everything after preconditioning the entire image first.

Each sheet in the xlsx has the results for a different block size. I often forget to navigate through the sheets, so I thought I'd mention it here :-)

I have also captured the mkfs time, OSD startup time, and memory usage after the entire run.

Observation:
---------------

1. First of all, with the bitmap allocator, mkfs time (and thus cluster creation time for 16 OSDs) is ~16X slower than with the stupid allocator and filestore.
Each OSD creation sometimes takes ~2 min or so, and I nailed it down to the insert_free() function call (marked ****) in the bitmap allocator.

2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next start
2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913322803328 offset 0x4663d00000 length 0x69959451000
****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free instance 139913322803328 off 0x4663d00000 len 0x69959451000****
****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next end****
2016-08-05 16:13:20.748978 7f4024d258c0 10 bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1 extents

2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read buffered 0x4a14eb~265 of ^A:5242880+5242880
2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913306273920 offset 0x4663d00000 length 0x69959451000
*****2016-08-05 16:13:23.438666 7f4024d258c0 20 bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len 0x69959451000*****
*****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist enumerate_next end

I'm not sure there's any easy fix for this. We can amortize it
by feeding
space to bluefs slowly (so that we don't have to do all the
inserts at once), but I'm not sure that's really better.

[Somnath] I don't know that part of the code, so this may be a dumb question: since this is during mkfs() time, can't we just tell bluefs the entire space is free? I can understand that for OSD mount and all the other cases we need to feed in the free space every time.
IMO this is critical to fix, as cluster creation time will otherwise be (number of OSDs * 2 min). For me, creating a 16-OSD cluster takes ~32 min compared to ~2 min with the stupid allocator/filestore.
BTW, my drive's data partition is ~6.9TB, the db partition is ~100G, and the WAL is ~1G. I guess the time taken depends on the data partition size as well (?)

Well, we're fundamentally limited by the fact that it's a bitmap,
and a big chunk of space is "allocated" to bluefs and needs to have 1's set.

sage


