On 11/03/2017 02:43 PM, Mark Nelson wrote:
On 11/03/2017 08:25 AM, Wido den Hollander wrote:
On 3 November 2017 at 13:33, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
On 11/03/2017 02:44 AM, Wido den Hollander wrote:
On 3 November 2017 at 0:09, Nigel Williams
<nigel.williams@xxxxxxxxxxx> wrote:
On 3 November 2017 at 07:45, Martin Overgaard Hansen
<moh@xxxxxxxxxxxxx> wrote:
I want to bring this subject back into the light and hope someone
can provide insight regarding the issue. Thanks.
Thanks Martin, I was going to do the same.
Is it possible to make the DB partition (on the fastest device) too
big? In other words, is there a point where, for a given set of OSDs
(number + size), the DB partition is sized too large and is wasting
resources? I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions, one per OSD.
It depends on the size of your backing disk. The DB will grow with
the number of objects you have on your OSD.
A 4TB drive will (usually) hold more objects than a 1TB drive; the
same goes for a 10TB vs a 6TB drive.
From what I've seen so far there is no such thing as a 'too big' DB.
The tests I've done so far suggest that filling up a 50GB DB is
rather hard to do. It's a different story, though, if you have
billions of objects and thus tens of millions of objects per OSD.
Are you doing RBD, RGW, or something else to test? What size are the
objects, and are you fragmenting them?
Let's say the avg overhead is 16k per object; then you would need a
~150GB DB for 10M objects.
You could look into your current numbers and check how many objects
you have per OSD.
I checked a couple of Ceph clusters I run and see about 1M objects
per OSD, but others only have 250k objects per OSD.
In all those cases, even with 32k per object, you would need a 30GB
DB for the 1M objects on that OSD.
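As a quick back-of-the-envelope check, here is a minimal sketch of that estimate in Python (the 16k/32k overhead figures are the assumptions from this thread, not measured constants):

    # DB size estimate: objects on one OSD times per-object RocksDB overhead.
    # Overhead values are assumptions from this thread, not fixed constants.
    def db_size_gib(objects, overhead_kib):
        return objects * overhead_kib / 1024.0 / 1024.0

    print(db_size_gib(10e6, 16))  # ~152.6 GiB -> the ~150GB for 10M objects
    print(db_size_gib(1e6, 32))   # ~30.5 GiB  -> the 30GB for 1M objects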
The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change (update?) intensity, size of OSD, etc.,
plus a rule of thumb.
I would check your running Ceph clusters and calculate the number of
objects per OSD:
total objects / num osd * 3
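For example (hypothetical cluster numbers; the 3 is the replication factor, since every replica stores its own onode):

    # Objects (onodes) per OSD = total objects x replicas / number of OSDs.
    # All numbers below are hypothetical; take the totals from 'rados df'.
    total_objects = 50 * 10**6
    num_osds = 150
    replicas = 3
    print(total_objects * replicas / num_osds)  # -> 1,000,000 onodes per OSD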
One nagging concern I have in the back of my mind is that the amount of
space amplification in RocksDB might grow with the number of levels
(i.e. the number of objects). The space used per object might be
different at 10M objects than at 50M objects.
True. But how many systems do we have out there with 10M objects in
ONE OSD?
The systems I checked range from 250k to 1M objects per OSD. Of
course, statistics aren't a golden rule, but users will want some
guideline on how to size their DB.
That's actually something I would really like better insight into. I
don't feel like I have a sufficient understanding of how many
objects/OSD people are really deploying in the field. I figure 10M/OSD
is probably a reasonable "typical" upper limit for HDDs, but I could see
some use cases with flash-backed OSDs pushing far more.
A few months later I've gathered some more data and wrote a script to
quickly query it on OSDs:
https://gist.github.com/wido/875d531692a922d608b9392e1766405d
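For reference, a minimal Python sketch of the same idea, assuming the Luminous BlueStore perf counters 'bluestore.bluestore_onodes' and 'bluefs.db_used_bytes' and the default admin socket path (verify both on your own version with 'ceph daemon osd.N perf dump'):

    # Report bytes-per-onode for every OSD on this host via the admin socket.
    # Counter names and socket path are assumptions based on Luminous.
    import glob, json, subprocess

    for sock in glob.glob('/var/run/ceph/ceph-osd.*.asok'):
        perf = json.loads(subprocess.check_output(
            ['ceph', 'daemon', sock, 'perf', 'dump']))
        onodes = perf['bluestore']['bluestore_onodes']
        db_used = perf['bluefs']['db_used_bytes']
        if onodes:
            print('%s: %d onodes, %.1f KiB per onode'
                  % (sock, onodes, db_used / onodes / 1024.0))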
I fetched information from a few systems running with BlueStore.
So far the largest value I found on systems running RBD is 24k per
onode.
That OSD reported 70k onodes in its database with a total DB size of
about 1.5GB.
As most deployments I see out there are RBD, those are the ones I can
get the most information from.
The avg object size I saw was 2.8MB.
So let's say you would like to fill an OSD with 2TB of data. With an
avg object size of 2.8MB you would have ~714k objects on that OSD.
714k objects * 24k per onode = 16GB DB
The rule of thumb I've been using now is 10GB of DB per 1TB of OSD
storage.
For now this seems to work out in all the cases I have seen. I'm not
saying it applies to every case, but the cases I've seen so far seem
to hold up.
If your average object size drops, you will get more onodes per TB and
thus a larger DB.
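To illustrate, a small sketch of how the rule of thumb shifts with average object size, using the ~24k/onode figure from above as the assumption:

    # GiB of DB needed per TB of data, as a function of average object size.
    # 24 KiB/onode is the value observed above on RBD, not a universal constant.
    ONODE_KIB = 24.0
    for avg_mb in (0.5, 1.0, 2.8, 4.0):
        onodes_per_tb = 1e6 / avg_mb  # objects in 1 TB of data
        db_gib_per_tb = onodes_per_tb * ONODE_KIB / 1024.0 / 1024.0
        print('%.1f MB avg -> %.1f GiB DB per TB' % (avg_mb, db_gib_per_tb))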
I'm just trying to gather information so people designing their system
have something to work with.
Wido
The WAL should be sufficient at 1GB-2GB, right?
Yep. On the surface this appears to be a simple question, but a much
deeper question is what are we actually doing with the WAL? How should
we be storing PG log and dup ops data? How can we get away from the
large WAL buffers and memtables we have now? These are questions we are
actively working on solving. For the moment though, having multiple (4)
256MB WAL buffers appears to give us the best performance despite
resulting in large memtables, so 1-2GB for the WAL is right.
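For reference, that layout matches the RocksDB tuning BlueStore ships with in Luminous; a sketch of the relevant ceph.conf excerpt (check your own version's defaults before overriding anything):

    [osd]
    # 4 write buffers x 256MB = ~1GB of memtables, per the discussion above
    bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,write_buffer_size=268435456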
Mark
An idea occurred to me: by monitoring for the logged spill message
(the event when the DB partition spills over to the main OSD device),
OSDs could be (lazily) destroyed and recreated with a new DB partition,
increased in size by, say, 10% each time.
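A rough sketch of what that monitoring could look like, assuming the bluefs 'slow_used_bytes' perf counter (a non-zero value means the DB has spilled onto the slow device; counter name based on Luminous):

    # Flag OSDs whose RocksDB has overflowed the DB partition onto the
    # main (slow) device.
    import glob, json, subprocess

    for sock in glob.glob('/var/run/ceph/ceph-osd.*.asok'):
        perf = json.loads(subprocess.check_output(
            ['ceph', 'daemon', sock, 'perf', 'dump']))
        if perf['bluefs'].get('slow_used_bytes', 0) > 0:
            print('%s: DB spilled over; time for a bigger DB partition' % sock)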
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com