Re: Bluestore OSD_DATA, WAL & DB

Mark Nelson <mnelson@xxxxxxxxxx> · Fri, 3 Nov 2017 07:33:25 -0500

On 11/03/2017 02:44 AM, Wido den Hollander wrote:

On 3 November 2017 at 0:09, Nigel Williams <nigel.williams@xxxxxxxxxxx> wrote:

On 3 November 2017 at 07:45, Martin Overgaard Hansen <moh@xxxxxxxxxxxxx> wrote:
I want to bring this subject back in the light and hope someone can provide
insight regarding the issue, thanks.

Thanks Martin, I was going to do the same.

Is it possible to make the DB partition (on the fastest device) too
big? In other words, is there a point where, for a given set of OSDs
(number + size), the DB partition is sized too large and is wasting
resources? I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions for each OSD.

It depends on the size of your backing disk. The DB grows with the number of objects you have on your OSD.

A 4TB drive will (usually) hold more objects than a 1TB drive; the same goes for 10TB vs 6TB.

From what I've seen so far, there is no such thing as a 'too big' DB.

The tests I've done so far seem to suggest that filling up a 50GB DB is rather hard to do, unless you have billions of objects and thus tens of millions of objects per OSD.

Are you doing RBD, RGW, or something else to test?  What size are the
objects and are you fragmenting them?

Let's say the average overhead is 16k per object; you would then need a 150GB DB for 10M objects.

You could look into your current numbers and check how many objects you have per OSD.

I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but others only have 250k objects per OSD.

In all those cases, even with 32k of overhead per object, you would need a 30GB DB for 1M objects on that OSD.
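
To make the arithmetic above concrete, here is a rough sketch in Python. It assumes the DB size is simply the object count times a fixed per-object overhead, that "16k" and "32k" mean KiB, and that "GB" is read loosely as GiB. The function name is made up for illustration, and real RocksDB usage will differ (compaction, space amplification), so treat the result as a back-of-the-envelope figure only.

```python
# Back-of-the-envelope DB sizing: objects * per-object overhead.
# Assumption: a flat overhead per object, which RocksDB does not guarantee.

KIB = 1024
GIB = 1024 ** 3


def estimate_db_size_gib(objects_per_osd: int, overhead_bytes_per_object: int) -> float:
    """Return a naive DB size estimate in GiB for one OSD."""
    return objects_per_osd * overhead_bytes_per_object / GIB


# The two cases mentioned in this thread.
ten_million_at_16k = estimate_db_size_gib(10_000_000, 16 * KIB)
one_million_at_32k = estimate_db_size_gib(1_000_000, 32 * KIB)

print(f"10M objects at 16k each: {ten_million_at_16k:.1f} GiB")
print(f"1M objects at 32k each:  {one_million_at_32k:.1f} GiB")
```

Both results land close to the rounded 150GB and 30GB figures quoted above.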

The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change (update?) intensity, size of OSD etc. and
rule-of-thumb.

I would check your running Ceph clusters and calculate the number of objects per OSD.

total objects / num osd * 3
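
As a small sketch of that formula in Python: I am assuming the "* 3" stands for the replica count (3x replication), which the mail does not spell out, and the cluster numbers below are hypothetical, not taken from a real cluster.

```python
# Objects per OSD, following "total objects / num osd * 3".
# Assumption: the factor 3 is the pool's replica count.


def objects_per_osd(total_objects: int, num_osds: int, replicas: int = 3) -> float:
    """Average number of object copies stored on each OSD."""
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    return total_objects / num_osds * replicas


# Hypothetical cluster: 100M objects spread over 300 OSDs with 3 replicas.
per_osd = objects_per_osd(100_000_000, 300)
print(f"about {per_osd:,.0f} objects per OSD")
```

The total object count itself can be read from the cluster's usage statistics; the result can then be fed into a per-object overhead estimate like the 16k or 32k figures discussed earlier.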

One nagging concern I have in the back of my mind is that the amount of
space amplification in rocksdb might grow with the number of levels (i.e.
the number of objects).  The space used per object might be different at
10M objects than at 50M objects.

Wido

An idea occurred to me that by monitoring for the logged spill message
(the event when the DB partition spills/overflows to the OSD), OSDs
could be (lazily) destroyed and recreated with a new DB partition
increased in size say by 10% each time.
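
To get a feel for how slow a 10% step would be, here is a minimal Python sketch that only does the growth arithmetic. It does not talk to Ceph or parse any log; the function name and the 30GB and 50GB example sizes are illustrative assumptions.

```python
# How many destroy/recreate cycles does it take to grow a DB partition
# from one size to another when each cycle adds a fixed percentage?


def recreations_needed(current_gb: float, target_gb: float, growth: float = 0.10) -> int:
    """Count the growth steps until the partition reaches target_gb."""
    if growth <= 0:
        raise ValueError("growth must be positive")
    steps = 0
    size = current_gb
    while size < target_gb:
        size *= 1 + growth  # each recreation makes the partition 10% larger
        steps += 1
    return steps


# Hypothetical example: starting at 30GB and aiming for at least 50GB.
steps = recreations_needed(30, 50)
print(f"{steps} recreations needed")
```

Since every step means rebuilding an OSD and backfilling its data, a larger increment per step may be worth considering.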
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com