Re: Ceph behavior on (lots of) small objects (RGW, RADOS + erasure coding)?

Hi,

you are probably running into the bluestore min alloc size, which is 64 kB on HDDs and 16 kB on SSDs. With k=5, m=2 you'd need objects of at least 320 kB on HDDs or 80 kB on SSDs to use the space efficiently.
Last time I checked, these values are fixed at OSD creation and cannot be changed afterwards.
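
To put rough numbers on that, here's a back-of-the-envelope sketch (it assumes each of the k+m shards is rounded up to min_alloc_size on its own OSD and ignores bluestore's other per-object overhead, so treat it as an approximation rather than an exact accounting):

    def on_disk_bytes(obj_size, k=5, m=2, min_alloc=64 * 1024):
        # each data shard holds roughly obj_size / k bytes, but space is
        # allocated in units of min_alloc_size; parity shards are the same size
        shard = max(-(-obj_size // k), min_alloc)  # ceil(obj_size / k), but at least min_alloc
        return shard * (k + m)

    for size in (4 * 1024, 320 * 1024, 4 * 1024 * 1024):
        used = on_disk_bytes(size)
        print(f"{size // 1024} kB object -> ~{used // 1024} kB on disk ({used / size:.1f}x)")

With the 64 kB HDD min alloc size that works out to roughly 112x for a 4 kB object, but only 1.4x once objects are 320 kB or larger.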

It's not necessarily the best idea to store a lot of very small objects in RADOS (or CephFS or RGW), but it really depends on your exact requirements and access patterns.


Paul


2018-06-27 11:32 GMT+02:00 Nicolas Dandrimont <olasd@xxxxxxxxxxxxxxxxxxxx>:
Hi,

I would like to use Ceph to store a lot of small objects. Our current usage
pattern is 4.5 billion unique objects, ranging from 0 to 100 MB in size, with a
median of 3-4 kB. Overall that's around 350 TB of raw data to store, which isn't
much, but it's spread across a *lot* of tiny files.

We expect to grow by around a third per year, and the object size distribution
to stay essentially the same (it's been stable for the past three years, and we
don't see that changing).

Our access pattern is a very simple key -> value store, where the key happens
to be the sha1 of the content we're storing. All metadata is stored externally,
and we really only need a dumb object store.

Our redundancy requirement is to be able to withstand the loss of 2 OSDs.

After looking at our options for storage in Ceph, I dismissed (perhaps hastily)
RGW for its metadata overhead, and went straight to plain RADOS. I've set up an
erasure-coded storage pool with otherwise default settings and k=5, m=2
(expecting a 40% increase in storage use over the plain contents).
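
For the record, creating such a pool looks roughly like this (the profile and
pool names and the PG count below are just placeholders):

    ceph osd erasure-code-profile set ec-5-2 k=5 m=2 crush-failure-domain=host
    ceph osd pool create objects 2048 2048 erasure ec-5-2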

After storing objects in the pool, I see a storage usage of 700% instead of
140%. My understanding of the erasure-code profile docs [1] is that objects
below the stripe width (k * stripe_unit, 20 kB in my case) can't be chunked
for erasure coding, so RADOS falls back to plain object copying with k+m
copies.

[1] http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile/
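
(For completeness: "ceph osd erasure-code-profile get <profile>" prints a
profile's k and m, plus stripe_unit if it was set explicitly; as far as I can
tell the default stripe_unit is 4 kB, which is where the 20 kB stripe width
above comes from.)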

Is my understanding correct? Does anyone have experience with this kind of
storage workload in Ceph?

If my understanding is correct, I'll end up adding size tiering to my object
storage layer, shuffling objects into two pools with different settings
according to their size. That's not too bad, but I'd like to make sure I'm not
completely misunderstanding something.
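
In case it's useful, that tiering shim would be something like the following
python-rados sketch (pool names and the size cutoff are placeholders and would
need tuning; untested):

    import hashlib
    import rados

    SMALL_POOL = 'objects-replicated'  # hypothetical replicated pool for small objects
    BIG_POOL = 'objects-ec-5-2'        # hypothetical EC k=5,m=2 pool for the rest
    CUTOFF = 320 * 1024                # size threshold, to be tuned

    def store(cluster, data):
        """Write one object into the pool matching its size; the key is its sha1."""
        key = hashlib.sha1(data).hexdigest()
        pool = SMALL_POOL if len(data) < CUTOFF else BIG_POOL
        ioctx = cluster.open_ioctx(pool)
        try:
            ioctx.write_full(key, data)
        finally:
            ioctx.close()
        return key, pool

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        print(store(cluster, b'example content'))
    finally:
        cluster.shutdown()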

Thanks!
--
Nicolas Dandrimont
Backend Engineer, Software Heritage

BOFH excuse #170:
popper unable to process jumbo kernel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
