Re: Ceph behavior on (lots of) small objects (RGW, RADOS + erasure coding)?

On Wed, Jun 27, 2018 at 2:32 AM Nicolas Dandrimont <olasd@xxxxxxxxxxxxxxxxxxxx> wrote:
Hi,

I would like to use ceph to store a lot of small objects. Our current usage
pattern is 4.5 billion unique objects, ranging from 0 to 100MB, with a median
size of 3-4kB. Overall, that's around 350 TB of raw data to store, which isn't
much, but that's across a *lot* of tiny files.

We expect growth of around a third per year, and the object size distribution
to stay essentially the same (it's been stable for the past three years, and
we don't see that changing).

Our object access pattern is a very simple key -> value store, where the key
happens to be the sha1 of the content we're storing. All metadata is stored
externally, and we really only need a dumb object store.
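
For illustration, that access pattern maps directly onto librados. Here is a
minimal sketch using the python3-rados bindings; the pool name and conf path
are placeholders, not anything prescribed by Ceph:

    import hashlib
    import rados

    # Minimal sketch: a dumb sha1-keyed blob store on top of a single pool.
    # "contents" and the conf path are placeholders for the actual setup.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('contents')

    def put(content: bytes) -> str:
        """Store a blob under the sha1 of its content and return the key."""
        key = hashlib.sha1(content).hexdigest()
        ioctx.write_full(key, content)
        return key

    def get(key: str) -> bytes:
        """Read a whole object back, whatever its size."""
        size, _mtime = ioctx.stat(key)
        return ioctx.read(key, length=size)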

Our redundancy requirement is to be able to withstand the loss of 2 OSDs.

After looking at our options for storage in Ceph, I dismissed (perhaps hastily)
RGW for its metadata overhead, and went straight to plain RADOS. I've set up an
erasure-coded storage pool with otherwise default settings and k=5, m=2
(expecting a 40% increase in storage use over the plain contents).

After storing objects in the pool, I see a storage usage of 700% instead of
140%. My understanding of the erasure code profile docs[1] is that objects that
are below the stripe width (k * stripe_unit, which in my case is 20KB) can't be
chunked for erasure coding, which makes RADOS fall back to plain object
copying, with k+m copies.

[1] http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile/
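
Spelling out the arithmetic (the 4KB stripe_unit is the default, taken as an
assumption here):

    # k=5, m=2 with the default 4KB stripe_unit.
    k, m = 5, 2
    stripe_unit = 4 * 1024
    stripe_width = k * stripe_unit      # 20480 bytes: the 20KB threshold above
    ec_overhead = (k + m) / k           # 1.4 -> the expected ~140% raw usage
    full_copies = k + m                 # 7   -> the observed ~700% raw usage
    print(stripe_width, ec_overhead, full_copies)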

Is my understanding correct? Does anyone have experience with this kind of
storage workload in Ceph?

That’s close but not *quite* right. It’s not that Ceph explicitly “falls back” to replication. With most (though perhaps not all) erasure codes, what you’ll see is full-sized parity chunks, the data itself stored in full (with the default Reed-Solomon code, that just means full-sized data chunks, up to however many are needed to hold a single copy of the object), and the remaining data chunks (out of the k) holding no data. *But* Ceph keeps the “object info” metadata in each shard, so all the OSDs in the PG will still witness all the writes.
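
To make that concrete, here is a rough model of the layout described above,
assuming the default 4KB stripe_unit and ignoring the per-shard metadata and
on-disk allocation granularity (both of which push real usage higher):

    import math

    def raw_bytes(obj_size, k=5, m=2, stripe_unit=4 * 1024):
        """Approximate raw usage: data chunks filled as needed, the remaining
        data chunks empty, and m full-sized parity chunks per stripe."""
        stripes = max(1, math.ceil(obj_size / (k * stripe_unit)))
        data_chunks = max(1, math.ceil(obj_size / stripe_unit))
        parity_chunks = m * stripes
        return (data_chunks + parity_chunks) * stripe_unit

    # A 3.5kB object comes out as 1 data chunk + 2 parity chunks = 12KB raw
    # (~3.5x), before the per-shard metadata every OSD in the PG still carries.
    print(raw_bytes(3500), raw_bytes(3500) / 3500)
    # A 100MB object comes out close to the nominal 1.4x.
    print(raw_bytes(100 * 1024 * 1024) / (100 * 1024 * 1024))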



If my understanding is correct, I'll end up adding size tiering on my object
storage layer, shuffling objects in two pools with different settings according
to their size. That's not too bad, but I'd like to make sure I'm not completely
misunderstanding something.

That’s probably a reasonable response, especially if you are already maintaining an index for other purposes!
-Greg
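
A minimal sketch of that size-tiering approach, routing writes between a
replicated pool for sub-stripe-width objects and the EC pool for the rest (the
pool names are placeholders; the 20KB cutoff is the k * stripe_unit from
above):

    import hashlib
    import rados

    STRIPE_WIDTH = 5 * 4 * 1024   # k * stripe_unit, the 20KB threshold

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    small_pool = cluster.open_ioctx('contents-replicated')   # placeholder name
    big_pool = cluster.open_ioctx('contents-ec')              # placeholder name

    def put(content: bytes) -> str:
        """Small objects go to the replicated pool, the rest to the EC pool."""
        key = hashlib.sha1(content).hexdigest()
        ioctx = small_pool if len(content) < STRIPE_WIDTH else big_pool
        ioctx.write_full(key, content)
        return key

    def get(key: str) -> bytes:
        """Look the object up in either tier (an external index could record
        the tier instead of probing both pools)."""
        for ioctx in (small_pool, big_pool):
            try:
                size, _mtime = ioctx.stat(key)
                return ioctx.read(key, length=size)
            except rados.ObjectNotFound:
                continue
        raise KeyError(key)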



Thanks!
--
Nicolas Dandrimont
Backend Engineer, Software Heritage

BOFH excuse #170:
popper unable to process jumbo kernel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
