Re: Ceph behavior on (lots of) small objects (RGW, RADOS + erasure coding)?

On Wed, Jun 27, 2018 at 2:32 AM Nicolas Dandrimont <olasd@xxxxxxxxxxxxxxxxxxxx> wrote:
Hi,

I would like to use Ceph to store a lot of small objects. Our current usage
pattern is 4.5 billion unique objects, ranging from 0 to 100 MB, with a median
size of 3-4 kB. Overall, that's around 350 TB of raw data to store, which isn't
much, but it's spread across a *lot* of tiny files.

We expect growth of around a third per year, and the object size distribution
to stay essentially the same (it's been stable for the past three years, and we
don't see that changing).

Our object access pattern is a very simple key -> value store, where the key
happens to be the sha1 of the content we're storing. All metadata is stored
externally, and we really only need a dumb object store.

Our redundancy requirement is to be able to withstand the loss of 2 OSDs.

After looking at our options for storage in Ceph, I dismissed (perhaps hastily)
RGW for its metadata overhead, and went straight to plain RADOS. I've set up an
erasure-coded storage pool with default settings and k=5, m=2 (expecting a 40%
increase in storage use over the plain contents).

After storing objects in the pool, I see a storage usage of 700% instead of
140%. My understanding of the erasure code profile docs[1] is that objects that
are below the stripe width (k * stripe_unit, which in my case is 20KB) can't be
chunked for erasure coding, which makes RADOS fall back to plain object
copying, with k+m copies.

[1] http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile/
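
(As a rough sanity check on those numbers, here is a simplified model of the
chunking as I read it from [1]. This is my own sketch, assuming the default
4 kB stripe_unit; it ignores per-shard metadata and on-disk allocation
granularity, so real usage on tiny objects will come out higher than it
predicts.)

    # Rough model of raw usage in a k=5, m=2 Reed-Solomon pool with the
    # default 4 kB stripe_unit (20 kB stripe width). Ignores per-shard
    # metadata and allocation granularity, so treat it as a lower bound.
    import math

    K, M, STRIPE_UNIT = 5, 2, 4096

    def raw_bytes(obj_size):
        stripes = max(1, math.ceil(obj_size / (K * STRIPE_UNIT)))
        chunk_size = stripes * STRIPE_UNIT          # size of each shard
        data_chunks = min(K, math.ceil(obj_size / chunk_size))
        return (data_chunks + M) * chunk_size       # used data chunks + parity

    for size in (3 * 1024, 20 * 1024, 100 * 1024 * 1024):
        print(f"{size:>10} B -> {raw_bytes(size) / size:.2f}x raw")
    # A 3 kB object comes out around 4x even in this optimistic model; only
    # objects of at least one full stripe width get close to the 1.4x target.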

Is my understanding correct? Does anyone have experience with this kind of
storage workload in Ceph?

That’s close but not *quite* right. It’s not that Ceph explicitly “falls back” to replication. With most (though perhaps not all) erasure codes, what you’ll see is full-sized parity blocks, a full store of the data (with the default Reed-Solomon that just means full-sized data chunks, as many of them as are needed to hold the object once), and the remaining data chunks (out of the k) holding no data at all. *But* Ceph keeps the “object info” metadata in every shard, so all the OSDs in the PG still witness all the writes.
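
To put numbers on that, here is a small illustration of what each of the seven
shards of a single 3 kB object would roughly hold under that description (my
own sketch, assuming the default 4 kB stripe_unit; it is not actual Ceph
output):

    # Per-shard layout sketch for one 3 kB object in a k=5, m=2 pool with a
    # 4 kB stripe_unit. Purely an illustration of the description above.
    K, M, STRIPE_UNIT = 5, 2, 4096
    OBJ_SIZE = 3 * 1024

    for shard in range(K + M):
        if shard >= K:
            kind, payload = "coding", STRIPE_UNIT   # parity is always full-sized
        elif shard * STRIPE_UNIT < OBJ_SIZE:
            kind, payload = "data", STRIPE_UNIT     # holds the (padded) object data
        else:
            kind, payload = "data", 0               # no payload at all
        # every shard still carries the object-info metadata, so every OSD
        # in the PG sees the write even when the payload is zero
        print(f"shard {shard}: {kind:6s} {payload:4d} B payload + object-info metadata")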



If my understanding is correct, I'll end up adding size tiering on my object
storage layer, shuffling objects into two pools with different settings
according to their size. That's not too bad, but I'd like to make sure I'm not
completely misunderstanding something.

That’s probably a reasonable response, especially if you are already maintaining an index for other purposes!
-Greg
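
For what it's worth, the tiering shim can stay very thin. Below is a minimal
sketch using python-rados; the pool names and the 16 kB cutoff are made up for
illustration, and in practice you would probably want the cutoff to line up
with the EC stripe width:

    # Minimal size-tiering sketch using python-rados: small objects go to a
    # replicated pool, everything else to the erasure-coded pool.
    # Pool names and the cutoff are illustrative, not recommendations.
    import rados

    SMALL_CUTOFF = 16 * 1024            # e.g. just under the 20 kB stripe width
    SMALL_POOL = "objects-replicated"   # hypothetical 3-replica pool
    LARGE_POOL = "objects-ec"           # hypothetical k=5, m=2 pool

    def store(cluster, sha1_hex, data):
        pool = SMALL_POOL if len(data) < SMALL_CUTOFF else LARGE_POOL
        ioctx = cluster.open_ioctx(pool)
        try:
            ioctx.write_full(sha1_hex, data)   # key is the content's sha1
        finally:
            ioctx.close()

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    store(cluster, "da39a3ee5e6b4b0d3255bfef95601890afd80709", b"")
    cluster.shutdown()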



Thanks!
--
Nicolas Dandrimont
Backend Engineer, Software Heritage

BOFH excuse #170:
popper unable to process jumbo kernel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
