Re: Efficient storage of small objects / bulk erasure coding

Gregory Farnum <gfarnum@xxxxxxxxxx> · Tue, 17 Oct 2017 21:49:24 +0000

On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky <jiri.horky@xxxxxxxxx> wrote:
Hi list,

We are thinking of building a relatively big Ceph-based object store for
our sample files. We have about 700M files, ranging from very small
(1-4KiB) to pretty big (several GiB); the median file size is 64KiB.
Since the required space is relatively large (1PiB of usable storage), we
are thinking of using erasure coding for this case. On the other hand, we
need to achieve at least 1200MiB/s of read throughput. The working
assumption is 4+2 EC (thus 50% overhead).
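
For reference, the capacity figures implied by these numbers can be worked out in a few lines. This is only a back-of-the-envelope sketch: it assumes the 1PiB of usable space is filled by exactly the 700M files mentioned, and it ignores EC padding, metadata and free-space headroom. The function names are illustrative, not part of any Ceph tool.

```python
# Back-of-the-envelope capacity figures from the numbers in the post.
PIB = 2**50

usable_bytes = 1 * PIB        # 1 PiB usable, from the post
file_count = 700_000_000      # about 700M files, from the post


def raw_bytes_ec(usable, k, m):
    """Raw capacity for k data + m coding chunks (ignores padding and metadata)."""
    return usable * (k + m) / k


def raw_bytes_replicated(usable, copies):
    """Raw capacity for plain replication."""
    return usable * copies


ec_raw = raw_bytes_ec(usable_bytes, k=4, m=2)
rep_raw = raw_bytes_replicated(usable_bytes, copies=3)
# Assumption: the whole usable space is taken up by these files.
mean_file_bytes = usable_bytes / file_count

print(f"4+2 EC raw capacity: {ec_raw / PIB:.2f} PiB")
print(f"3x replication raw:  {rep_raw / PIB:.2f} PiB")
print(f"mean file size:      {mean_file_bytes / 2**20:.2f} MiB")
```

Under that assumption the mean file size comes out at roughly 1.5MiB, far above the 64KiB median, so most of the capacity sits in the large files while most of the object count sits in the small ones.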

Since EC is applied per object, the small objects will be striped into
even smaller chunks. With 4+2 EC, one needs (at least) 4 IOs to read a
single object in this scenario, so the number of IOPS required with EC is
relatively high. Some vendors (such as Hitachi, but I believe EMC as
well) do offline EC with a predefined chunk size instead. The idea is to
first write objects with a replication factor of 3, wait for enough
objects to fill 4x 64MiB chunks, and only then do EC on those. This not
only makes the EC less computationally intensive and repairs much faster,
but it also allows reading the majority of the small objects directly by
reading just part of one chunk (assuming a non-degraded state), since one
chunk actually contains the whole object.
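
The IOPS argument can be made concrete with a rough calculation. The sketch below treats every object as if it had the median size of 64KiB, which is a simplification, and counts one disk read per data chunk for per-object EC versus one read per object for the packed layout. It ignores caching, degraded reads, and objects that straddle a chunk boundary in the packed case.

```python
KIB = 1024
MIB = 1024 * KIB


def reads_per_second(throughput_bytes, object_bytes, ios_per_object):
    """Disk read operations per second needed to sustain a given throughput."""
    objects_per_second = throughput_bytes / object_bytes
    return objects_per_second * ios_per_object


target = 1200 * MIB   # read throughput target from the post
obj = 64 * KIB        # median object size, used here as if every object had it
k = 4                 # data chunks in 4+2 EC

per_object_ec = reads_per_second(target, obj, ios_per_object=k)
packed = reads_per_second(target, obj, ios_per_object=1)
shard_bytes = obj // k   # size of each data chunk of one 64KiB object

print(f"data chunk size with per-object EC: {shard_bytes // KIB} KiB")
print(f"reads/s with per-object EC:         {per_object_ec:.0f}")
print(f"reads/s with packed chunks:         {packed:.0f}")
```

With these inputs the per-object layout needs four times as many reads, each fetching only 16KiB.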

I wonder if something similar is already possible with Ceph and/or is
planned. For our use case of very small objects, it would mean a nearly
3-4x improvement in terms of required IOPS.

Another way out of this situation would be to specify different storage
pools/policies based on file size, i.e. to do 3x replication of the very
small files and only use EC for bigger files, where the performance hit
of 4x IOPS won't be that painful. But I am afraid this is not
possible...
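
If the routing were done in the client application rather than in the storage layer, the size-based policy could look something like the sketch below. The pool names and the 128KiB cut-off are hypothetical placeholders chosen for illustration; nothing here is an existing Ceph feature or setting.

```python
SMALL_OBJECT_LIMIT = 128 * 1024   # hypothetical cut-off, not from the post

REPLICATED_POOL = "samples-rep3"  # hypothetical pool names
EC_POOL = "samples-ec42"


def pool_for(size_bytes, limit=SMALL_OBJECT_LIMIT):
    """Choose the target pool for an object based on its size alone."""
    return REPLICATED_POOL if size_bytes < limit else EC_POOL


print(pool_for(4 * 1024))     # a very small file goes to the replicated pool
print(pool_for(2 * 1024**3))  # a multi-GiB file goes to the EC pool
```

A reader then has to know which pool an object went to, for example by recording the size or the pool in the application's own catalogue.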

Unfortunately, any logic like this would need to be handled in your application layer. Raw RADOS does not do object sharding or aggregation on its own.
CERN did contribute libradosstriper, which will break your multi-gigabyte objects down into more typical sizes, but a generic system for packing many small objects into larger ones is tough: the choices depend so much on likely access patterns and such.

I would definitely recommend working out something like that, though!
-Greg
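
As a rough illustration of what application-level packing might involve, here is a minimal in-memory sketch: small objects are appended to one large blob, and an index records the offset and length of each. The class and function names are made up for this example. In a real deployment the blob would be stored as a single large object and the index kept somewhere durable, and deletion, compaction and crash safety would all need separate handling.

```python
import io


class PackWriter:
    """Append small objects to one large blob and remember where each one lives."""

    def __init__(self, target_bytes):
        self.target_bytes = target_bytes
        self.buffer = io.BytesIO()
        self.index = {}  # name -> (offset, length)

    def add(self, name, data):
        offset = self.buffer.tell()
        self.buffer.write(data)
        self.index[name] = (offset, len(data))

    def is_full(self):
        """True once the blob has reached the target size and should be flushed."""
        return self.buffer.tell() >= self.target_bytes

    def blob(self):
        return self.buffer.getvalue()


def read_packed(blob, index, name):
    """Fetch one small object with a single ranged read of the blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


pack = PackWriter(target_bytes=64 * 1024 * 1024)
pack.add("sample-a", b"hello")
pack.add("sample-b", b"world!")
print(pack.index["sample-b"])
print(read_packed(pack.blob(), pack.index, "sample-b"))
```

The hard part mentioned above is what this sketch leaves out: deciding which objects to pack together so that typical reads stay within one blob.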

Any other hint is sincerely welcome.

Thank you
Jiri Horky

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
