Hi John,

On 10/23/2017 02:59 PM, John Spray wrote:
> On Tue, Oct 17, 2017 at 9:42 PM, Jiri Horky <jiri.horky@xxxxxxxxx> wrote:
>> Hi list,
>>
>> we are thinking of building a relatively big CEPH-based object storage for
>> our sample files - we have about 700M files ranging from very small
>> (1-4KiB) files to pretty big ones (several GiB). The median file size is
>> 64KiB. Since the required space is relatively large (1PiB of usable
>> storage), we are thinking of utilizing erasure coding for this case. On
>> the other hand, we need to achieve at least 1200MiB/s throughput on reads.
>> The working assumption is 4+2 EC (thus 50% overhead).
>>
>> Since the EC is per-object, the small objects will be striped into even
>> smaller ones. With 4+2 EC, one needs (at least) 4 IOs to read a single
>> object in this scenario -> the number of required IOPS when using EC is
>> relatively high. Some vendors (such as Hitachi, but I believe EMC as well)
>> do offline, predefined-chunk-size EC instead. The idea is to first write
>> objects with a replication factor of 3, wait for enough objects to fill
>> 4x 64MiB chunks, and only then do EC on that. This not only makes the EC
>> less computationally intensive and repairs much faster, but it also allows
>> reading the majority of the small objects directly by reading just part of
>> one of the chunks (assuming a non-degraded state) - one chunk actually
>> contains the whole object.
> How does the client know the name of the larger/bulk object, given the
> name of one of the small objects within it? Presumably, there is some
> index?
The point is that the client does not need to care. The bulking for more
efficient EC storage is done by the underlying object store/storage system.
So the clients access objects the ordinary way, whereas the storage layer
takes care of tracking in which EC bulk each individual object is stored. I
understand this is completely different thinking from what RADOS currently
uses.
>
>> I wonder if something similar is already possible with CEPH and/or is
>> planned. For our use case of very small objects, it would mean a nearly
>> 3-4x performance boost in terms of required IOPS.
>>
>> Another way out of this situation would be the ability to specify
>> different storage pools/policies based on file size - i.e. to do 3x
>> replication of the very small files and only use EC for bigger files,
>> where the performance hit of 4x IOPS won't be that painful. But I am
>> afraid this is not possible...
> Surely there is nothing stopping you writing your small objects in one
> pool and your large objects in another? Am I missing something?
Except that all the clients accessing the shared storage would need to have
that logic inside. It would simply be better if I could make it transparent
to the clients. (A toy sketch of what such client-side routing would look
like is appended at the end.)

Jiri Horky

>
> John
>
>> Any other hint is sincerely welcome.
>>
>> Thank you
>> Jiri Horky
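
P.S. To make the "pack first, erasure-code later" idea above a bit more
concrete, here is a toy sketch of such a packing layer. Everything in it
(the class and field names, the in-memory index, sealing after 4x 64MiB of
data) is made up for illustration; it is not how Hitachi, EMC or Ceph
actually implement anything:

CHUNK_SIZE = 64 * 1024 * 1024      # one EC data chunk (assumed size)
DATA_CHUNKS = 4                    # 4+2 EC -> seal after 4 chunks worth of data

class PackingStore:
    def __init__(self):
        self.open_bulk = bytearray()   # data still waiting to be erasure coded
        self.index = {}                # object name -> (bulk_id, offset, length)
        self.sealed_bulks = []         # bulks already handed off to EC
        self.next_bulk_id = 0

    def put(self, name, data):
        # Small objects are appended to the currently open bulk; the index
        # remembers where, so a later read only has to touch one chunk.
        offset = len(self.open_bulk)
        self.open_bulk.extend(data)
        self.index[name] = (self.next_bulk_id, offset, len(data))
        if len(self.open_bulk) >= DATA_CHUNKS * CHUNK_SIZE:
            self._seal()

    def _seal(self):
        # Here the bulk would be split into 4 data chunks, the 2 parity
        # chunks computed, and the temporary 3x replicated copies dropped.
        self.sealed_bulks.append(bytes(self.open_bulk))
        self.open_bulk = bytearray()
        self.next_bulk_id += 1

    def get(self, name):
        bulk_id, offset, length = self.index[name]
        source = (self.open_bulk if bulk_id == self.next_bulk_id
                  else self.sealed_bulks[bulk_id])
        # In the non-degraded case this is a single ranged read of one chunk.
        return bytes(source[offset:offset + length])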
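
And for completeness, this is roughly what the per-size pool routing that
John suggests would look like on the client side with the python-rados
bindings (the pool names and the 64KiB threshold are of course made up):

import rados

SMALL_LIMIT = 64 * 1024  # hypothetical cut-off, roughly our median object size

def write_sample(cluster, name, data):
    # Every client has to agree on the same threshold and pool names,
    # which is exactly the logic I would prefer to keep out of the clients.
    pool = 'samples-replicated' if len(data) < SMALL_LIMIT else 'samples-ec42'
    ioctx = cluster.open_ioctx(pool)
    try:
        ioctx.write_full(name, data)
    finally:
        ioctx.close()

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    write_sample(cluster, 'sample-0001', b'...payload...')
finally:
    cluster.shutdown()

The read path is even uglier: a reader either has to try/stat both pools or
keep its own index of which pool an object went to, which is why I would
much prefer the storage layer to handle this transparently.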