On Mon, Oct 23, 2017 at 9:37 AM Jiri Horky <jiri.horky@xxxxxxxxx> wrote:
Hi Greg,
On 10/17/2017 11:49 PM, Gregory Farnum wrote:
On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky <jiri.horky@xxxxxxxxx> wrote:
Hi list,
we are thinking of building a relatively big CEPH-based object store for
our sample files - we have about 700M files, ranging from very small
(1-4KiB) files to pretty big ones (several GiB). The median file size is
64KiB. Since the required space is relatively large (1PiB of usable
storage), we are thinking of utilizing erasure coding for this case. On
the other hand, we need to achieve at least 1200MiB/s throughput on
reads. The working assumption is 4+2 EC (thus 50% overhead).
Since the EC is per-object, the small objects will be striped into even
smaller pieces. With 4+2 EC, one needs (at least) 4 IOs to read a single
object in this scenario, so the number of required IOPS when using EC is
relatively high. Some vendors (such as Hitachi, but I believe EMC as
well) do offline, predefined-chunk-size EC instead. The idea is to first
write objects with replication factor of 3, wait for enough objects to
fill 4x 64MiB chunks, and only then do EC on those. This not only makes
the EC less computationally intensive and repairs much faster, but it
also allows reading the majority of the small objects directly, by
reading just a part of a single chunk (assuming a non-degraded state) -
one chunk actually contains the whole object.
I wonder if something similar is already possible with CEPH and/or is
planned. For our use case of very small objects, it would mean a near
3-4x performance boost in terms of required IOPS.
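Back-of-envelope, this is what the IOPS gap looks like with our numbers (a rough illustration only; replication is counted as a single data read per object, EC as one read per data chunk):

```python
# Rough read-IOPS estimate for 1200 MiB/s of 64 KiB median objects,
# comparing replicated reads (1 IO/object) with 4+2 EC reads (4 IOs).
MIB = 1024 * 1024
KIB = 1024

target_throughput = 1200 * MIB          # required read bytes/s
median_object = 64 * KIB                # median sample size

objects_per_sec = target_throughput // median_object
replicated_iops = objects_per_sec * 1   # one data read per object
ec_iops = objects_per_sec * 4           # k=4 chunk reads per object

print(objects_per_sec)   # 19200
print(replicated_iops)   # 19200
print(ec_iops)           # 76800
```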
Another option to get out of this situation would be the ability to
specify different storage pools/policies based on file size - i.e. to do
3x replication of the very small files and only use EC for the bigger
files, where the performance hit of 4x IOPS won't be that painful. But I
am afraid this is not possible...
Unfortunately any logic like this would need to be handled in your application layer; raw RADOS does not do object sharding or aggregation on its own. CERN did contribute libradosstriper, which will break down your multi-gigabyte objects into more typical sizes, but a generic system for packing many small objects into larger ones is tough; the choices depend so much on likely access patterns and such.
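(For what it's worth, the size-based routing Jiri mentions is easy to do on the client side; a minimal sketch, where the pool names and the 1 MiB threshold are invented for illustration and the python-rados calls are shown only as comments:)

```python
# Application-level pool selection by object size. RADOS itself won't
# do this, so the client picks the pool per write. Pool names and the
# threshold below are placeholders, not real configuration.
SMALL_POOL = "samples-replicated"   # hypothetical 3x replicated pool
LARGE_POOL = "samples-ec42"         # hypothetical 4+2 EC pool
THRESHOLD = 1 * 1024 * 1024         # split point, tune for the workload

def choose_pool(size_bytes: int) -> str:
    """Route small objects to replication, large ones to EC."""
    return SMALL_POOL if size_bytes < THRESHOLD else LARGE_POOL

# With python-rados the write side would look roughly like:
#   import rados
#   cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
#   cluster.connect()
#   ioctx = cluster.open_ioctx(choose_pool(len(data)))
#   ioctx.write_full(object_name, data)
#   ioctx.close()

print(choose_pool(4 * 1024))          # samples-replicated
print(choose_pool(2 * 1024 * 1024))   # samples-ec42
```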
I would definitely recommend working out something like that, though!
-Greg

This is unfortunate. I believe that for storage of small objects, this would be a deal breaker. Hitachi claims they can do 20+6 erasure coding when using predefined-size EC, which is hardly imaginable with the current CEPH implementation. Actually, I am afraid that for us, the lack of this feature means we would buy an object store instead of building it on open source technology :-/
From the technical side, I don't see why the access pattern of such objects would change the storage strategy. If you left the bulk block size configurable, that should be enough, shouldn't it?
Well, there are two different things here. If you're doing replicated writes and then erasure-coding the data, you assume the data changes slowly enough for that to work, or at least that the cost of erasure coding it is worthwhile.
That's not a bad bet, but the RADOS architecture simply doesn't support doing anything like that internally; all decisions about replication versus erasure coding and data placement happen on the level of a pool, not on objects inside of them. So bulk packing of objects isn't really possible for RADOS to do on its own, and the application has to drive any data movement. That requires understanding patterns to select the right coding chunks (so that objects tend to exist in one chunk), to know when is a good time to physically read and write the data, etc.
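To make that concrete, a toy sketch of what application-driven packing could look like (pure Python; all names and the 64 MiB target are assumptions, and a real system would write full chunks to an EC pool and persist the index durably):

```python
import io

CHUNK_TARGET = 64 * 1024 * 1024   # bundle size before handing off to EC

class Packer:
    """Toy application-side packer: append small objects into one large
    buffer and remember (offset, length), so a non-degraded read of a
    small object touches only part of a single packed chunk."""

    def __init__(self):
        self.buf = io.BytesIO()
        self.index = {}                   # name -> (offset, length)

    def put(self, name, data):
        self.buf.seek(0, io.SEEK_END)     # always append at the end
        off = self.buf.tell()
        self.buf.write(data)
        self.index[name] = (off, len(data))

    def get(self, name):
        off, length = self.index[name]
        self.buf.seek(off)
        return self.buf.read(length)

    def full(self):
        # Chunk is ready to be erasure-coded once it reaches the target.
        return self.buf.getbuffer().nbytes >= CHUNK_TARGET

p = Packer()
p.put("sample-a", b"x" * 4096)
p.put("sample-b", b"y" * 65536)
print(p.get("sample-a") == b"x" * 4096)   # True
print(p.full())                           # False
```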
This use case you're describing is certainly useful, but so far as I know it's not implemented in any open-source storage solutions because it's pretty specialized and requires a lot of backend investment that doesn't pay off incrementally.
-Greg
Regards
Jiri Horky
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com