Re: Storing 20 billion immutable objects in Ceph, 75% <16KB

On 18/02/2021 17:55, Dan van der Ster wrote:
> On Thu, Feb 18, 2021 at 12:36 AM Robin H. Johnson <robbat2@xxxxxxxxxx> wrote:
>> On Wed, Feb 17, 2021 at 05:36:53PM +0100, Loïc Dachary wrote:
>>> Bonjour,
>>>
>>> TL;DR: Is it more advisable to work on Ceph internals to make it
>>> friendly to this particular workload or write something similar to
>>> EOS[0] (i.e. RocksDB + XRootD + RBD)?
>> CERN's EOSPPC instance, which is one of the biggest from what I can
>> find, was at around 3.5B files in 2019, and you're proposing running 10B
>> files, so I don't know how EOS will handle that. Maybe Dan can chime in
>> on the scalability there.
> The EOS namespace is now QuarkDB https://github.com/gbitzes/QuarkDB
> But even with a clever namespace I don't think it is practical to
> manage a system with 10B tiny files.
> Enumerating them for a consistency check or migrating between hosts or
> recovering from failures is going to be painful.
> Pack them...
>
> -- Dan
Thanks for the update and the wise advice, Dan :-)
>
>
>
>> Please do keep up this important work! I've tried to do something
>> similar at a much smaller scale for Gentoo Linux's historical collection
>> of source code media (distfiles), but am significantly further behind
>> your effort.
>>
>>> Let's say those 10 billion objects are stored in a single 4+2 erasure
>>> coded pool with bluestore compression set for objects that have a size
>>> over 32KB and the smallest allocation size for bluestore set to 4KB[3].
>>> The 750TB won't use the expected 350TB but about 30% more, i.e.
>>> ~450TB (see [4] for the maths). This space amplification is because
>>> storing a 1 byte object uses the same space as storing a 16KB object
>>> (see [5] to repeat the experiment at home). In a 4+2 erasure coded
>>> pool, each of the 6 chunks will use no less than 4KB because that's
>>> the smallest allocation size for bluestore. That's 4 * 4KB = 16KB
>>> even when all that is needed is 1 byte.
>> I think you have an error here: with a 4KB allocation size in a 4+2 pool,
>> any object sized (0,16K] will take _6_ 4KB allocations: 24KB of storage.
>> Any object sized (16K,32K] will take _12_ 4KB allocations: 48KB of storage.
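
To sanity check the arithmetic above, here is a minimal sketch (plain
Python, nothing Ceph specific) of the raw usage per object for a 4+2 pool
with a 4KB bluestore min_alloc_size, using the simple chunk model discussed
in this thread; the constants are assumptions, not measurements:

    import math

    K, M = 4, 2             # erasure coding profile: 4 data + 2 coding chunks
    MIN_ALLOC = 4 * 1024    # assumed bluestore_min_alloc_size

    def raw_usage(object_size):
        """Raw bytes consumed by one object once each chunk is rounded up."""
        chunk = math.ceil(max(object_size, 1) / K)        # 1/K of the object per data chunk
        alloc = math.ceil(chunk / MIN_ALLOC) * MIN_ALLOC  # rounded to the allocation unit
        return (K + M) * alloc                            # data + coding chunks

    for size in (1, 16 * 1024, 17 * 1024, 32 * 1024):
        print(f"{size:>6} B object -> {raw_usage(size) // 1024} KB raw")
    # 1 B and 16 KB objects both land on 6 * 4 KB = 24 KB of raw space;
    # anything just above 16 KB jumps to 6 * 8 KB = 48 KB.
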
>>
>> I'd attack this from another side entirely:
>> - how aggressively do you want to pack objects overall? e.g. if you have
>>   a few thousand objects in the 4-5K range, do you want zero bytes
>>   wasted between objects?
>> - how aggressively do you want to dedup objects that share common data,
>>   especially if it's not aligned on some common byte margins?
>> - what are the data portability requirements to move/extract data from
>>   this system at a later point?
>> - how complex of an index are you willing to maintain to
>>   reconstruct/access data?
>> - What requirements are there about the ordering and accessibility of
>>   the packs? How related do the packed objects need to be? e.g. are they
>>   packed as they arrive, in time order, to build up successive packs of a
>>   target size, or are there many packs and you append to the "correct"
>>   pack for a given object?
>>
>> I'm normally distinctly in the camp that object storage systems should
>> natively expose all objects, but that also doesn't account for your
>> immutability/append-only nature.
>>
>> I see your discussion at https://forge.softwareheritage.org/T3054#58977
>> as well, about the "full scale out" vs "scale up metadata & scale out
>> data" parts.
>>
>> To brainstorm parts of an idea, I'm wondering about Git's
>> still-in-development partial clone work, with the caveat that you intend
>> to NEVER check out the entire repository at the same time.
>>
>> Ideally, using some manner of fuse filesystem (similar to Git Virtual
>> Filesystem) with an index-only clone, naive clients could access the
>> object they wanted, which would be fetched on demand from the git
>> server, which holds mostly git packs and a few loose objects that are
>> waiting for packing.
>>
>> The write path on ingest clients would involve sending back the new
>> data, and git background processes on some regular interval packing the
>> loose objects into new packfiles.
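
As a rough sketch of that ingest path, using only stock git plumbing and
assuming a bare repository living on CephFS (the fuse/partial-clone layer
is left out, and the repository path is made up):

    import subprocess

    REPO = "/cephfs/archive.git"   # hypothetical bare repo on a CephFS mount

    def ingest(content: bytes) -> str:
        """Store one immutable object as a loose git blob and return its sha1."""
        out = subprocess.run(["git", "-C", REPO, "hash-object", "-w", "--stdin"],
                             input=content, capture_output=True, check=True)
        return out.stdout.decode().strip()

    def repack() -> None:
        """Periodic job: fold loose objects into packfiles (delta + zlib)."""
        # git only repacks reachable objects, so a real ingest job would also
        # record the new blobs in a tree/ref before this step (omitted here).
        subprocess.run(["git", "-C", REPO, "repack", "-d", "-q"], check=True)

    def fetch(sha: str) -> bytes:
        """Read an object back, whether it is still loose or already packed."""
        out = subprocess.run(["git", "-C", REPO, "cat-file", "blob", sha],
                             capture_output=True, check=True)
        return out.stdout
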
>>
>> Running this on top of CephFS for now means that you get the ability to
>> move it to future storage systems more easily than any custom RBD/EOS
>> development you might do: bring up enough space, sync the files over,
>> profit.
>>
>> Git handles the deduplication, compression, access methods, and
>> generates large pack files, which Ceph can store more optimally than the
>> plethora of tiny objects.
>>
>> Overall, this isn't great, but there aren't a lot of alternatives as
>> your great research has noted.
>>
>> Being able to take a backup of the Git-on-CephFS is also made a lot
>> easier since it's a filesystem: "just" write out the 350TB to 20x LTO-9
>> tapes.
>>
>> Thinking back to older systems, like SGI's hierarchical storage modules
>> for XFS, the packing overhead starts to become significant for your
>> objects: some of the underlying mechanisms in the XFS HSM DMAPI, when
>> they ended up packing immutable objects to tape, still had tar or
>> tar-like headers (at least 512 bytes per object), so your 10B objects
>> would take at least 4TB of extra space (before compression).
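
(For scale: 10 billion objects times 512 bytes of header is about 5.1 TB,
roughly 4.7 TiB, so "at least 4TB" above is indeed a lower bound.)
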
>>
>>
>>> It was suggested[6] to have two different pools: one 4+2 erasure coded pool with compression for all objects with a size > 32KB that are expected to compress to 16KB, and another with 3 replicas for the smaller objects, to reduce space amplification to a minimum without compromising on durability. A client looking for an object could make two simultaneous requests to the two pools: it would get a 404 from one of them and the object from the other.
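
To make the two-pool lookup concrete, here is a minimal sketch with the
python-rados bindings; the pool names, ceph.conf path and size cap are
illustrative, not something settled in this thread:

    import rados
    from concurrent.futures import ThreadPoolExecutor

    SMALL_POOL = "objects-small"   # 3x replicated, for the tiny objects
    LARGE_POOL = "objects-large"   # 4+2 EC + compression, for the rest

    def read_or_none(ioctx, name, max_size=64 * 1024 * 1024):
        try:
            return ioctx.read(name, length=max_size)
        except rados.ObjectNotFound:
            return None

    def get_object(cluster, name):
        """Ask both pools at once; at most one of them should have the object."""
        small = cluster.open_ioctx(SMALL_POOL)
        large = cluster.open_ioctx(LARGE_POOL)
        try:
            with ThreadPoolExecutor(max_workers=2) as workers:
                results = list(workers.map(lambda io: read_or_none(io, name),
                                           (small, large)))
        finally:
            small.close()
            large.close()
        hits = [r for r in results if r is not None]
        if not hits:
            raise KeyError(name)   # "404" from both pools
        return hits[0]

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
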
>>>
>>> Another workaround is best described in the "Finding a needle in Haystack: Facebook’s photo storage"[9] paper and essentially boils down to using a database to store a map between the object name and its location. That does not scale out (writing the database index is the bottleneck) but it's simple enough, and it is successfully implemented in EOS[0] with >200PB worth of data and in seaweedfs[10], another promising object store based on the same idea.
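
For what it's worth, the core of the Haystack idea is small: append each
object to a large volume file (which could sit on an RBD image) and keep a
tiny per-object index entry. A sketch with sqlite3 standing in for the
index and a single writer assumed; paths and schema are made up for the
example:

    import os
    import sqlite3

    VOLUME = "/srv/volumes/volume-0001"   # large append-only file, e.g. on RBD
    INDEX = sqlite3.connect("/srv/needle-index.db")
    INDEX.execute("CREATE TABLE IF NOT EXISTS needle"
                  " (name TEXT PRIMARY KEY, volume TEXT,"
                  "  offset INTEGER, length INTEGER)")

    def put(name: str, data: bytes) -> None:
        """Append the object to the current volume and remember where it went."""
        with open(VOLUME, "ab") as vol:
            offset = vol.seek(0, os.SEEK_END)
            vol.write(data)
        with INDEX:   # commits the index entry
            INDEX.execute("INSERT INTO needle VALUES (?, ?, ?, ?)",
                          (name, VOLUME, offset, len(data)))

    def get(name: str) -> bytes:
        row = INDEX.execute("SELECT volume, offset, length FROM needle"
                            " WHERE name = ?", (name,)).fetchone()
        if row is None:
            raise KeyError(name)
        volume, offset, length = row
        with open(volume, "rb") as vol:
            vol.seek(offset)
            return vol.read(length)
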
>>>
>>> Instead of working around the problem, maybe Ceph could be modified to make better use of the immutability of these objects[7], a hint that is apparently only used to figure out how best to compress them and for checksum calculation[8]. I honestly have no clue how difficult it would be. All I know is that it's not easy, otherwise it would have been done already: there seems to be a general need for efficiently (space wise and performance wise) storing large quantities of objects smaller than 4KB.
>>>
>>> Is it more advisable to:
>>>
>>>   * work on Ceph internals to make it friendly to this particular workload or,
>>>   * write another implementation of "Finding a needle in Haystack: Facebook’s photo storage"[9] based on RBD[11]?
>>>
>>> I'm currently leaning toward working on Ceph internals, but there are pros and cons to both approaches[12]. And since all this is still very new to me, there is also the possibility that I'm missing something. Maybe it's *super* difficult to improve Ceph in this way. I should try to figure that out sooner rather than later.
>>>
>>> I realize it's a lot to take in, and unless you're facing the exact same problem there is very little chance you read this far :-) But if you did... I'm *really* interested to hear what you think. In any case I'll report back to this thread once a decision has been made.
>>>
>>> Cheers
>>>
>>> [0] https://eos-web.web.cern.ch/eos-web/
>>> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
>>> [2] https://forge.softwareheritage.org/T3054
>>> [3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/src/common/options.cc#L4326-L4330
>>> [4] https://forge.softwareheritage.org/T3052#58864
>>> [5] https://forge.softwareheritage.org/T3052#58917
>>> [6] https://forge.softwareheritage.org/T3052#58876
>>> [7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HINT_FLAG_IMMUTABLE
>>> [8] https://forge.softwareheritage.org/T3055
>>> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
>>> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
>>> [11] https://forge.softwareheritage.org/T3049
>>> [12] https://forge.softwareheritage.org/T3054#58977
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>>
>>
>>
>>
>>
>> --
>> Robin Hugh Johnson
>> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
>> E-Mail   : robbat2@xxxxxxxxxx
>> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
>> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

-- 
Loïc Dachary, Artisan Logiciel Libre



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
