Hi Robin,

On 18/02/2021 00:35, Robin H. Johnson wrote:
> On Wed, Feb 17, 2021 at 05:36:53PM +0100, Loïc Dachary wrote:
>> Bonjour,
>>
>> TL;DR: Is it more advisable to work on Ceph internals to make it
>> friendly to this particular workload, or to write something similar to
>> EOS[0] (i.e. RocksDB + XRootD + RBD)?
> CERN's EOSPPC instance, which is one of the biggest from what I can
> find, was up around 3.5B files in 2019; and you're proposing running 10B
> files, so I don't know how EOS will handle that. Maybe Dan can chime in
> on the scalability there.

This is an essential piece of information I was missing. It also makes
sense that the objects are much larger in the context of CERN.

> Please do keep on this important work! I've tried to do something
> similar at a much smaller scale for Gentoo Linux's historical collection
> of source code media (distfiles), but am significantly further behind
> your effort.

Thanks for the encouragement! These are very preliminary stages, but I'm
enthusiastic about what comes next because I'll have the opportunity to
work on this until a solution is implemented and deployed.

>> Let's say those 10 billion objects are stored in a single 4+2 erasure
>> coded pool, with bluestore compression set for objects that have a size
>> over 32KB and the smallest allocation size for bluestore set to 4KB[3].
>> The 750TB won't use the expected 350TB but about 30% more, i.e.
>> ~450TB (see [4] for the maths). This space amplification happens because
>> storing a 1 byte object uses the same space as storing a 16KB object
>> (see [5] to repeat the experiment at home). In a 4+2 erasure coded
>> pool, each of the 6 chunks will use no less than 4KB because that's
>> the smallest allocation size for bluestore. That's 4 * 4KB = 16KB
>> even when all that is needed is 1 byte.
> I think you have an error here: with a 4KB allocation size in a 4+2 pool,
> any object sized (0,16K] will take _6_ chunks: 24KB of storage.
> Any object sized (16K,32K] will take _12_ chunks: 48KB of storage.

I should have mentioned that my calculations were ignoring the
replication overhead (parity chunks or copies). Good catch :-)
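To make the corrected arithmetic easy to verify, here is a
back-of-the-envelope sketch (a purely illustrative Python snippet,
assuming every object is striped across all 4 data chunks and each
chunk is rounded up to the 4KB allocation unit):

    import math

    ALLOC = 4 * 1024  # bluestore smallest allocation size
    K, M = 4, 2       # 4+2 erasure coding

    def raw_usage(size):
        """Raw bytes consumed by one object of `size` bytes, counting
        data and parity chunks rounded up to the allocation unit."""
        chunk = math.ceil(size / K)               # bytes per data chunk
        units = max(1, math.ceil(chunk / ALLOC))  # allocation units per chunk
        return (K + M) * units * ALLOC

    assert raw_usage(1) == 24 * 1024          # 6 chunks of 4KB
    assert raw_usage(16 * 1024) == 24 * 1024  # same cost as a 1 byte object
    assert raw_usage(32 * 1024) == 48 * 1024  # 12 allocation units

Counting only the K data chunks, i.e. dropping the (K + M) factor to K,
gives back the 16KB figure from my original message.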
> I'd attack this from another side entirely:
> - how aggressively do you want to pack objects overall? e.g. if you have
> a few thousand objects in the 4-5K range, do you want zero bytes
> wasted between objects?

50% of the objects have a size below 4KB; that is ~5 billion objects
currently, and growing. *But* they account for only 1% of the total
size. So: maybe not very aggressively, but not passively either.

> - how aggressively do you want to dedup objects that share common data,
> esp. if it's not aligned on some common byte margins?

Objects/files are addressed by the SHA256 of their content, and that
takes care of deduplication.

> - what are the data portability requirements to move/extract data from
> this system at a later point?

Data portability is ensured by using Free Software only, and open
standards where possible, and by distributing the software in a way
that can conveniently be installed by a third party. Does that answer
your question? The durability of the software/format couple used to
store the data is something I'm not worried about, but maybe I should
be.

> - how complex of an index are you willing to maintain to
> reconstruct/access data?

I don't envision the index being more complex than SHA256 => content
(roughly); see the sketch at the end of this mail.

> - What requirements are there about the ordering and accessibility of
> the packs? How related do the pack objects need to be? e.g. are they
> packed as they arrive, in time order, to build up successive packs of
> a given size, or are there many packs and you append to the "correct"
> pack for a given object?

There are no ordering requirements.

> I'm normally distinctly in the camp that object storage systems should
> natively expose all objects, but that also doesn't account for your
> immutability/append-only nature.
>
> I see your discussion at https://forge.softwareheritage.org/T3054#58977
> as well, about the "full scale out" vs "scale up metadata & scale out
> data" parts.
>
> To brainstorm parts of an idea, I'm wondering about Git's
> still-in-development partial clone work, [snip]

I did not know about "partial clone" and will explore it in
https://forge.softwareheritage.org/T3065. Although it is probably not a
good fit for a 2021 solution, it sounds like a great source of
inspiration.

> Thinking back to older systems, like SGI's hierarchical storage modules
> for XFS, the packing overhead starts to become significant for your
> objects: some of the underlying mechanisms in the XFS HSM DMAPI, if they
> ended up packing immutable objects to tape, still had tar & tar-like
> headers (at least 512 bytes per object); your 10B objects would take at
> least 4TB of extra space (before compression).

I'm tempted to overlook lessons from the past, in part because I'm
afraid I'd lose myself :-) and in part because I assume the world has
changed a lot since then. If, however, you have a hunch that it might
be useful, I'll give it a try.
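To make the index question concrete, here is a minimal sketch of the
packing I have in mind (illustrative Python only: the names and the 4MB
pack size are made up, and real packs would of course live in RADOS
objects rather than in memory):

    import hashlib

    PACK_SIZE = 4 * 1024 * 1024  # hypothetical pack size

    class PackWriter:
        """Append-only packing of small immutable objects.
        The index maps SHA256 -> (pack number, offset, length)."""

        def __init__(self):
            self.packs = [bytearray()]
            self.index = {}

        def put(self, data):
            key = hashlib.sha256(data).hexdigest()
            if key in self.index:  # content addressing dedups for free
                return key
            if len(self.packs[-1]) + len(data) > PACK_SIZE:
                self.packs.append(bytearray())  # start the next pack
            self.index[key] = (len(self.packs) - 1,
                               len(self.packs[-1]), len(data))
            self.packs[-1].extend(data)
            return key

        def get(self, key):
            pack, offset, length = self.index[key]
            return bytes(self.packs[pack][offset:offset + length])

Because objects are immutable and packs append-only, an index entry is
written once and never updated, which keeps the whole thing close to
the "SHA256 => content" mapping mentioned above.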
Thanks for the great feedback!

-- 
Loïc Dachary, Artisan Logiciel Libre