Hi Greg,

On 02/02/2021 20:34, Gregory Farnum wrote:
> Packing's obviously a good idea for storing these kinds of artifacts
> in Ceph, and hacking through the existing librbd might indeed be
> easier than building something up from raw RADOS, especially if you
> want to use stuff like rbd-mirror.
>
> My main concern would just be as Dan points out, that we don't test
> rbd with extremely large images and we know deleting that image will
> take a looooong time — I don't know of other issues off the top of my
> head, and in the worst case you could always fall back to manipulating
> it with raw librados if there is an issue.
Right. Dan's comment gave me pause: it does not seem to be a good idea
to assume an RBD image of infinite size. A friend who read this thread
suggested a sensible approach (which is also in line with the Haystack
paper): instead of making a single gigantic image, make multiple 1TB
images. The index is bigger:

  SHA256 sum of the artifact => name/uuid of the 1TB image, offset, size

instead of:

  SHA256 sum of the artifact => offset, size

But each image still provides packing for over 100 million artifacts
when the average artifact size is around 3KB. It also allows (a rough
sketch follows this list):

* multiple writers (one for each image),
* rbd-mirroring individual 1TB images to a different Ceph cluster
  (challenging with a single 100TB+ image),
* copying a 1TB image from a pool with a given erasure code profile to
  another pool with a different profile,
* growing from 1TB to 2TB in the future by merging two 1TB images,
* etc.
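To make that concrete, here is a minimal sketch of the writer and the
lookup using the python-rbd/python-rados bindings. The pool name, the
in-memory dict standing in for the index, and the Packer/get names are
all made up for illustration; a real implementation would persist the
index somewhere durable and handle image rollover and errors:

    import hashlib

    import rados
    import rbd

    POOL = 'shpack'          # hypothetical pool name
    IMAGE_SIZE = 1 << 40     # 1TB per packing image

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)

    # SHA256 => (image name, offset, size); a plain dict stands in for
    # the real index, which lives somewhere else.
    index = {}

    class Packer:
        """Appends artifacts sequentially to one 1TB RBD image."""

        def __init__(self, image_name):
            rbd.RBD().create(ioctx, image_name, IMAGE_SIZE)
            self.image = rbd.Image(ioctx, image_name)
            self.name = image_name
            self.offset = 0

        def add(self, artifact):
            # librbd takes arbitrary offsets and lengths, so the
            # packing itself needs no alignment logic.
            assert self.offset + len(artifact) <= IMAGE_SIZE
            self.image.write(artifact, self.offset)
            sha = hashlib.sha256(artifact).hexdigest()
            index[sha] = (self.name, self.offset, len(artifact))
            self.offset += len(artifact)
            return sha

    def get(sha):
        image_name, offset, size = index[sha]
        with rbd.Image(ioctx, image_name, read_only=True) as image:
            return image.read(offset, size)

One Packer per image is what gives the multiple-writers property above,
and a read only needs the index entry, so any client can serve reads.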
> But you might also check in on the status of Danny Al-Gaaf's rados
> email project. Email and these artifacts seemingly have a lot in
> common.
They do. This is inspiring:

https://github.com/ceph-dovecot/dovecot-ceph-plugin
https://github.com/ceph-dovecot/dovecot-ceph-plugin/tree/master/src/librmb

Thanks for the pointer.

Cheers

> -Greg
>
> On Mon, Feb 1, 2021 at 12:52 PM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
>> Hi Dan,
>>
>> On 01/02/2021 21:13, Dan van der Ster wrote:
>>> Hi Loïc,
>>>
>>> We've never managed 100TB+ in a single RBD volume. I can't think of
>>> anything, but perhaps there are some unknown limitations when they
>>> get so big.
>>> It should be easy enough to use rbd bench to create and fill a
>>> massive test image to validate everything works well at that size.
>> Good idea! I'll look for a cluster with 100TB of free space and post
>> my findings.
>>> Also, I assume you'll be doing the IO from just one client? Multiple
>>> readers/writers to a single volume could get complicated.
>> Yes.
>>> Otherwise, yes RBD sounds very convenient for what you need.
>> It is inspired by
>> https://static.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf
>> which suggests an ad-hoc implementation to pack immutable objects
>> together. But I think RBD already provides the underlying logic, even
>> though it is not specialized for this use case. RGW also packs small
>> objects together and would be a good candidate. But it provides more
>> flexibility to modify/delete objects and I assume it will be slower
>> to write N objects with RGW than to write them sequentially to an
>> RBD image. But I did not try, and maybe I should.
>>
>> To be continued.
>>> Cheers, Dan
>>>
>>> On Sat, Jan 30, 2021, 4:01 PM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>>> Bonjour,
>>>>
>>>> In the context of Software Heritage (a noble mission to preserve
>>>> all source code)[0], artifacts have an average size of ~3KB and
>>>> there are billions of them. They never change and are never
>>>> deleted. To save space it would make sense to write them, one
>>>> after the other, in an ever-growing RBD volume (more than 100TB).
>>>> An index, located somewhere else, would record the offset and size
>>>> of the artifacts in the volume.
>>>>
>>>> I wonder if someone already implemented this idea with success? And
>>>> if not... does anyone see a reason why it would be a bad idea?
>>>>
>>>> Cheers
>>>>
>>>> [0] https://docs.softwareheritage.org/
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx