Hi Greg,

On 02/02/2021 20:34, Gregory Farnum wrote:
> Packing's obviously a good idea for storing these kinds of artifacts
> in Ceph, and hacking through the existing librbd might indeed be
> easier than building something up from raw RADOS, especially if you
> want to use stuff like rbd-mirror.
>
> My main concern would just be as Dan points out, that we don't test
> rbd with extremely large images and we know deleting that image will
> take a looooong time — I don't know of other issues off the top of my
> head, and in the worst case you could always fall back to manipulating
> it with raw librados if there is an issue.
Right. Dan's comment gave me pause: it does not seem to be a good idea
to assume an RBD image of infinite size. A friend who read this thread
suggested a sensible approach (which is also in line with the Haystack
paper): instead of making a single gigantic image, make multiple 1TB
images. The index is bigger:

  SHA256 sum of the artifact => name/uuid of the 1TB image, offset, size

instead of:

  SHA256 sum of the artifact => offset, size

But each image still provides packing for over 100 million artifacts
when the average artifact size is around 3KB. It also allows (a rough
sketch follows this list):

* multiple writers (one for each image),
* rbd-mirroring individual 1TB images to a different Ceph cluster
  (challenging with a single 100TB+ image),
* copying a 1TB image from a pool with a given erasure code profile to
  another pool with a different profile,
* growing from 1TB to 2TB in the future by merging two 1TB images,
* etc.
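To make that concrete, here is a minimal sketch of the writer and the
lookup using the python-rbd/python-rados bindings. The pool name, the
in-memory dict standing in for the index, and the Packer/get names are
all made up for illustration; a real implementation would persist the
index somewhere durable and handle image rollover and errors:

    import hashlib

    import rados
    import rbd

    POOL = 'shpack'          # hypothetical pool name
    IMAGE_SIZE = 1 << 40     # 1TB per packing image

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)

    # SHA256 => (image name, offset, size); a plain dict stands in for
    # the real index, which lives somewhere else.
    index = {}

    class Packer:
        """Appends artifacts sequentially to one 1TB RBD image."""

        def __init__(self, image_name):
            rbd.RBD().create(ioctx, image_name, IMAGE_SIZE)
            self.image = rbd.Image(ioctx, image_name)
            self.name = image_name
            self.offset = 0

        def add(self, artifact):
            # librbd takes arbitrary offsets and lengths, so the
            # packing itself needs no alignment logic.
            assert self.offset + len(artifact) <= IMAGE_SIZE
            self.image.write(artifact, self.offset)
            sha = hashlib.sha256(artifact).hexdigest()
            index[sha] = (self.name, self.offset, len(artifact))
            self.offset += len(artifact)
            return sha

    def get(sha):
        image_name, offset, size = index[sha]
        with rbd.Image(ioctx, image_name, read_only=True) as image:
            return image.read(offset, size)

One Packer per image is what gives the multiple-writers property above,
and a read only needs the index entry, so any client can serve reads.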
> But you might also check in on the status of Danny Al-Gaaf's rados
> email project. Email and these artifacts seemingly have a lot in
> common.
They do. This is inspiring:

https://github.com/ceph-dovecot/dovecot-ceph-plugin
https://github.com/ceph-dovecot/dovecot-ceph-plugin/tree/master/src/librmb

Thanks for the pointer.

Cheers

> -Greg
>
> On Mon, Feb 1, 2021 at 12:52 PM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
>> Hi Dan,
>>
>> On 01/02/2021 21:13, Dan van der Ster wrote:
>>> Hi Loïc,
>>>
>>> We've never managed 100TB+ in a single RBD volume. I can't think of
>>> anything, but perhaps there are some unknown limitations when they
>>> get so big.
>>> It should be easy enough to use rbd bench to create and fill a
>>> massive test image to validate everything works well at that size.
>> Good idea! I'll look for a cluster with 100TB of free space and post
>> my findings.
>>> Also, I assume you'll be doing the IO from just one client? Multiple
>>> readers/writers to a single volume could get complicated.
>> Yes.
>>> Otherwise, yes RBD sounds very convenient for what you need.
>> It is inspired by
>> https://static.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf
>> which suggests an ad-hoc implementation to pack immutable objects
>> together. But I think RBD already provides the underlying logic, even
>> though it is not specialized for this use case. RGW also packs small
>> objects together and would be a good candidate. But it provides more
>> flexibility to modify/delete objects and I assume it will be slower
>> to write N objects with RGW than to write them sequentially to an
>> RBD image. But I did not try, and maybe I should.
>>
>> To be continued.
>>> Cheers, Dan
>>>
>>> On Sat, Jan 30, 2021, 4:01 PM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>>> Bonjour,
>>>>
>>>> In the context of Software Heritage (a noble mission to preserve
>>>> all source code)[0], artifacts have an average size of ~3KB and
>>>> there are billions of them. They never change and are never
>>>> deleted. To save space it would make sense to write them, one
>>>> after the other, in an ever-growing RBD volume (more than 100TB).
>>>> An index, located somewhere else, would record the offset and size
>>>> of the artifacts in the volume.
>>>>
>>>> I wonder if someone already implemented this idea with success? And
>>>> if not... does anyone see a reason why it would be a bad idea?
>>>>
>>>> Cheers
>>>>
>>>> [0] https://docs.softwareheritage.org/
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx