On Wed, Feb 17, 2021 at 05:36:53PM +0100, Loïc Dachary wrote:
> Bonjour,
>
> TL;DR: Is it more advisable to work on Ceph internals to make it
> friendly to this particular workload or write something similar to
> EOS[0] (i.e. RocksDB + XRootD + RBD)?

CERN's EOSPPC instance, which is one of the biggest from what I can find, was up around 3.5B files in 2019; you're proposing 10B files, so I don't know how EOS will handle that. Maybe Dan can chime in on the scalability there.

Please do keep on with this important work! I've tried to do something similar at a much smaller scale for Gentoo Linux's historical collection of source code media (distfiles), but am significantly further behind your effort.

> Let's say those 10 billion objects are stored in a single 4+2 erasure
> coded pool, with bluestore compression set for objects that have a
> size > 32KB and the smallest allocation size for bluestore set to
> 4KB[3]. The 750TB won't use the expected 350TB but about 30% more,
> i.e. ~450TB (see [4] for the maths). This space amplification is
> because storing a 1 byte object uses the same space as storing a 16KB
> object (see [5] to repeat the experience at home). In a 4+2 erasure
> coded pool, each of the 6 chunks will use no less than 4KB because
> that's the smallest allocation size for bluestore. That's 4 * 4KB =
> 16KB even when all that is needed is 1 byte.

I think you have an error here: with a 4KB allocation size in a 4+2 pool, any object sized (0,16K] will take _6_ chunks: 24KB of storage. Any object sized (16K,32K] will take _12_ chunks: 48KB of storage. (A rough calculator to replay this arithmetic is sketched a few paragraphs below.)

I'd attack this from another side entirely:
- How aggressively do you want to pack objects overall? E.g. if you have a few thousand objects in the 4-5K range, do you want zero bytes wasted between objects?
- How aggressively do you want to dedup objects that share common data, especially if it's not aligned on some common byte margins?
- What are the data portability requirements to move/extract data from this system at a later point?
- How complex an index are you willing to maintain to reconstruct/access data?
- What requirements are there about the ordering and accessibility of the packs? How related do the packed objects need to be? E.g. are they packed as they arrive, in time order, building up successive packs to some target size, or are there many packs and you append to the "correct" pack for a given object?

I'm normally firmly in the camp that object storage systems should natively expose all objects, but that doesn't account for your immutable/append-only workload. I see your discussion at https://forge.softwareheritage.org/T3054#58977 as well, about the "full scale out" vs "scale up metadata & scale out data" approaches.

To brainstorm parts of an idea: I'm wondering about Git's still-in-development partial clone work, with the caveat that you intend to NEVER check out the entire repository at the same time. Ideally, using some manner of FUSE filesystem (similar to the Git Virtual Filesystem) with an index-only clone, naive clients could access the object they wanted, which would be fetched on demand from the Git server, which holds mostly Git packs plus a few loose objects that are waiting for packing. The write path on ingest clients would involve sending back the new data, with Git background processes packing the loose objects into new packfiles at some regular interval.
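To make the chunk arithmetic above easy to replay, here is a rough back-of-the-envelope sketch (plain Python, nothing Ceph-specific; the 4+2 profile, the 4KB stripe unit and the 4KB bluestore min_alloc_size are assumptions taken from the discussion above, and the model simply rounds every shard up to whole allocation units):

```python
# Rough space-amplification estimate for small objects in a k+m erasure
# coded pool on bluestore. Assumptions (not measured from a live cluster):
# stripe_unit == bluestore min_alloc_size == 4KB, and every shard is
# rounded up to a whole number of allocation units.
import math

def ec_allocated_bytes(obj_size, k=4, m=2, stripe_unit=4096, min_alloc=4096):
    """Bytes actually allocated on disk for one object of obj_size bytes."""
    stripe_width = k * stripe_unit                       # 16KB for 4+2 / 4KB
    stripes = max(1, math.ceil(obj_size / stripe_width)) # full stripes used
    shard_size = stripes * stripe_unit                   # bytes per shard
    shard_alloc = math.ceil(shard_size / min_alloc) * min_alloc
    return (k + m) * shard_alloc                         # data + coding shards

for size in (1, 16 * 1024, 16 * 1024 + 1, 32 * 1024):
    print(f"{size:>6} B object -> {ec_allocated_bytes(size) // 1024} KB allocated")
```

Under those assumptions a 1 byte object and a 16KB object both allocate 24KB, and anything in (16K,32K] allocates 48KB.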
Running this on top of CephFS for now means you retain the ability to move it to future storage systems more easily than with any custom RBD/EOS development you might do: bring up enough space, sync the files over, profit. Git handles the deduplication, compression and access methods, and generates large pack files, which Ceph can store far more efficiently than a plethora of tiny objects. Overall this isn't great, but there aren't a lot of alternatives, as your great research has noted. Taking a backup of the Git-on-CephFS tree is also made a lot easier since it's a filesystem: "just" write out the 350TB to 20x LTO-9 tapes.

Thinking back to older systems, like SGI's hierarchical storage modules for XFS, the packing overhead starts to become significant for objects this small: some of the underlying mechanisms in the XFS HSM/DMAPI stack, when they packed immutable objects to tape, still used tar and tar-like headers (at least 512 bytes per object), so your 10B objects would take at least 4TB of extra space (before compression).

> It was suggested[6] to have two different pools: one a 4+2 erasure
> coded pool with compression for all objects with a size > 32KB that
> are expected to compress to 16KB, and another with 3 replicas for the
> smaller objects, to reduce space amplification to a minimum without
> compromising on durability. A client looking for an object could make
> two simultaneous requests to the two pools. It would get a 404 from
> one of them and the object from the other.
>
> Another workaround is best described in the "Finding a needle in
> Haystack: Facebook’s photo storage"[9] paper and essentially boils
> down to using a database to store a map between the object name and
> its location. That does not scale out (writing the database index is
> the bottleneck) but it's simple enough, and it is successfully
> implemented in EOS[0] with >200PB worth of data and in seaweedfs[10],
> another promising object store based on the same idea.
>
> Instead of working around the problem, maybe Ceph could be modified
> to make better use of the immutability of these objects[7], a hint
> that is apparently only used to figure out how best to compress them
> and for checksum calculation[8]. I honestly have no clue how
> difficult that would be. All I know is that it's not easy, otherwise
> it would have been done already: there seems to be a general need for
> efficiently (space-wise and performance-wise) storing large
> quantities of objects smaller than 4KB.
>
> Is it more advisable to:
>
> * work on Ceph internals to make it friendly to this particular workload, or
> * write another implementation of "Finding a needle in Haystack:
>   Facebook’s photo storage"[9] based on RBD[11]?
>
> I'm currently leaning toward working on Ceph internals but there are
> pros and cons to both approaches[12]. And since all this is still
> very new to me, there is also the possibility that I'm missing
> something. Maybe it's *super* difficult to improve Ceph in this way.
> I should try to figure that out sooner rather than later.
>
> I realize it's a lot to take in, and unless you're facing the exact
> same problem there is very little chance you read this far :-) But if
> you did... I'm *really* interested to hear what you think. In any
> case I'll report back to this thread once a decision has been made.
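Coming back to the Haystack-style workaround quoted above: the needle index that paper describes is conceptually tiny. A minimal sketch, assuming append-only volume files (which could sit on RBD or CephFS) and purely illustrative names, would be roughly:

```python
# Minimal sketch of a Haystack-style "needle" index: object payloads are
# appended to large volume files and a small index maps object id ->
# (volume, offset, length). Names and layout here are illustrative, not
# taken from any real implementation; in Haystack/EOS/seaweedfs this map
# lives in a database or dedicated index service, not a Python dict.
from dataclasses import dataclass

@dataclass
class Needle:
    volume: str   # path of the append-only volume file
    offset: int   # byte offset of the payload inside the volume
    length: int   # payload size in bytes

index: dict[bytes, Needle] = {}

def put(object_id: bytes, payload: bytes, volume_path: str) -> None:
    """Append the payload to the volume and record where it landed."""
    with open(volume_path, "ab") as vol:
        offset = vol.tell()          # current end of the volume file
        vol.write(payload)
    index[object_id] = Needle(volume_path, offset, len(payload))

def get(object_id: bytes) -> bytes:
    """Look up the needle and read the payload back from the volume."""
    n = index[object_id]
    with open(n.volume, "rb") as vol:
        vol.seek(n.offset)
        return vol.read(n.length)
```

The read and write paths are trivial; the hard part is the index itself, which is exactly where the write bottleneck you mention shows up.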
> Cheers
>
> [0] https://eos-web.web.cern.ch/eos-web/
> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/
>     https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
> [2] https://forge.softwareheritage.org/T3054
> [3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/src/common/options.cc#L4326-L4330
> [4] https://forge.softwareheritage.org/T3052#58864
> [5] https://forge.softwareheritage.org/T3052#58917
> [6] https://forge.softwareheritage.org/T3052#58876
> [7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HINT_FLAG_IMMUTABLE
> [8] https://forge.softwareheritage.org/T3055
> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
> [11] https://forge.softwareheritage.org/T3049
> [12] https://forge.softwareheritage.org/T3054#58977
>
> --
> Loïc Dachary, Artisan Logiciel Libre

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136