I still prefer the simplest solution. There are 4U servers with 110 x 20TB disks on the market. After RAID you get 1.5PiB per server. At 50TB per month, this is about 30 months of data. 2 such servers will hold 5 years of data with minimal problems. If you need backup, then buy 2 more sets and just send ZFS snapshot diffs to them.

On Wed, Feb 17, 2021 at 11:15 PM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
>
>
>
> On 17/02/2021 18:27, Serkan Çoban wrote:
> > Why not put all the data in a ZFS pool with a directory structure
> > 3-4 levels deep, each directory named with 2 hex characters in the
> > range 00-FF?
> > Four levels deep, you get 256^4≈4.3B folders with 3-4 objects per folder
> > or three levels deep you get 256^3≈16.8M folders with ~1000 objects
> > each.
> It is more or less the current setup :-) I should have mentioned that there currently are ~750TB and 10 billion objects. But it's growing by 50TB every month and it will keep growing indefinitely. That is why a solution that scales out is desirable.
> >
> > On Wed, Feb 17, 2021 at 8:14 PM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
> >> Hi Nathan,
> >>
> >> Good thinking :-) The names of the objects are indeed the SHA256 of their content, which provides deduplication.
> >>
> >> Cheers
> >>
> >> On 17/02/2021 18:04, Nathan Fish wrote:
> >>> I'm not much of a programmer, but as soon as I hear "immutable
> >>> objects" I think "content-addressed". I don't know if you have many
> >>> duplicate objects in this set, but content-addressing gives you
> >>> object-level dedup for free. Do you have to preserve some meaningful
> >>> object names from the original dataset, or do you just need some
> >>> kind of ID?
> >>>
> >>> On Wed, Feb 17, 2021 at 11:37 AM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
> >>>> Bonjour,
> >>>>
> >>>> TL;DR: Is it more advisable to work on Ceph internals to make it friendly to this particular workload or write something similar to EOS[0] (i.e. RocksDB + XRootD + RBD)?
> >>>>
> >>>> This is a follow-up to two previous mails[1] sent while researching this topic. In a nutshell, the Software Heritage project[1] currently has ~750TB and 10 billion objects, 75% of which have a size smaller than 16KB and 50% have a size smaller than 4KB. But the objects smaller than 16KB only account for ~5% of the 750TB: the 25% of objects with a size > 16KB total ~700TB. The objects can be compressed by ~50%, so the 750TB only need ~350TB of actual storage (if you're interested in the details see [2]).
> >>>>
> >>>> Let's say those 10 billion objects are stored in a single 4+2 erasure coded pool with bluestore compression set for objects that have a size > 32KB and the smallest allocation size for bluestore set to 4KB[3]. The 750TB won't use the expected 350TB but about 30% more, i.e. ~450TB (see [4] for the maths). This space amplification is because storing a 1 byte object uses the same space as storing a 16KB object (see [5] to repeat the experiment at home). In a 4+2 erasure coded pool, each of the 6 chunks will use no less than 4KB because that's the smallest allocation size for bluestore. That's 4 * 4KB = 16KB for the data chunks alone (6 * 4KB = 24KB including the two coding chunks) even when all that is needed is 1 byte.
> >>>>
> >>>> It was suggested[6] to have two different pools: one 4+2 erasure coded pool with compression for all objects with a size > 32KB that are expected to compress to 16KB, and another with 3 replicas for the smaller objects, to reduce space amplification to a minimum without compromising on durability. A client looking for an object could make two simultaneous requests to the two pools. It would get a 404 from one of them and the object from the other.
> >>>>
> >>>> Another workaround is best described in the "Finding a needle in Haystack: Facebook’s photo storage"[9] paper and essentially boils down to using a database to store a map between the object name and its location.
> >>>> That does not scale out (writing the database index is the bottleneck) but it's simple enough and is successfully implemented in EOS[0] with >200PB worth of data and in seaweedfs[10], another promising object store based on the same idea.
> >>>>
> >>>> Instead of working around the problem, maybe Ceph could be modified to make better use of the immutability of these objects[7], a hint that is apparently only used to figure out how to best compress an object and for checksum calculation[8]. I honestly have no clue how difficult it would be. All I know is that it's not easy, otherwise it would have been done already: there seems to be a general need for storing large quantities of objects smaller than 4KB efficiently (space-wise and performance-wise).
> >>>>
> >>>> Is it more advisable to:
> >>>>
> >>>> * work on Ceph internals to make it friendly to this particular workload or,
> >>>> * write another implementation of "Finding a needle in Haystack: Facebook’s photo storage"[9] based on RBD[11]?
> >>>>
> >>>> I'm currently leaning toward working on Ceph internals but there are pros and cons to both approaches[12]. And since all this is still very new to me, there is also the possibility that I'm missing something. Maybe it's *super* difficult to improve Ceph in this way. I should try to figure that out sooner rather than later.
> >>>>
> >>>> I realize it's a lot to take in and unless you're facing the exact same problem there is very little chance you read this far :-) But if you did... I'm *really* interested to hear what you think. In any case I'll report back to this thread once a decision has been made.
> >>>>
> >>>> Cheers
> >>>>
> >>>> [0] https://eos-web.web.cern.ch/eos-web/
> >>>> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/
> >>>>     https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
> >>>> [2] https://forge.softwareheritage.org/T3054
> >>>> [3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/src/common/options.cc#L4326-L4330
> >>>> [4] https://forge.softwareheritage.org/T3052#58864
> >>>> [5] https://forge.softwareheritage.org/T3052#58917
> >>>> [6] https://forge.softwareheritage.org/T3052#58876
> >>>> [7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HINT_FLAG_IMMUTABLE
> >>>> [8] https://forge.softwareheritage.org/T3055
> >>>> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
> >>>> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
> >>>> [11] https://forge.softwareheritage.org/T3049
> >>>> [12] https://forge.softwareheritage.org/T3054#58977
> >>>>
> >>>> --
> >>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
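
For readers who want to check the space amplification argument from the original message (4+2 erasure coding versus 3 replicas, with a 4KB minimum allocation), here is a minimal sketch in Python. It is a simplified model, not how bluestore actually accounts for space: it assumes every chunk or replica is rounded up to a whole number of 4KB allocation units, and it ignores compression, metadata and striping details. The function names are made up for illustration.

```python
import math

KIB = 1024


def ec_raw_bytes(size, k=4, m=2, min_alloc=4 * KIB):
    """Raw bytes used by one object in a k+m erasure coded pool (simplified model)."""
    # The object is split into k data chunks; the m coding chunks have the same size.
    chunk = math.ceil(size / k)
    # Assumption: each chunk occupies a whole number of allocation units, at least one.
    units = max(1, math.ceil(chunk / min_alloc))
    return (k + m) * units * min_alloc


def replica_raw_bytes(size, replicas=3, min_alloc=4 * KIB):
    """Raw bytes used by one object in a replicated pool (simplified model)."""
    # Assumption: each replica occupies a whole number of allocation units, at least one.
    units = max(1, math.ceil(size / min_alloc))
    return replicas * units * min_alloc


# Compare both layouts for a few object sizes mentioned in the thread.
for size in (1, 4 * KIB, 8 * KIB, 16 * KIB, 32 * KIB):
    print(size, ec_raw_bytes(size), replica_raw_bytes(size))
```

Under these assumptions a 1 byte object and a 16KB object both consume 24KB of raw space in the 4+2 pool, which is the amplification described in the original message. For objects up to 4KB, 3 replicas consume 12KB, half of what the 4+2 pool needs, which is the rationale behind the two-pool suggestion.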