On Wed, Feb 17, 2021 at 05:36:53PM +0100, Loïc Dachary wrote:
> Bonjour,
>
> TL;DR: Is it more advisable to work on Ceph internals to make it
> friendly to this particular workload or write something similar to
> EOS[0] (i.e. RocksDB + XRootD + RBD)?

CERN's EOSPPC instance, which is one of the biggest from what I can find, was up around 3.5B files in 2019; you're proposing 10B files, so I don't know how EOS will handle that. Maybe Dan can chime in on the scalability there.

Please do keep on with this important work! I've tried to do something similar at a much smaller scale for Gentoo Linux's historical collection of source code media (distfiles), but am significantly further behind your effort.

> Let's say those 10 billion objects are stored in a single 4+2 erasure
> coded pool, with bluestore compression set for objects that have a
> size > 32KB and the smallest allocation size for bluestore set to
> 4KB[3]. The 750TB won't use the expected 350TB but about 30% more,
> i.e. ~450TB (see [4] for the maths). This space amplification is
> because storing a 1 byte object uses the same space as storing a 16KB
> object (see [5] to repeat the experience at home). In a 4+2 erasure
> coded pool, each of the 6 chunks will use no less than 4KB because
> that's the smallest allocation size for bluestore. That's 4 * 4KB =
> 16KB even when all that is needed is 1 byte.

I think you have an error here: with a 4KB allocation size in a 4+2 pool, any object sized (0,16K] will take _6_ chunks: 24KB of storage. Any object sized (16K,32K] will take _12_ chunks: 48KB of storage. (A rough calculator to replay this arithmetic is sketched a few paragraphs below.)

I'd attack this from another side entirely:
- How aggressively do you want to pack objects overall? E.g. if you have a few thousand objects in the 4-5K range, do you want zero bytes wasted between objects?
- How aggressively do you want to dedup objects that share common data, especially if it's not aligned on some common byte margins?
- What are the data portability requirements to move/extract data from this system at a later point?
- How complex an index are you willing to maintain to reconstruct/access data?
- What requirements are there about the ordering and accessibility of the packs? How related do the packed objects need to be? E.g. are they packed as they arrive, in time order, building up successive packs to some target size, or are there many packs and you append to the "correct" pack for a given object?

I'm normally firmly in the camp that object storage systems should natively expose all objects, but that doesn't account for your immutable/append-only workload. I see your discussion at https://forge.softwareheritage.org/T3054#58977 as well, about the "full scale out" vs "scale up metadata & scale out data" approaches.

To brainstorm parts of an idea: I'm wondering about Git's still-in-development partial clone work, with the caveat that you intend to NEVER check out the entire repository at the same time. Ideally, using some manner of FUSE filesystem (similar to the Git Virtual Filesystem) with an index-only clone, naive clients could access the object they wanted, which would be fetched on demand from the Git server, which holds mostly Git packs plus a few loose objects that are waiting for packing. The write path on ingest clients would involve sending back the new data, with Git background processes packing the loose objects into new packfiles at some regular interval.
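To make the chunk arithmetic above easy to replay, here is a rough back-of-the-envelope sketch (plain Python, nothing Ceph-specific; the 4+2 profile, the 4KB stripe unit and the 4KB bluestore min_alloc_size are assumptions taken from the discussion above, and the model simply rounds every shard up to whole allocation units):

```python
# Rough space-amplification estimate for small objects in a k+m erasure
# coded pool on bluestore. Assumptions (not measured from a live cluster):
# stripe_unit == bluestore min_alloc_size == 4KB, and every shard is
# rounded up to a whole number of allocation units.
import math

def ec_allocated_bytes(obj_size, k=4, m=2, stripe_unit=4096, min_alloc=4096):
    """Bytes actually allocated on disk for one object of obj_size bytes."""
    stripe_width = k * stripe_unit                       # 16KB for 4+2 / 4KB
    stripes = max(1, math.ceil(obj_size / stripe_width)) # full stripes used
    shard_size = stripes * stripe_unit                   # bytes per shard
    shard_alloc = math.ceil(shard_size / min_alloc) * min_alloc
    return (k + m) * shard_alloc                         # data + coding shards

for size in (1, 16 * 1024, 16 * 1024 + 1, 32 * 1024):
    print(f"{size:>6} B object -> {ec_allocated_bytes(size) // 1024} KB allocated")
```

Under those assumptions a 1 byte object and a 16KB object both allocate 24KB, and anything in (16K,32K] allocates 48KB.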
Running this on top of CephFS for now means you retain the ability to move it to future storage systems more easily than with any custom RBD/EOS development you might do: bring up enough space, sync the files over, profit. Git handles the deduplication, compression and access methods, and generates large pack files, which Ceph can store far more efficiently than a plethora of tiny objects. Overall this isn't great, but there aren't a lot of alternatives, as your great research has noted. Taking a backup of the Git-on-CephFS tree is also made a lot easier since it's a filesystem: "just" write out the 350TB to 20x LTO-9 tapes.

Thinking back to older systems, like SGI's hierarchical storage modules for XFS, the packing overhead starts to become significant for objects this small: some of the underlying mechanisms in the XFS HSM/DMAPI stack, when they packed immutable objects to tape, still used tar and tar-like headers (at least 512 bytes per object), so your 10B objects would take at least 4TB of extra space (before compression).

> It was suggested[6] to have two different pools: one a 4+2 erasure
> coded pool with compression for all objects with a size > 32KB that
> are expected to compress to 16KB, and another with 3 replicas for the
> smaller objects, to reduce space amplification to a minimum without
> compromising on durability. A client looking for an object could make
> two simultaneous requests to the two pools. It would get a 404 from
> one of them and the object from the other.
>
> Another workaround is best described in the "Finding a needle in
> Haystack: Facebook’s photo storage"[9] paper and essentially boils
> down to using a database to store a map between the object name and
> its location. That does not scale out (writing the database index is
> the bottleneck) but it's simple enough, and it is successfully
> implemented in EOS[0] with >200PB worth of data and in seaweedfs[10],
> another promising object store based on the same idea.
>
> Instead of working around the problem, maybe Ceph could be modified
> to make better use of the immutability of these objects[7], a hint
> that is apparently only used to figure out how best to compress them
> and for checksum calculation[8]. I honestly have no clue how
> difficult that would be. All I know is that it's not easy, otherwise
> it would have been done already: there seems to be a general need for
> efficiently (space-wise and performance-wise) storing large
> quantities of objects smaller than 4KB.
>
> Is it more advisable to:
>
> * work on Ceph internals to make it friendly to this particular workload, or
> * write another implementation of "Finding a needle in Haystack:
>   Facebook’s photo storage"[9] based on RBD[11]?
>
> I'm currently leaning toward working on Ceph internals but there are
> pros and cons to both approaches[12]. And since all this is still
> very new to me, there is also the possibility that I'm missing
> something. Maybe it's *super* difficult to improve Ceph in this way.
> I should try to figure that out sooner rather than later.
>
> I realize it's a lot to take in, and unless you're facing the exact
> same problem there is very little chance you read this far :-) But if
> you did... I'm *really* interested to hear what you think. In any
> case I'll report back to this thread once a decision has been made.
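Coming back to the Haystack-style workaround quoted above: the needle index that paper describes is conceptually tiny. A minimal sketch, assuming append-only volume files (which could sit on RBD or CephFS) and purely illustrative names, would be roughly:

```python
# Minimal sketch of a Haystack-style "needle" index: object payloads are
# appended to large volume files and a small index maps object id ->
# (volume, offset, length). Names and layout here are illustrative, not
# taken from any real implementation; in Haystack/EOS/seaweedfs this map
# lives in a database or dedicated index service, not a Python dict.
from dataclasses import dataclass

@dataclass
class Needle:
    volume: str   # path of the append-only volume file
    offset: int   # byte offset of the payload inside the volume
    length: int   # payload size in bytes

index: dict[bytes, Needle] = {}

def put(object_id: bytes, payload: bytes, volume_path: str) -> None:
    """Append the payload to the volume and record where it landed."""
    with open(volume_path, "ab") as vol:
        offset = vol.tell()          # current end of the volume file
        vol.write(payload)
    index[object_id] = Needle(volume_path, offset, len(payload))

def get(object_id: bytes) -> bytes:
    """Look up the needle and read the payload back from the volume."""
    n = index[object_id]
    with open(n.volume, "rb") as vol:
        vol.seek(n.offset)
        return vol.read(n.length)
```

The read and write paths are trivial; the hard part is the index itself, which is exactly where the write bottleneck you mention shows up.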
> Cheers
>
> [0] https://eos-web.web.cern.ch/eos-web/
> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/
>     https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
> [2] https://forge.softwareheritage.org/T3054
> [3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/src/common/options.cc#L4326-L4330
> [4] https://forge.softwareheritage.org/T3052#58864
> [5] https://forge.softwareheritage.org/T3052#58917
> [6] https://forge.softwareheritage.org/T3052#58876
> [7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HINT_FLAG_IMMUTABLE
> [8] https://forge.softwareheritage.org/T3055
> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
> [11] https://forge.softwareheritage.org/T3049
> [12] https://forge.softwareheritage.org/T3054#58977
>
> --
> Loïc Dachary, Artisan Logiciel Libre

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136