Re: How do you handle large Ceph object storage cluster?

pg@xxxxxxxxxxxxxxxxxxxx (Peter Grandi) · Thu, 19 Oct 2023 18:37:30 +0100

> [...] (>10k OSDs, >60 PB of data).

6TB on average per OSD? Hopefully SSDs or RAID10 (or narrow
RAID5 sets of 3-5 drives).

> It is entirely dedicated to object storage with S3 interface.
> Maintenance and its extension are getting more and more
> problematic and time consuming.

Ah the joys of a single large unified storage pool :-).
https://www.sabi.co.uk/blog/0804apr.html?080417#080417

> We consider to split it to two or more completely separate
> clusters

I would suggest doing it 1-2 years ago...

> create S3 layer of abstraction with some additional metadata
> that will allow us to use these 2+ physically independent
> instances as a one logical cluster.

That's what the bucket hierarchy in a Ceph cluster instance
already does. What your layer is going to do is either:

 1) Look up the object ID in a list of instances, and fetch the
    object from the instance that validates the object ID;
 2) Maintain a huge table of all object IDs and which instances
    they are in.

But 1) is basically what CRUSH already does and 2) means giving
up the Ceph "decentralized" philosophy based on CRUSH.
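As a rough illustration of the two options above, here is a
minimal Python sketch. Everything in it is hypothetical: the
Instance class is an in-memory stand-in for one independent
cluster, and fetch_by_probing and CentralIndex are invented names,
not Ceph or RGW APIs. It only shows the shape of the trade-off:
option 1 pays a probe per instance on every read, option 2 pays
for a table that must know about every object.

```python
from typing import Optional


class Instance:
    """Hypothetical stand-in for one independent cluster."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> Optional[bytes]:
        return self._objects.get(key)


def fetch_by_probing(
    instances: list[Instance], key: str
) -> Optional[tuple[str, bytes]]:
    """Option 1: ask each instance in turn until one has the key."""
    for inst in instances:
        data = inst.get(key)
        if data is not None:
            return inst.name, data
    return None


class CentralIndex:
    """Option 2: one table recording which instance holds each key."""

    def __init__(self, instances: list[Instance]) -> None:
        self._by_name = {inst.name: inst for inst in instances}
        self._location: dict[str, str] = {}

    def put(self, key: str, data: bytes, instance_name: str) -> None:
        self._by_name[instance_name].put(key, data)
        self._location[key] = instance_name

    def fetch(self, key: str) -> Optional[tuple[str, bytes]]:
        name = self._location.get(key)
        if name is None:
            # Objects written behind the table's back are invisible.
            return None
        data = self._by_name[name].get(key)
        return None if data is None else (name, data)
```

Note that in this sketch an object written directly to an instance
is found by probing but not by the central table, which is the
consistency burden that option 2 takes on.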

BTW one old practice that so few systems follow is to use as
object keys neither addresses nor identifiers, but *both*: first
access the address, treating it as a hint, and check that the
identifier matches; if it does not, do a slower lookup using the
object identifier part to find the actual address.
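The address-plus-identifier practice can be sketched as follows.
This is only an illustration under stated assumptions: Key and
Store are hypothetical names, addresses are plain integers, and a
dictionary stands in for whatever slower lookup a real system
would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    object_id: str  # stable identifier, always authoritative
    hint: int       # last known address, may be stale


class Store:
    """Hypothetical store that can relocate objects."""

    def __init__(self) -> None:
        # address -> (object_id, data)
        self._slots: dict[int, tuple[str, bytes]] = {}
        # Stand-in for the slow path: object_id -> current address.
        self._index: dict[str, int] = {}
        self.slow_lookups = 0

    def put(self, object_id: str, data: bytes, address: int) -> Key:
        self._slots[address] = (object_id, data)
        self._index[object_id] = address
        return Key(object_id, address)

    def move(self, object_id: str, new_address: int) -> None:
        old_address = self._index[object_id]
        self._slots[new_address] = self._slots.pop(old_address)
        self._index[object_id] = new_address

    def get(self, key: Key) -> tuple[Key, bytes]:
        # Fast path: trust the hint only if the identifier matches.
        slot = self._slots.get(key.hint)
        if slot is not None and slot[0] == key.object_id:
            return key, slot[1]
        # Slow path: find the real address by identifier and hand
        # back a refreshed key so the caller can update its hint.
        self.slow_lookups += 1
        address = self._index[key.object_id]  # KeyError if unknown
        return Key(key.object_id, address), self._slots[address][1]
```

A stale hint costs one slow lookup but never returns the wrong
object, even if another object now occupies the old address.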

> Additionally, newest data is the most demanded data, so we
> have to spread it equally among clusters to avoid skews in
> cluster load.

I usually do the opposite, but that depends on your application.

My practice is to recognize that data is indeed usually
stratified by date, and to regard filesystem instances as
"silos": create a new filesystem instance every few months or
years and direct all new file creation to the latest one. Then
progressively get rid of the older instances, copying their
"active" data onwards into the new instance and their "inactive"
data to offline storage.
http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b
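A minimal sketch of routing by silo, assuming each silo is
identified by the date it was opened. SiloSet and its method names
are hypothetical, chosen only for this example: all new data goes
to the latest silo, and older data is found from its creation
date.

```python
import bisect
from datetime import date


class SiloSet:
    """Hypothetical registry of date-stratified silos."""

    def __init__(self) -> None:
        self._opened: list[date] = []  # ascending opening dates
        self._names: list[str] = []    # silo names, same order

    def open_silo(self, opened_on: date, name: str) -> None:
        if self._opened and opened_on <= self._opened[-1]:
            raise ValueError("silos must be opened in date order")
        self._opened.append(opened_on)
        self._names.append(name)

    def silo_for_new_data(self) -> str:
        # All new objects go to the most recently opened silo.
        return self._names[-1]

    def silo_for_date(self, created_on: date) -> str:
        # The silo that was the latest one on the creation date.
        i = bisect.bisect_right(self._opened, created_on) - 1
        if i < 0:
            raise LookupError("no silo existed on that date")
        return self._names[i]

    def retire_oldest(self) -> str:
        # Call only after the silo's data has been copied onwards
        # or moved offline; this sketch does no copying itself.
        self._opened.pop(0)
        return self._names.pop(0)
```

Lookups by creation date stop resolving to a silo once it has been
retired, so any copied-forward "active" data would need its own
record of where it now lives.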

If you really need to keep all data forever online, which is
usually not the case (that's why there are laws that expire
matters after N years) the second best option is to keep old
silos powered up indefinitely, and they will take very little
attention beyond refreshing the hardware periodically and
migrating the data to new instances when keeping the old ones
running stops being economical.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx