Hi,

we are running a cluster that has been alive for a long time, and we tread carefully regarding updates. We are still lagging a bit: our cluster (which started around Firefly) is currently at Nautilus. We are updating and we know we're still behind, but we keep running into challenges along the way that are typically still unfixed on main, so, as I started with, we have to tread carefully.

Nevertheless, mistakes happen, and we found ourselves in this situation: we converted our RGW data pool from replicated (n=3) to erasure coded (k=10, m=3, with 17 hosts). When doing the EC profile selection we missed that our hosts are not evenly balanced (this is a growing cluster; some machines have around 20TiB capacity for the RGW data pool, whereas newer machines have around 160TiB), and we rather should have gone with k=4, m=3. In any case, having 13 chunks causes too many hosts to participate in each object. Going for k+m=7 will allow distribution to be more effective, as we have 7 hosts with the 160TiB sizing.

Our original migration used the "cache tiering" approach, but that only works once, when moving from replicated to EC, and cannot be used for further migrations. The amount of data (215TiB) is somewhat significant, so we need an approach that scales when copying data[1] to avoid ending up with months of migration.

I've run out of ideas doing this on a low level (i.e. trying to fix it on a rados/pool level), and I guess we can only fix this on an application level using multi-zone replication. I have the setup nailed in general, but I'm running into issues with buckets in our staging and production environments that have `explicit_placement` pools attached. AFAICT this is an outdated mechanism, but there are no migration tools around. I've seen some people talk about patched versions of `radosgw-admin metadata put`, which (unpatched) still prohibits removing explicit placements.
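For anyone following along, the trade-off between the two profiles is just k/m arithmetic; a minimal sketch (nothing Ceph-specific, numbers taken from the profiles above):

```python
# Compare the EC profiles discussed above: k=10,m=3 (current) vs k=4,m=3 (proposed),
# with failure domain = host, i.e. one chunk per host.
def ec_stats(k, m, hosts):
    chunks = k + m                      # shards per object, each on a distinct host
    return {
        "chunks": chunks,
        "overhead": round((k + m) / k, 3),      # raw bytes stored per logical byte
        "usable_fraction": round(k / (k + m), 3),
        "hosts_spare": hosts - chunks,  # hosts NOT touched by a given object
    }

current = ec_stats(10, 3, hosts=17)  # 13 chunks: almost every host participates
target = ec_stats(4, 3, hosts=17)    # 7 chunks: fits on the 7 large (160TiB) hosts
print(current)
print(target)
```

The k=4,m=3 profile pays more raw-space overhead (1.75x vs 1.3x) in exchange for the much smaller fan-out, which is the point of the change.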
AFAICT those explicit placements will be synced to the secondary zone, and the effect I'm seeing underpins that theory: the sync runs for a while and only a few hundred objects show up in the new zone, as the buckets/objects are already found in the old pool that the new zone uses due to the explicit placement rule. I'm currently running out of ideas, but am open to any other options.

Looking at https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z I'm wondering whether the relevant patch is available somewhere, or whether I'll have to try building that patch again on my own. Going through the docs and the code, I'm actually wondering whether `explicit_placement` is a really crufty residual piece that won't get used in newer clusters, but that older clusters don't really have an option to get away from?

In my specific case, the placement rules are identical to the explicit placements stored on the (apparently older) buckets, and the only thing I need to do is remove them. I can accept a bit of downtime to avoid any race conditions if needed, so maybe a small tool that just removes those entries while all RGWs are down would be fine. A call to `radosgw-admin bucket stats` takes about 18s for all buckets in production, and I guess that would be a good baseline for what timing to expect when running an update on the metadata.

I'll also be in touch with colleagues from Heinlein and 42on, but I'm open to other suggestions.

Hugs,
Christian

[1] We currently have 215TiB of data in 230M objects. Using the "official" "cache-flush-evict-all" approach was unfeasible here, as it only yielded around 50MiB/s. Setting cache limits and targeting the cache sizes to 0 caused proper parallelization and was able to flush/evict at an almost constant 1GiB/s in the cluster.
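In case it helps the discussion: the JSON-transform half of such a tool is trivial; the hard part remains getting RGW to accept the result (hence the patched `metadata put`). A sketch of the transform, assuming the Nautilus-era layout where the pools live under `data.bucket_info.bucket.explicit_placement` in a `radosgw-admin metadata get bucket.instance:<bucket>:<id>` dump (verify the field names against a real dump first; the pool names below are made up):

```python
import copy

def strip_explicit_placement(meta):
    """Return a copy of a bucket.instance metadata dump with the
    explicit_placement pools blanked out (empty pool name = no explicit
    placement). Assumed layout: data.bucket_info.bucket.explicit_placement."""
    meta = copy.deepcopy(meta)
    ep = meta["data"]["bucket_info"]["bucket"].get("explicit_placement", {})
    for pool_key in ("data_pool", "data_extra_pool", "index_pool"):
        if pool_key in ep:
            ep[pool_key] = ""
    return meta

# Example dump (hypothetical pool names, heavily abbreviated):
dump = {
    "data": {"bucket_info": {"bucket": {
        "name": "mybucket",
        "explicit_placement": {
            "data_pool": "default.rgw.buckets.data",
            "data_extra_pool": "default.rgw.buckets.non-ec",
            "index_pool": "default.rgw.buckets.index",
        },
    }}}
}
cleaned = strip_explicit_placement(dump)
```

The idea would be: stop all RGWs, `metadata get` each bucket instance, run it through something like the above, and feed it back via the patched `metadata put`; but that last step is exactly what stock Nautilus refuses to do, which is why I'm asking about the patch.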
--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx