Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake


 



Aaaand another dead end: there is too much metadata involved (bucket and object ACLs, lifecycle configuration, policies, …) for it to be possible to migrate everything perfectly. Also, lifecycles _might_ be affected if mtimes change.

So, I’m going to try to go back to a single-cluster multi-zone setup. For that I’m going to change all buckets with explicit placements and remove the explicit placement markers (those were created by old versions of Ceph, weren’t intentional on our part, and exactly mirror the default placement configuration).

Here’s the patch I’m going to try on top of our Nautilus branch now:
https://github.com/flyingcircusio/ceph/commit/b3a317987e50f089efc4e9694cf6e3d5d9c23bd5

All our buckets with explicit placements conform perfectly to the default placement, so this seems safe.
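
As a sanity check before patching, something along these lines should confirm that no bucket deviates from the default placement pools (a rough sketch; the metadata JSON path is from memory and may differ slightly between versions):

  for b in $(radosgw-admin bucket list | jq -r '.[]'); do
      echo "=== $b"
      radosgw-admin metadata get bucket:"$b" | jq '.data.bucket.explicit_placement'
  done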

Otherwise the zone migration went perfectly until I noticed the objects with explicit placements in our staging and production clusters. (The dev cluster seems to have been purged in the meantime, so this wasn’t noticed there.)

I’m actually wondering whether explicit placements are a sensible thing to have at all, even in multi-cluster multi-zone setups. AFAICT, due to realms, you might end up with different zonegroups referring to the same pools, and that should really only happen through proper abstractions … o_O

Cheers,
Christian

> On 14. Jun 2023, at 17:42, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> 
> Hi,
> 
> further note to self and for posterity … ;)
> 
> This turned out to be a no-go as well, because you can’t silently switch the pools to a different storage class: the objects will be found, but the index still refers to the old storage class and lifecycle migrations won’t work.
> 
> I’ve brainstormed further options, and it appears that the last resort is to use placement targets and copy the buckets explicitly - twice, because on Nautilus I don’t have renames available yet. :(
> 
> This will require temporary downtimes during which users cannot access their buckets. Fortunately we only have a few very large buckets (200T+) that will take a while to copy. We can pre-sync them of course, so the downtime will only cover the second copy.
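> 
> For a single bucket that would look roughly like this (a sketch only; endpoint, bucket and placement-target names are placeholders, and the LocationConstraint syntax for selecting a placement target still needs verifying against our zonegroup):
> 
>   # create the target bucket on the new placement target
>   aws --endpoint-url https://rgw.example.com s3api create-bucket --bucket mybucket-new \
>       --create-bucket-configuration LocationConstraint=default:new-placement
>   # pre-sync while the source is still live, then re-run during the downtime
>   aws --endpoint-url https://rgw.example.com s3 sync s3://mybucket s3://mybucket-new
> 
> (Bucket ACLs and similar metadata wouldn’t come along automatically with a plain sync, so those would need to be carried over separately.)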
> 
> Christian
> 
>> On 13. Jun 2023, at 14:52, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
>> 
>> Following up to myself and for posterity:
>> 
>> I’m going to try to perform a switch here using (temporary) storage classes and pool renames, so that I can quickly point the STANDARD class at a better EC pool and have new objects land there. After that we’ll add (temporary) lifecycle rules to all buckets to ensure their existing objects get migrated to the STANDARD class.
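>> 
>> Roughly what I have in mind (a sketch; zone, placement and pool names are placeholders, and the period commit only applies if a realm/period is in use):
>> 
>>   # rename the old replicated data pool out of the way
>>   ceph osd pool rename default.rgw.buckets.data default.rgw.buckets.data.replicated
>>   # make it reachable again under a temporary storage class
>>   radosgw-admin zonegroup placement add --rgw-zonegroup default \
>>       --placement-id default-placement --storage-class OLD
>>   radosgw-admin zone placement add --rgw-zone default \
>>       --placement-id default-placement --storage-class OLD \
>>       --data-pool default.rgw.buckets.data.replicated
>>   # point STANDARD at the new EC pool so new objects land there
>>   radosgw-admin zone placement modify --rgw-zone default \
>>       --placement-id default-placement --storage-class STANDARD \
>>       --data-pool default.rgw.buckets.data.ec
>>   radosgw-admin period update --commit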
>> 
>> Once that is finished we should be able to delete the old pool and the temporary storage class.
>> 
>> First tests appear successful, but I’m struggling a bit to get the bucket rules working: apparently 0 days isn’t a valid rule, and the debug interval setting causes very frequent LC runs but doesn’t seem to move objects just yet. I’ll play around with that setting a bit more; I think I might have tripped something that only wants to process objects every so often, and with an interval of 10 a day is still 2.4 hours …
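>> 
>> For reference, roughly what I’m testing with per bucket (a sketch; endpoint and bucket name are placeholders, and whether RGW is happy with STANDARD as a transition target here is exactly what I’m still verifying):
>> 
>>   # lc.json:
>>   #   { "Rules": [ { "ID": "move-to-standard", "Status": "Enabled",
>>   #                  "Filter": { "Prefix": "" },
>>   #                  "Transitions": [ { "Days": 1, "StorageClass": "STANDARD" } ] } ] }
>>   aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
>>       --bucket mybucket --lifecycle-configuration file://lc.json
>> 
>>   # in ceph.conf for the RGWs, to speed up testing:
>>   #   rgw_lc_debug_interval = 10            # a lifecycle "day" becomes 10 seconds
>>   #   rgw_lifecycle_work_time = 00:00-23:59 # let LC run around the clock
>>   radosgw-admin lc list      # per-bucket lifecycle status
>>   radosgw-admin lc process   # kick a lifecycle run manually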
>> 
>> Cheers,
>> Christian
>> 
>>> On 9. Jun 2023, at 11:16, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> Hi,
>>> 
>>> we are running a cluster that has been alive for a long time and we tread carefully regarding updates. We are still lagging a bit: the cluster (which started around Firefly) is currently on Nautilus. We are updating, and we know we’re still behind, but we keep running into challenges along the way that are typically still unfixed on main, so - as I said at the start - we have to tread carefully.
>>> 
>>> Nevertheless, mistakes happen, and we found ourselves in this situation: we converted our RGW data pool from replicated (n=3) to erasure coded (k=10, m=3, with 17 hosts), but when selecting the EC profile we missed that our hosts are not evenly sized (this is a growing cluster: some machines have around 20TiB of capacity for the RGW data pool, whereas newer machines have around 160TiB), and we should rather have gone with k=4, m=3. In any case, having 13 chunks causes too many hosts to participate in each object. Going for k+m=7 will allow distribution to be more effective, as we have 7 hosts with the 160TiB sizing.
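>>> 
>>> For reference, the profile we should have gone with looks roughly like this (a sketch; pool name and PG count are placeholders):
>>> 
>>>   ceph osd erasure-code-profile set rgw-k4m3 k=4 m=3 crush-failure-domain=host
>>>   ceph osd pool create default.rgw.buckets.data.new 1024 1024 erasure rgw-k4m3
>>>   ceph osd pool application enable default.rgw.buckets.data.new rgw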
>>> 
>>> Our original migration used the “cache tiering” approach, but that only works once, when moving from replicated to EC, and cannot be used for further migrations.
>>> 
>>> The amount of data (215TiB) is somewhat significant, so we need an approach that scales when copying data[1] to avoid ending up with months of migration.
>>> 
>>> I’ve run out of ideas for doing this at a low level (i.e. trying to fix it at the rados/pool level) and I guess we can only fix this at the application level using multi-zone replication.
>>> 
>>> I have the setup nailed down in general, but I’m running into issues with buckets in our staging and production environments that have `explicit_placement` pools attached. AFAICT this is an outdated mechanism, but there are no migration tools around. I’ve seen some people talk about patched versions of `radosgw-admin metadata put`, which (still) prohibits removing explicit placements.
>>> 
>>> AFAICT those explicit placements will be synced to the secondary zone and the effect that I’m seeing underpins that theory: the sync runs for a while and only a few hundred objects show up in the new zone, as the buckets/objects are already found in the old pool that the new zone uses due to the explicit placement rule.
>>> 
>>> I’m currently running out of ideas, but open to any other options.
>>> 
>>> Looking at https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z I’m wondering whether the relevant patch is available somewhere, or whether I’ll have to try building that patch again on my own.
>>> 
>>> Going through the docs and the code I’m wondering whether `explicit_placement` is really a crufty leftover that newer clusters won’t use, but that older clusters don’t have a real option to get away from?
>>> 
>>> In my specific case, the placement rules are identical to the explicit placements stored on the (apparently older) buckets, and the only thing I need to do is remove them. I can accept a bit of downtime to avoid race conditions if needed, so a small tool that just removes those entries while all RGWs are down would be fine. A call to `radosgw-admin bucket stat` takes about 18s for all buckets in production, and I guess that is a good indication of the timing to expect when running an update on the metadata.
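>>> 
>>> In pseudo-shell, such a tool would do something like this (a sketch only: it assumes a radosgw-admin patched to actually accept clearing explicit_placement on metadata put, the JSON paths are from memory, and all RGWs are stopped while it runs):
>>> 
>>>   for b in $(radosgw-admin bucket list | jq -r '.[]'); do
>>>       id=$(radosgw-admin metadata get bucket:"$b" | jq -r '.data.bucket.bucket_id')
>>>       radosgw-admin metadata get bucket.instance:"$b":"$id" \
>>>           | jq '.data.bucket_info.bucket.explicit_placement = {"data_pool": "", "data_extra_pool": "", "index_pool": ""}' \
>>>           > /tmp/bucket.json
>>>       radosgw-admin metadata put bucket.instance:"$b":"$id" < /tmp/bucket.json
>>>   done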
>>> 
>>> I’ll also be in touch with colleagues from Heinlein and 42on but I’m open to other suggestions.
>>> 
>>> Hugs,
>>> Christian
>>> 
>>> [1] We currently have 215TiB of data in 230M objects. Using the “official” “cache-flush-evict-all” approach was infeasible here as it only yielded around 50MiB/s. Using cache limits and targeting the cache sizes towards 0 caused proper parallelization and allowed us to flush/evict at an almost constant 1GiB/s in the cluster.
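>>> 
>>> Concretely, instead of running `rados -p <cache-pool> cache-flush-evict-all` from a single client, we lowered the cache targets so the OSD-side tiering agents did the flushing and evicting in parallel. Roughly (a sketch from memory; pool name and exact values are placeholders):
>>> 
>>>   ceph osd pool set rgw.buckets.data.cache cache_target_dirty_ratio 0.0
>>>   ceph osd pool set rgw.buckets.data.cache cache_target_full_ratio 0.01
>>>   ceph osd pool set rgw.buckets.data.cache target_max_bytes 1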
>>> 
>>> 
>> 
>> Kind regards,
>> Christian Theune
>> 
> 
> Kind regards,
> Christian Theune
> 

Kind regards,
Christian Theune

-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





