Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake

If you create a new CRUSH rule for ssd/nvme/hdd and attach it to the
existing pool, you should be able to do the migration seamlessly while
everything stays online... However, the impact on users will depend on
storage device load and network utilization, as it will create quite a bit
of chaos on the cluster network.
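
For illustration, roughly what I mean (rule and pool names are just
examples, assuming an hdd device class):

  # rule restricted to one device class, failure domain host
  ceph osd crush rule create-replicated rgw-data-hdd default host hdd
  # point the existing pool at the new rule; backfill runs while online
  ceph osd pool set default.rgw.buckets.data crush_rule rgw-data-hdd

For an EC pool the rule comes from an erasure-code profile instead
(ceph osd crush rule create-erasure), but note that k/m of an existing
pool can not be changed this way.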

Or did I get something wrong?




Kind regards,
Nino


On Wed, Jun 14, 2023 at 5:44 PM Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:

> Hi,
>
> further note to self and for posterity … ;)
>
> This turned out to be a no-go as well, because you can’t silently switch
> the pools to a different storage class: the objects will be found, but the
> index still refers to the old storage class and lifecycle migrations won’t
> work.
>
> I’ve brainstormed for further options and it appears that the last resort
> is to use placement targets and copy the buckets explicitly - twice,
> because on Nautilus I don’t have renames available, yet. :(
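>
> The rough shape of that would be something like this (the placement id and
> pool names below are placeholders for our real ones):
>
> radosgw-admin zonegroup placement add --rgw-zonegroup default \
>     --placement-id ec-placement
> radosgw-admin zone placement add --rgw-zone default \
>     --placement-id ec-placement \
>     --data-pool default.rgw.buckets.data.ec \
>     --index-pool default.rgw.buckets.index \
>     --data-extra-pool default.rgw.buckets.non-ec
> radosgw-admin period update --commit   # if a realm/period is in use
> # new buckets pick the target up via the user’s default_placement or the
> # S3 LocationConstraint; existing data then gets copied with an S3 client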
>
> This will require temporary downtimes during which users cannot access
> their buckets. Fortunately we only have a few very large buckets (200T+)
> that will take a while to copy. We can pre-sync them of course, so the
> downtime will only be during the second copy.
>
> Christian
>
> > On 13. Jun 2023, at 14:52, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> >
> > Following up to myself and for posterity:
> >
> > I’m going to try to perform a switch here using (temporary) storage
> classes and renaming of the pools to ensure that I can quickly change the
> STANDARD class to a better EC pool and have new objects located there.
> After that we’ll add (temporary) lifecycle rules to all buckets to ensure
> their objects will be migrated to the STANDARD class.
> >
> > Once that is finished we should be able to delete the old pool and the
> temporary storage class.
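> >
> > Roughly, the storage-class side of this looks as follows (pool names and
> > the TEMP_OLD class name are placeholders for our real ones):
> >
> > # keep the old (renamed) replicated pool reachable via a temporary class
> > radosgw-admin zonegroup placement add --rgw-zonegroup default \
> >     --placement-id default-placement --storage-class TEMP_OLD
> > radosgw-admin zone placement add --rgw-zone default \
> >     --placement-id default-placement --storage-class TEMP_OLD \
> >     --data-pool default.rgw.buckets.data.old
> > # point STANDARD at the new EC pool so new writes land there
> > radosgw-admin zone placement modify --rgw-zone default \
> >     --placement-id default-placement --storage-class STANDARD \
> >     --data-pool default.rgw.buckets.data.ec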
> >
> > First tests appear successful, but I’m struggling a bit to get the
> bucket rules working (apparently 0 days isn’t a real rule … and the debug
> interval setting causes very frequent LC runs but doesn’t seem to move
> objects just yet). I’ll play around with that setting a bit more, though;
> I think I might have tripped something that only wants to process objects
> every so often, and with an interval of 10 a day is still 2.4 hours …
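> >
> > The transition rule I’m testing looks roughly like this (rule id and the
> > 1-day threshold are just for illustration), applied per bucket with any
> > S3 client, e.g. aws s3api put-bucket-lifecycle-configuration:
> >
> > {
> >   "Rules": [{
> >     "ID": "migrate-to-standard",
> >     "Status": "Enabled",
> >     "Filter": {"Prefix": ""},
> >     "Transitions": [{"Days": 1, "StorageClass": "STANDARD"}]
> >   }]
> > }
> >
> > # on the RGWs, to shorten the LC “day” for testing (ceph.conf):
> > #   rgw_lc_debug_interval = 10
> > # then inspect/trigger with: radosgw-admin lc list / radosgw-admin lc process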
> >
> > Cheers,
> > Christian
> >
> >> On 9. Jun 2023, at 11:16, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> we are running a cluster that has been alive for a long time and we
> tread carefully regarding updates. We are still lagging a bit and our
> cluster (which started around Firefly) is currently at Nautilus. We’re
> updating and we know we’re still behind, but we do keep running into
> challenges along the way that typically are still unfixed on main and - as
> I started with - require us to tread carefully.
> >>
> >> Nevertheless, mistakes happen, and we found ourselves in this
> situation: we converted our RGW data pool from replicated (n=3) to erasure
> coded (k=10, m=3, with 17 hosts), but when doing the EC profile selection
> we missed that our hosts are not evenly balanced (this is a growing cluster
> and some machines have around 20TiB capacity for the RGW data pool, whereas
> newer machines have around 160TiB), and we should rather have gone with
> k=4, m=3. In any case, having 13 chunks causes too many hosts to
> participate in each object. Going for k+m=7 will allow distribution to be
> more effective as we have 7 hosts that have the 160TiB sizing.
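> >>
> >> For the record, the profile we now think is right would be created along
> >> these lines (profile and rule names are made up, hdd device class assumed):
> >>
> >> ceph osd erasure-code-profile set rgw-ec-4-3 k=4 m=3 \
> >>     crush-failure-domain=host crush-device-class=hdd
> >> ceph osd crush rule create-erasure rgw-data-ec-4-3 rgw-ec-4-3
> >>
> >> The catch is that the profile of an existing EC pool can’t be changed in
> >> place; a new pool is needed, which is what the rest of this mail is about.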
> >>
> >> Our original migration used the “cache tiering” approach, but that only
> works once when moving from replicated to EC and can not be used for
> further migrations.
> >>
> >> The amount of data is, at 215TiB, somewhat significant, so we need an
> approach that scales when copying data[1] to avoid ending up with months of
> migration.
> >>
> >> I’ve run out of ideas for doing this on a low level (i.e. trying to fix
> it on a rados/pool level) and I guess we can only fix this on an application
> level using multi-zone replication.
> >>
> >> I have the setup nailed in general, but I’m running into issues with
> buckets in our staging and production environments that have
> `explicit_placement` pools attached; AFAICT this is an outdated mechanism,
> but there are no migration tools around. I’ve seen some people talk about
> patched versions of `radosgw-admin metadata put`, since the stock variant
> (still) prohibits removing explicit placements.
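> >>
> >> For context, this is roughly how those entries show up (the pool names
> >> below are the usual defaults, not necessarily ours):
> >>
> >> radosgw-admin metadata get bucket.instance:<bucket>:<instance-id>
> >> # ... the bucket entry then carries something like:
> >> #   "explicit_placement": {
> >> #     "data_pool": "default.rgw.buckets.data",
> >> #     "data_extra_pool": "default.rgw.buckets.non-ec",
> >> #     "index_pool": "default.rgw.buckets.index"
> >> #   }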
> >>
> >> AFAICT those explicit placements will be synced to the secondary zone
> and the effect that I’m seeing underpins that theory: the sync runs for a
> while and only a few hundred objects show up in the new zone, as the
> buckets/objects are already found in the old pool that the new zone uses
> due to the explicit placement rule.
> >>
> >> I’m currently running out of ideas, but am open to any other options.
> >>
> >> Looking at
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z
> I’m wondering whether the relevant patch is available somewhere, or whether
> I’ll have to try building that patch again on my own.
> >>
> >> Going through the docs and the code, I’m wondering whether
> `explicit_placement` is actually a really crufty residual piece that won’t
> get used in newer clusters, but that older clusters don’t really have an
> option to get away from?
> >>
> >> In my specific case, the placement rules are identical to the explicit
> placements that are stored on (apparently older) buckets and the only thing
> I need to do is to remove them. I can accept a bit of downtime to avoid any
> race conditions if needed, so maybe having a small tool to just remove
> those entries while all RGWs are down would be fine. A call to
> `radosgw-admin bucket stats` takes about 18s for all buckets in production
> and I guess that would be a good comparison for what timing to expect when
> running an update on the metadata.
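> >>
> >> The rough shape of such a tool would presumably be something like the
> >> sketch below (with all RGWs stopped, and using the patched radosgw-admin
> >> mentioned above, since the stock one apparently won’t drop the explicit
> >> placement on put):
> >>
> >> for b in $(radosgw-admin metadata list bucket.instance | jq -r '.[]'); do
> >>     radosgw-admin metadata get "bucket.instance:${b}" > "${b}.json"
> >>     # blank out data_pool/index_pool/data_extra_pool in explicit_placement
> >>     # here (jq or similar), then write the result back:
> >>     radosgw-admin metadata put "bucket.instance:${b}" < "${b}.edited.json"
> >> done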
> >>
> >> I’ll also be in touch with colleagues from Heinlein and 42on but I’m
> open to other suggestions.
> >>
> >> Hugs,
> >> Christian
> >>
> >> [1] We currently have 215TiB of data in 230M objects. Using the “official”
> “cache-flush-evict-all” approach was not feasible here as it only yielded
> around 50MiB/s. Using cache limits and targeting the cache sizes towards 0
> caused proper parallelization and was able to flush/evict at an almost
> constant 1GiB/s in the cluster.
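> >>
> >> In case anyone needs it, roughly the knobs involved (the cache pool name
> >> is a placeholder and the values need tuning):
> >>
> >> # the “official”, slow path:
> >> rados -p default.rgw.buckets.data.cache cache-flush-evict-all
> >> # driving the cache targets down instead made the agent work in parallel:
> >> ceph osd pool set default.rgw.buckets.data.cache cache_target_dirty_ratio 0.0
> >> ceph osd pool set default.rgw.buckets.data.cache cache_target_full_ratio 0.0
> >> ceph osd pool set default.rgw.buckets.data.cache target_max_objects 1
> >> ceph osd pool set default.rgw.buckets.data.cache target_max_bytes 1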
> >>
> >>
> >> --
> >> Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
> >> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian
> Zagrodnick
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> > Kind regards,
> > Christian Theune
> >
> > --
> > Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
> > Flying Circus Internet Operations GmbH · https://flyingcircus.io
> > Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> > HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian
> Zagrodnick
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> Kind regards,
> Christian Theune
>
> --
> Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian
> Zagrodnick
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



