Slightly different, yes. It should be configurable (and not the default).

On Thu, May 19, 2022 at 1:13 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> This would mean implementing a different MDS upgrade/restart sequence
> in cephadm, right?
>
> On Thu, May 19, 2022, 9:47 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>>
>> Hi Dan,
>>
>> On Wed, May 18, 2022 at 10:44 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>> >
>> > Dear CephFS devs,
>> >
>> > We noticed a new warning about multi-MDS upgrades linked at the top of
>> > https://docs.ceph.com/en/latest/releases/pacific/#upgrading-from-octopus-or-nautilus
>> > (the relevant tracker and PR are at
>> > https://tracker.ceph.com/issues/53597 and
>> > https://github.com/ceph/ceph/pull/44335).
>> >
>> > Motivated by that issue, and by our operational experience, I'd like
>> > to propose standardizing a multi-MDS upgrade procedure that does not
>> > require decreasing max_mds to 1.
>> >
>> > 1. First, I'm aware that the current upgrade procedure exists because
>> > MDS-to-MDS communication is not versioned, so all MDSs need to run
>> > the same version. I realize this means we cannot restart the MDSs
>> > one by one.
>> >
>> > 2. Based on our operational experience, clusters with a workload
>> > requiring several active MDSs can't easily decrease to one MDS: a
>> > single MDS can't handle the full metadata load, not to mention the
>> > extra export/import work needed to shrink from many ranks to one.
>> > (Our largest CephFS has 4 MDSs, each with 100 GB of active cache and
>> > all of them busy -- decreasing to 1 MDS would be highly disruptive
>> > to our users.)
>> >
>> > 3. With Patrick's agreement, when we upgraded that cluster from
>> > Nautilus to Octopus several months ago we *did not* decrease to a
>> > single active MDS: we upgraded the rpms on all MDS hosts, then
>> > stopped all actives and standbys, then started them all. The
>> > "downtime" was roughly equivalent to restarting a single MDS. This
>> > should be easily orchestratable via cephadm.
>> >
>> > What do you think about validating, testing, and documenting (3)?
>> > IMHO this would make large CephFS cluster upgrades much less scary!
>>
>> My only concern is that this upgrade procedure is not really tested at
>> all (beyond your experience). We should of course make sure this is
>> well tested in teuthology before it can become any kind of default
>> upgrade behavior. I've created a ticket here:
>> https://tracker.ceph.com/issues/55715

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
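
For concreteness, the documented procedure that Dan's points (1) and (2)
argue is painful for busy clusters boils down to shrinking the filesystem
to a single active rank before upgrading. A rough sketch using standard
Ceph CLI commands -- "cephfs" is a placeholder filesystem name, and
max_mds 4 matches Dan's example cluster:

    # reduce to a single active MDS; rank 0 must absorb the other
    # ranks' metadata load and their exported subtrees
    ceph fs set cephfs max_mds 1
    # wait until only rank 0 remains active before upgrading
    ceph fs status cephfs
    # ... upgrade and restart the lone active MDS and the standbys ...
    # then restore the original number of active ranks
    ceph fs set cephfs max_mds 4

The export/import work Dan mentions happens during the first step: every
rank above 0 must hand its subtrees back to rank 0 before it can stop.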
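The "stop everything, then start everything" sequence described in (3)
could look roughly like the following on a package-based (non-cephadm)
deployment; the systemd target name and package manager are assumptions
that vary by distro and release:

    # on every MDS host: install the new packages; the running
    # daemons keep the old binaries until restarted
    dnf update -y ceph-mds
    # then, on every MDS host in quick succession: stop all MDS daemons
    systemctl stop ceph-mds.target
    # once every active and standby MDS is confirmed down,
    # start them all again on the new version
    systemctl start ceph-mds.target

The key property is that no two MDS versions ever run concurrently: the
whole set goes down together and comes back together, so the unversioned
MDS-to-MDS messaging from point (1) is never exercised across versions.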
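Making this configurable in cephadm, per the reply at the top of the
thread, might eventually look like a single opt-in flag consulted by the
upgrade sequence. The flag name and wiring below are purely illustrative
assumptions, not an existing interface at the time of this thread:

    # hypothetical opt-in: tell the orchestrator to take the whole MDS
    # set down and upgrade it at once instead of reducing max_mds to 1
    ceph config set mgr mgr/orchestrator/fail_fs true
    ceph orch upgrade start --ceph-version <target-version>

Keeping it off by default preserves the conservative, well-tested
max_mds=1 path until the alternative is validated in teuthology, as
Patrick's ticket (https://tracker.ceph.com/issues/55715) proposes.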