Re: Jewel -> Luminous upgrade, package install stopped all daemons

On Fri, Sep 15, 2017 at 2:10 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
> I'm glad that worked for you to finish the upgrade.
>
> He has multiple MONs, but all of them are on nodes with OSDs as well.  When
> he updated the packages on the first node, it restarted the MON and all of
> the OSDs.  This is strictly not supported in the Luminous upgrade as the
> OSDs can't be running Luminous code until all of the MONs are running
> Luminous.  I have never seen updating Ceph packages cause a restart of the
> daemons because you need to schedule the restarts and wait until the cluster
> is back to healthy before restarting the next node to upgrade the daemons.
> If upgrading the packages is causing a restart of the Ceph daemons, it is
> most definitely a bug and needs to be fixed.

The current spec file says that unless CEPH_AUTO_RESTART_ON_UPGRADE is
set to "yes", the daemons shouldn't be restarted, but I remember
they did restart in my own testing as well. I see little harm in that,
since the underlying binaries have changed, and for a cluster
running in redundant mode restarting the services shouldn't cause any
issue. But the flag may still be useful for some use cases.
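For reference, the restart gate in the scriptlets boils down to something like the sketch below. This is a simplified paraphrase, not the literal spec file contents; the variable name is real and is normally sourced from /etc/sysconfig/ceph on RPM systems, but the surrounding logic here is illustrative.

```shell
# Simplified sketch of the restart gate in the package scriptlets.
# Not the literal spec file; the variable name is real, the logic
# around it is paraphrased.
CEPH_AUTO_RESTART_ON_UPGRADE="no"   # would normally come from /etc/sysconfig/ceph

if [ "$CEPH_AUTO_RESTART_ON_UPGRADE" = "yes" ]; then
    echo "restarting ceph.target"   # a real node would run: systemctl try-restart ceph.target
else
    echo "leaving daemons running"
fi
```

With the variable unset or "no", the scriptlet is supposed to fall through to the second branch and leave the daemons alone, which is why a restart on upgrade looks like a bug.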


>
> On Fri, Sep 15, 2017 at 4:48 PM David <dclistslinux@xxxxxxxxx> wrote:
>>
>> Happy to report I got everything up to Luminous, used your tip to keep the
>> OSDs running, David, thanks again for that.
>>
>> I'd say this is a potential gotcha for people collocating MONs. It appears
>> that if you're running selinux, even in permissive mode, upgrading the
>> ceph-selinux packages forces a restart on all the OSDs. You're left with a
>> load of OSDs down that you can't start as you don't have a Luminous mon
>> quorum yet.
>>
>>
>> On 15 Sep 2017 4:54 p.m., "David" <dclistslinux@xxxxxxxxx> wrote:
>>
>> Hi David
>>
>> I like your thinking! Thanks for the suggestion. I've got a maintenance
>> window later to finish the update so will give it a try.
>>
>>
>> On Thu, Sep 14, 2017 at 6:24 PM, David Turner <drakonstein@xxxxxxxxx>
>> wrote:
>>>
>>> This isn't a great solution, but something you could try.  If you stop
>>> all of the daemons via systemd and start them all in a screen as a manually
>>> running daemon in the foreground of each screen... I don't think that yum
>>> updating the packages can stop or start the daemons.  You could copy and
>>> paste the running command (viewable in ps) to know exactly what to run in
>>> the screens to start the daemons like this.
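(To make the suggestion above concrete: recovering the exact invocation from ps looks roughly like this. The command line shown is simulated so the sketch is self-contained, and the osd id and flags are illustrative, not taken from the cluster in question.)

```shell
# On a real node you would capture the live command line with:
#   ps -o args= -C ceph-osd
# Simulated here so the sketch runs standalone; osd id 3 is made up.
osd_cmd='/usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph'

# The -f flag keeps the daemon in the foreground, which is exactly what
# you want when running it inside a screen session: the package upgrade
# can't stop a process systemd isn't managing.
echo "$osd_cmd"
```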
>>>
>>> On Wed, Sep 13, 2017 at 6:53 PM David <dclistslinux@xxxxxxxxx> wrote:
>>>>
>>>> Hi All
>>>>
>>>> I did a Jewel -> Luminous upgrade on my dev cluster and it went very
>>>> smoothly.
>>>>
>>>> I've attempted to upgrade on a small production cluster but I've hit a
>>>> snag.
>>>>
>>>> After installing the ceph 12.2.0 packages with "yum install ceph" on the
>>>> first node and accepting all the dependencies, I found that all the OSD
>>>> daemons, the MON and the MDS running on that node were terminated. Systemd
>>>> appears to have attempted to restart them all but the daemons didn't start
>>>> successfully (not surprising as first stage of upgrading all mons in cluster
>>>> not completed). I was able to start the MON and it's running. The OSDs are
>>>> all down and I'm reluctant to attempt to start them without upgrading the
>>>> other MONs in the cluster. I'm also reluctant to attempt upgrading the
>>>> remaining 2 MONs without understanding what happened.
>>>>
>>>> The cluster is on Jewel 10.2.5 (as was the dev cluster)
>>>> Both clusters running on CentOS 7.3
>>>>
>>>> The only obvious difference I can see between the dev and production is
>>>> the production has selinux running in permissive mode, the dev had it
>>>> disabled.
>>>>
>>>> Any advice on how to proceed at this point would be much appreciated.
>>>> The cluster is currently functional, but I have 1 node out 4 with all OSDs
>>>> down. I had noout set before the upgrade and I've left it set for now.
>>>>
>>>> Here's the journalctl right after the packages were installed (hostname
>>>> changed):
>>>>
>>>> https://pastebin.com/fa6NMyjG
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>>
>


