Re: Broke libvirt on compute node due to Ceph Luminous to Nautilus Upgrade

Gotcha.  I’ve been through Grizzly > Havana > Icehouse migrations in the past, with Ceph going Dumpling > Hammer in conjunction, so Queens feels way futuristic to me ;)

I suspect that you will, however painfully, want to do the grand migrate-or-reboot shuffle ahead of the OpenStack migration regardless, so that clients linked against the older libraries don’t hit additional and potentially impactful issues, like the ones older clients had when the mons / mon IP addresses changed.  Maybe ask your users to reboot their own instances within a certain timeframe to debulk what you have to migrate yourself; a rough sketch of one way to find the stragglers is below.
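
To scope that debulking, here is a minimal sketch, not a polished tool, assuming libvirt-launched qemu-system processes and that the package upgrade unlinked the old librbd, so stale mappings show up as "(deleted)" in /proc/<pid>/maps.  The process name pattern and the guest=... extraction from the qemu command line are assumptions about a typical libvirt setup, so adjust for your environment:

#!/usr/bin/env bash
# Run on each compute node: list instances whose qemu process still maps the
# pre-upgrade (Luminous) librbd, i.e. the ones that need a migration or reboot.
for pid in $(pgrep -f qemu-system); do
    if grep -qE 'librbd.*\(deleted\)' "/proc/${pid}/maps" 2>/dev/null; then
        # Pull the libvirt domain name (instance-XXXXXXXX) out of the qemu cmdline.
        name=$(tr '\0' ' ' < "/proc/${pid}/cmdline" | grep -o 'guest=[^, ]*')
        echo "pid ${pid} ${name:-unknown} still maps the old librbd"
    fi
done

Whatever it prints are the candidates for e.g. `nova live-migration <instance>` (or a user-initiated reboot); everything else has already picked up the new client.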

Nautilus itself is EOL from the upstream Ceph perspective, but of course you need to align with the needs and compatibility of your OpenStack.

I hope `nova migrate` uses the scheduler now; I filed a bug report years ago after that rude awakening.

—aad

> On Feb 18, 2025, at 2:07 PM, Pardhiv Karri <meher4india@xxxxxxxxx> wrote:
> 
> Hi Anthony,
> 
> Regarding the need to upgrade Ceph: we are upgrading our current OpenStack
> from Queens (yeah, very old) to Antelope, and the OpenStack vendor required
> us to upgrade Ceph from Luminous to Nautilus because the framework they use
> for the migration/upgrade only works with Nautilus and above.
> 
> --Pardhiv
> 
> 
> On Tue, Feb 18, 2025 at 11:01 AM Pardhiv Karri <meher4india@xxxxxxxxx>
> wrote:
> 
>> Hi Anthony,
>> 
>> Thank you for the reply. Here is the output from the monitor node. The
>> monitor nodes (which also run the managers) and the OSD nodes were rebooted
>> sequentially after the upgrade to Nautilus, so I wonder why they are still
>> showing luminous now. Any way I can fix this?
>> 
>> or1sz2 [root@mon1 ~]# ceph features
>> {
>>    "mon": [
>>        {
>>            "features": "0x3ffddff8ffecffff",
>>            "release": "luminous",
>>            "num": 3
>>        }
>>    ],
>>    "osd": [
>>        {
>>            "features": "0x3ffddff8ffecffff",
>>            "release": "luminous",
>>            "num": 111
>>        }
>>    ],
>>    "client": [
>>        {
>>            "features": "0x3ffddff8ffecffff",
>>            "release": "luminous",
>>            "num": 322
>>        }
>>    ],
>>    "mgr": [
>>        {
>>            "features": "0x3ffddff8ffecffff",
>>            "release": "luminous",
>>            "num": 3
>>        }
>>    ]
>> }
>> or1sz2 [root@mon1 ~]# dpkg -l | grep -i ceph
>> ii  ceph                  14.2.22-1xenial  amd64  distributed storage and file system
>> ii  ceph-base             14.2.22-1xenial  amd64  common ceph daemon libraries and management tools
>> ii  ceph-common           14.2.22-1xenial  amd64  common utilities to mount and interact with a ceph storage cluster
>> ii  ceph-deploy           2.0.1            all    Ceph-deploy is an easy to use configuration tool
>> ii  ceph-mgr              14.2.22-1xenial  amd64  manager for the ceph distributed storage system
>> ii  ceph-mon              14.2.22-1xenial  amd64  monitor server for the ceph storage system
>> ii  ceph-osd              14.2.22-1xenial  amd64  OSD server for the ceph storage system
>> rc  libcephfs1            10.2.11-1trusty  amd64  Ceph distributed file system client library
>> ii  libcephfs2            14.2.22-1xenial  amd64  Ceph distributed file system client library
>> ii  python-ceph-argparse  14.2.22-1xenial  all    Python 2 utility libraries for Ceph CLI
>> ii  python-cephfs         14.2.22-1xenial  amd64  Python 2 libraries for the Ceph libcephfs library
>> ii  python-rados          14.2.22-1xenial  amd64  Python 2 libraries for the Ceph librados library
>> ii  python-rbd            14.2.22-1xenial  amd64  Python 2 libraries for the Ceph librbd library
>> ii  python-rgw            14.2.22-1xenial  amd64  Python 2 libraries for the Ceph librgw library
>> or1sz2 [root@or1dra1300 ~]#
>> 
>> Thanks,
>> Pardhiv
>> 
>> 
>> 
>> 
>> On Tue, Feb 18, 2025 at 10:55 AM Anthony D'Atri <anthony.datri@xxxxxxxxx>
>> wrote:
>> 
>>> This is one of the pitfalls of package-based installs.  This dynamic with
>>> Nova and other virtualization systems has been well-known for at least a
>>> dozen years.
>>> 
>>> I would not expect a Luminous client (i.e. librbd / librados) to have an
>>> issue, though — it should be able to handle pg-upmap.  If you have a
>>> reference indicating the need to update to the Nautilus client, please send
>>> it along.
>>> 
>>> I wonder if you have clients that are actually older than Luminous; that
>>> could cause problems.
>>> 
>>> Cf https://tracker.ceph.com/issues/13301
>>> 
>>> Run `ceph features`, which should give you client info.  An unfortunate
>>> wrinkle is that in the case of pg-upmap, some clients may report “jewel”
>>> but their feature bitmaps actually indicate compatibility with pg-upmap.
>>> If you see clients that are pre-Luminous, focus restarts and migrations on
>>> those.
>>> 
>>> OpenStack components themselves sometimes have dependencies on Ceph
>>> versions, so I would look at those and at libvirt itself as well.
>>> 
>>> On Feb 18, 2025, at 1:48 PM, Pardhiv Karri <meher4india@xxxxxxxxx> wrote:
>>> 
>>> Hi,
>>> 
>>> We recently upgraded our Ceph from Luminous to Nautilus and upgraded the
>>> Ceph clients on the OpenStack side (using rbd). All went well, but after a
>>> few days we started seeing instances randomly get stuck in
>>> libvirt_qemu_exporter, which in turn hangs libvirt on the OpenStack compute
>>> nodes. We had to kill those instances' processes, after which libvirt
>>> recovered, but the issue keeps recurring on the compute nodes with other
>>> instances. Upon doing some research, I found that we need to migrate the
>>> instances so they pick up the latest (Nautilus) Ceph client, as they are
>>> still using the old (Luminous) client they were spun up with. The only way
>>> to get them onto the Nautilus client is to live migrate or reboot them. We
>>> have thousands of instances, and doing either of those takes a long time
>>> if we are to avoid impacting customers. Is there any other fix for this
>>> issue that avoids migrating or rebooting the instances?
>>> 
>>> Error on compute hosts: (renamed host and instance id)
>>> 
>>> Feb 18 00:08:00 cmp03 libvirtd[5362]: 2025-02-18 00:08:00.510+0000: 5627:
>>> warning : qemuDomainObjBeginJobInternal:4933 : Cannot start job (query,
>>> none) for domain instance-009141b8; current job is (query, none) owned by
>>> (5628 remoteDispatchDomainBlockStats, 0 <null>) for (322330s, 0s)
>>> 
>>> Thanks,
>>> Pardhiv
>>> 
>>> 
>>> 
>> 
>> --
>> *Pardhiv Karri*
>> "Rise and Rise again until LAMBS become LIONS"
>> 
>> 
>> 
> 
> -- 
> *Pardhiv Karri*
> "Rise and Rise again until LAMBS become LIONS"
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



