Hi Anthony,

Thank you for the reply. Here is the output from the monitor node. The
monitor nodes (which also run the managers) and the OSD nodes were rebooted
sequentially after the upgrade to Nautilus, so I wonder why they are still
showing luminous now. Is there any way I can fix this?

or1sz2 [root@mon1 ~]# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 3
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 111
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 322
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 3
        }
    ]
}

or1sz2 [root@mon1 ~]# dpkg -l | grep -i ceph
ii  ceph                  14.2.22-1xenial  amd64  distributed storage and file system
ii  ceph-base             14.2.22-1xenial  amd64  common ceph daemon libraries and management tools
ii  ceph-common           14.2.22-1xenial  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-deploy           2.0.1            all    Ceph-deploy is an easy to use configuration tool
ii  ceph-mgr              14.2.22-1xenial  amd64  manager for the ceph distributed storage system
ii  ceph-mon              14.2.22-1xenial  amd64  monitor server for the ceph storage system
ii  ceph-osd              14.2.22-1xenial  amd64  OSD server for the ceph storage system
rc  libcephfs1            10.2.11-1trusty  amd64  Ceph distributed file system client library
ii  libcephfs2            14.2.22-1xenial  amd64  Ceph distributed file system client library
ii  python-ceph-argparse  14.2.22-1xenial  all    Python 2 utility libraries for Ceph CLI
ii  python-cephfs         14.2.22-1xenial  amd64  Python 2 libraries for the Ceph libcephfs library
ii  python-rados          14.2.22-1xenial  amd64  Python 2 libraries for the Ceph librados library
ii  python-rbd            14.2.22-1xenial  amd64  Python 2 libraries for the Ceph librbd library
ii  python-rgw            14.2.22-1xenial  amd64  Python 2 libraries for the Ceph librgw library
or1sz2 [root@or1dra1300 ~]#

Thanks,
Pardhiv

On Tue, Feb 18, 2025 at 10:55 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> This is one of the pitfalls of package-based installs. This dynamic with
> Nova and other virtualization systems has been well known for at least a
> dozen years.
>
> I would not expect a Luminous client (i.e. librbd / librados) to have an
> issue, though; it should be able to handle pg-upmap. If you have a
> reference indicating the need to update to the Nautilus client, please
> send it along.
>
> I wonder if you have clients that are actually older than Luminous; those
> could cause problems.
>
> Cf. https://tracker.ceph.com/issues/13301
>
> Run `ceph features`, which should give you client info. An unfortunate
> wrinkle is that in the case of pg-upmap, some clients may report "jewel"
> but their feature bitmaps actually indicate compatibility with pg-upmap.
> If you see clients that are pre-Luminous, focus restarts and migrations
> on those.
>
> OpenStack components themselves sometimes have dependencies on Ceph
> versions, so I would look at those and at libvirt itself as well.
>
> On Feb 18, 2025, at 1:48 PM, Pardhiv Karri <meher4india@xxxxxxxxx> wrote:
>
> Hi,
>
> We recently upgraded our Ceph cluster from Luminous to Nautilus and
> upgraded the Ceph clients on OpenStack (using rbd). All went well, but
> after a few days we randomly saw instances getting stuck when queried by
> libvirt_qemu_exporter, which hangs libvirt on the OpenStack compute
> nodes. We had to kill those instance processes, after which libvirt
> recovered, but the issue keeps happening on the compute nodes with other
> instances.
>
> Upon doing some research, I found that we need to migrate the instances
> to the latest (Nautilus) Ceph client, as they are still using the old
> (Luminous) client from when they were spun up. The only way to get them
> onto the Nautilus client is to live-migrate or reboot them. We have
> thousands of instances, and doing either without impacting customers
> takes a long time. Is there any other way to solve this issue without
> migrating or rebooting the instances?
>
> Error on the compute hosts (host and instance id renamed):
>
> Feb 18 00:08:00 cmp03 libvirtd[5362]: 2025-02-18 00:08:00.510+0000: 5627:
> warning : qemuDomainObjBeginJobInternal:4933 : Cannot start job (query,
> none) for domain instance-009141b8; current job is (query, none) owned by
> (5628 remoteDispatchDomainBlockStats, 0 <null>) for (322330s, 0s)
>
> Thanks,
> Pardhiv

--
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"
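
A rough way to act on the suggestion above to focus on specific clients is
to dump a monitor's open sessions, which list each connected client along
with the feature release it negotiated. This is only a sketch, assuming the
monitor daemon on mon1 is named mon.mon1 and that its admin socket is in
the default location; the session dump format also differs between
releases, so the greps are just a coarse count and filter.

# Run on the monitor host: dump the sessions currently open against mon.mon1
sudo ceph daemon mon.mon1 sessions > /tmp/mon-sessions.txt

# Rough counts of sessions by the feature release they negotiated
grep -c luminous /tmp/mon-sessions.txt
grep -c nautilus /tmp/mon-sessions.txt

# Inspect the luminous entries for client addresses
grep luminous /tmp/mon-sessions.txt

# Minimum client release the cluster currently enforces
# (pg-upmap requires at least luminous here)
ceph osd get-require-min-compat-client

The addresses in any pre-Nautilus entries should map back to specific
compute nodes or qemu processes, which would be the ones worth prioritising
for live migration or restarts.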