RBD pool not starting "An error occurred, but the cause is unknown"

Hi all,

I have an issue getting an RBD pool going on a newly deployed compute node.

The storage back-end is a Ceph cluster running Ceph 14 (Nautilus… yes, I know this is old; an update to 18 is planned soon). I have an existing node running Debian 10 (again, updating this is planned, but I'd like to deploy new nodes to migrate the instances to whilst this node is updated), which runs about a dozen VMs with disks on this back-end.

I've loaded a new machine (an MSI Cubi 5 mini PC) up with Alpine Linux 3.21. The boot disk is a 240GB SATA SSD, and there's a 1TB NVMe drive for local VM storage. My intent is to allow VMs to mount RBDs for back-up purposes. The machine has two Ethernet interfaces (a 2.5Gbps and a 1Gbps link): one will be the "front-end" used by the VMs, the other a "back-end" link to talk to Ceph and administer the host.

- Open vSwitch 2.17.11 is deployed with two bridges
- libvirtd 10.9.0 is installed
- an LVM pool called 'data' has been created on the NVMe drive
- Ceph 19.2.0 is installed (libvirtd is linked against this version of librbd; see the check below)
- /etc/ceph has been cloned from my existing working compute node
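
To confirm the librbd linkage, something like this shows which librbd the RBD storage backend module actually loads (the module path is from my Alpine install and may differ elsewhere):

~ # ldd /usr/lib/libvirt/storage-backend/libvirt_storage_backend_rbd.so | grep librbd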

I have two RBD pools: 'one' and 'ha'. 'one' holds most of my virtual machine images (it is from a former OpenNebula install); 'ha' holds core router root disk images ('ha' for high availability; it has stronger replication settings than 'one' to guarantee better reliability).

I've created a `libvirt` user in Ceph, and on the intended node, this works:

~ # rbd --id libvirt ls -p one | head
mastodon-vda
mastodon-vdb
mastodon-vdd
mastodon-vde
one-14
one-15
one-19
one-20
one-22
one-23
~ # rbd --id libvirt ls -p ha | head
core-router-obsd75-vda
core-router-obsd76-vda

I can also access RBD images just fine:
~ # rbd --id libvirt map one/shares-vda
/dev/rbd0
~ # fdisk -l /dev/rbd0
Disk /dev/rbd0: 20 GB, 21474836480 bytes, 41943040 sectors
2610 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Device    Boot StartCHS    EndCHS        StartLBA     EndLBA    Sectors  Size Id Type
/dev/rbd0p1 *  2,0,33      611,8,56          2048     616447     614400  300M 83 Linux
/dev/rbd0p2    611,8,57    1023,15,63      616448    2584575    1968128  961M 82 Linux swap
/dev/rbd0p3    1023,15,63  1023,15,63     2584576   41943039   39358464 18.7G 83 Linux
~ # rbd unmap one/shares-vda
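
Worth noting: `rbd map` goes through the kernel client, whereas libvirt uses librbd in userspace. Something like this, using qemu-img's RBD syntax, exercises a code path closer to the one libvirt takes:

~ # qemu-img info rbd:one/shares-vda:id=libvirt:conf=/etc/ceph/ceph.conf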

The key for this user is registered as a secret in libvirtd:

~ # virsh secret-list
 UUID                                   Usage
--------------------------------------------------------------------
 c14a16b5-bba5-473a-ae9b-53a9a6b0a4e3   ceph client.libvirt secret
~ # virsh secret-dumpxml c14a16b5-bba5-473a-ae9b-53a9a6b0a4e3
<secret ephemeral='no' private='no'>
  <uuid>c14a16b5-bba5-473a-ae9b-53a9a6b0a4e3</uuid>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
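
The secret was defined and its value set the usual way (a sketch; the XML file name is illustrative):

~ # virsh secret-define libvirt-secret.xml
~ # virsh secret-set-value c14a16b5-bba5-473a-ae9b-53a9a6b0a4e3 --base64 "$(ceph auth get-key client.libvirt)"

`virsh secret-get-value` on that UUID can be used to confirm the stored value matches `ceph auth get-key client.libvirt`.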

I have defined four pools, 'temp', 'local', 'ha-images' and 'opennebula-images':
~ # virsh pool-list --all
 Name                State      Autostart
-------------------------------------------
 default             active     yes
 ha-images           active     yes
 local               active     yes
 opennebula-images   inactive   yes
 temp                active     yes

'ha-images' works just fine; this is its config:
~ # virsh pool-dumpxml ha-images
<pool type='rbd'>
  <name>ha-images</name>
  <uuid>6beab982-52b3-495b-a4a7-ab7ebb522ef5</uuid>
  <capacity unit='bytes'>20003977953280</capacity>
  <allocation unit='bytes'>159339114496</allocation>
  <available unit='bytes'>13142248669184</available>
  <source>
    <host name='172.31.252.1' port='6789'/>
    <host name='172.31.252.2' port='6789'/>
    <host name='172.31.252.5' port='6789'/>
    <host name='172.31.252.6' port='6789'/>
    <host name='172.31.252.7' port='6789'/>
    <host name='172.31.252.8' port='6789'/>
    <host name='172.31.252.9' port='6789'/>
    <host name='172.31.252.10' port='6789'/>
    <name>ha</name>
    <auth type='ceph' username='libvirt'>
      <secret uuid='c14a16b5-bba5-473a-ae9b-53a9a6b0a4e3'/>
    </auth>
  </source>
</pool>

'opennebula-images' does not; this is its config:
~ # virsh pool-dumpxml opennebula-images
<pool type='rbd'>
  <name>opennebula-images</name>
  <uuid>fcaa2fa8-f0d2-4919-9168-756a9f4ad7ee</uuid>
  <capacity unit='bytes'>20003977953280</capacity>
  <allocation unit='bytes'>5454371495936</allocation>
  <available unit='bytes'>13142254759936</available>
  <source>
    <host name='172.31.252.1' port='6789'/>
    <host name='172.31.252.2' port='6789'/>
    <host name='172.31.252.5' port='6789'/>
    <host name='172.31.252.6' port='6789'/>
    <host name='172.31.252.7' port='6789'/>
    <host name='172.31.252.8' port='6789'/>
    <host name='172.31.252.9' port='6789'/>
    <host name='172.31.252.10' port='6789'/>
    <name>one</name>
    <auth type='ceph' username='libvirt'>
      <secret uuid='c14a16b5-bba5-473a-ae9b-53a9a6b0a4e3'/>
    </auth>
  </source>
</pool>

It's not obvious what the significant differences are: `name`, `uuid`, `allocation`, `available` and `source/name` are expected to differ, and everything else matches 100%. I've tried removing, and zeroing out, the `capacity`, `allocation` and `available` tags, to no effect.

~ # virsh pool-dumpxml ha-images > /tmp/ha-images.xml
~ # virsh pool-dumpxml opennebula-images > /tmp/opennebula-images.xml
~ # diff -u /tmp/ha-images.xml /tmp/opennebula-images.xml
--- /tmp/ha-images.xml
+++ /tmp/opennebula-images.xml
@@ -1,9 +1,9 @@
 <pool type='rbd'>
-  <name>ha-images</name>
-  <uuid>6beab982-52b3-495b-a4a7-ab7ebb522ef5</uuid>
+  <name>opennebula-images</name>
+  <uuid>fcaa2fa8-f0d2-4919-9168-756a9f4ad7ee</uuid>
   <capacity unit='bytes'>20003977953280</capacity>
-  <allocation unit='bytes'>159339114496</allocation>
-  <available unit='bytes'>13142248669184</available>
+  <allocation unit='bytes'>5454371495936</allocation>
+  <available unit='bytes'>13142254759936</available>
   <source>
     <host name='172.31.252.1' port='6789'/>
     <host name='172.31.252.2' port='6789'/>
@@ -13,7 +13,7 @@
     <host name='172.31.252.8' port='6789'/>
     <host name='172.31.252.9' port='6789'/>
     <host name='172.31.252.10' port='6789'/>
-    <name>ha</name>
+    <name>one</name>
     <auth type='ceph' username='libvirt'>
       <secret uuid='c14a16b5-bba5-473a-ae9b-53a9a6b0a4e3'/>
     </auth>
~ # diff -y /tmp/ha-images.xml /tmp/opennebula-images.xml

When I start this errant pool, I get this:
~ # virsh pool-start opennebula-images
error: Failed to start pool opennebula-images
error: An error occurred, but the cause is unknown

If I crank debugging up in `libvirtd` (via the not-recommended `log_level` setting, directing all output to a file), I see it successfully connect to the pool for about 15 seconds and list the sizes of about a dozen disk images, then seemingly give up and disconnect.
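
For reference, that was roughly this in /etc/libvirt/libvirtd.conf (the log file path is arbitrary), followed by a daemon restart:

log_level = 1
log_outputs = "1:file:/var/log/libvirtd-debug.log"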

2025-01-19 05:16:55.176+0000: 3609: info : vir_object_finalize:319 : OBJECT_DISPOSE: obj=0x7f975fc816a0
2025-01-19 05:16:55.177+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975fc816a0
2025-01-19 05:16:55.183+0000: 3609: debug : virStorageBackendRBDRefreshPool:693 : Utilization of RBD pool one: (kb: 19535134720 kb_avail: 12800438616 num_bytes: 5489355030528)
2025-01-19 05:16:55.988+0000: 3609: debug : volStorageBackendRBDRefreshVolInfo:569 : Refreshed RBD image one/mastodon-vda (capacity: 21474836480 allocation: 21474836480 obj_size: 4194304 num_objs: 5120)
2025-01-19 05:16:55.993+0000: 3609: info : virObjectNew:256 : OBJECT_NEW: obj=0x7f975fa6bba0 classname=virStorageVolObj
2025-01-19 05:16:55.993+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975fa6bba0
2025-01-19 05:16:55.993+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975fa6bba0
2025-01-19 05:16:55.993+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975fa6bba0
2025-01-19 05:16:55.993+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975fa6bba0
2025-01-19 05:16:56.011+0000: 3609: debug : volStorageBackendRBDRefreshVolInfo:569 : Refreshed RBD image one/mastodon-vdb (capacity: 536870912000 allocation: 536870912000 obj_size: 4194304 num_objs: 128000)
…snip…
2025-01-19 05:17:03.756+0000: 3609: debug : volStorageBackendRBDRefreshVolInfo:569 : Refreshed RBD image one/wsmail-vdb (capacity: 21474836480 allocation: 21474836480 obj_size: 4194304 num_objs: 5120)
2025-01-19 05:17:03.758+0000: 3609: info : virObjectNew:256 : OBJECT_NEW: obj=0x7f975f9cf250 classname=virStorageVolObj
2025-01-19 05:17:03.758+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975f9cf250
2025-01-19 05:17:03.758+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975f9cf250
2025-01-19 05:17:03.758+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975f9cf250
2025-01-19 05:17:03.758+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975f9cf250
2025-01-19 05:17:03.777+0000: 3609: debug : volStorageBackendRBDRefreshVolInfo:569 : Refreshed RBD image one/sjl-router-obsd76-vda (capacity: 34359738368 allocation: 34359738368 obj_size: 4194304 num_objs: 8192)
2025-01-19 05:17:03.778+0000: 3609: debug : virStorageBackendRBDCloseRADOSConn:369 : Closing RADOS IoCTX
2025-01-19 05:17:03.778+0000: 3609: debug : virStorageBackendRBDCloseRADOSConn:374 : Closing RADOS connection
2025-01-19 05:17:03.783+0000: 3609: debug : virStorageBackendRBDCloseRADOSConn:378 : RADOS connection existed for 15 seconds
2025-01-19 05:17:03.783+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975f9cf2b0
2025-01-19 05:17:03.783+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975ef90ac0
2025-01-19 05:17:03.783+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975fa6bd20
2025-01-19 05:17:03.783+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975fa6ecc0
…snip…
2025-01-19 05:17:03.785+0000: 3609: info : vir_object_finalize:319 : OBJECT_DISPOSE: obj=0x7f975f9cee90
2025-01-19 05:17:03.785+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975f9cee90
2025-01-19 05:17:03.785+0000: 3609: info : vir_object_finalize:319 : OBJECT_DISPOSE: obj=0x7f975fa6e960
2025-01-19 05:17:03.785+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975fa6e960
2025-01-19 05:17:03.785+0000: 3609: error : storageDriverAutostartCallback:213 : internal error: Failed to autostart storage pool 'opennebula-images': no error
2025-01-19 05:17:03.785+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975fcd0490
2025-01-19 05:17:03.785+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975fcd27c0
2025-01-19 05:17:03.785+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975fcd27c0
2025-01-19 05:17:03.786+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975fcd06d0
2025-01-19 05:17:03.786+0000: 3609: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7f975fcd06d0
2025-01-19 05:17:03.786+0000: 3609: info : virObjectRef:400 : OBJECT_REF: obj=0x7f975fcd0130
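
Since the refresh walks every image in the pool, one thing worth trying is approximating that loop from the shell, to see whether a particular image trips it up (a rough sketch; `rbd info` doesn't make exactly the same librbd calls as libvirt's refresh, but it's close):

~ # for img in $(rbd --id libvirt ls -p one); do rbd --id libvirt info "one/$img" >/dev/null || echo "failed: $img"; done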

If there's no cause for the error, it should not fail; if it fails, there should be a cause listed. There's no excuse for it being "unknown" -- just because Microsoft's OSes make up error codes that their own help system can't explain is no excuse for the open-source world to follow their example.

I'd happily provide more information if someone can give guidance on how to locate it.
--
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.



