Hello all! We are experiencing a strange problem with QEMU virtual machines whose image is hosted on a gluster volume and accessed via the FUSE mount. (Our GFAPI attempt failed; it doesn't seem to work properly with the current QEMU/distro/gluster combination.)
We have the volume tuned for 'virt'. We use qemu-img to create a raw image; sparse or falloc allocation gives the same result. We start a virtual machine (libvirt, qemu-kvm), and libvirt/qemu points to the QEMU image file we created on the fuse mount.

When we create partitions and filesystems, as you might do when installing an operating system, all is well at first. This includes a root XFS filesystem. When we try to re-make the XFS filesystem over the old one, it will not mount and reports XFS corruption. If you dig in with the XFS repair tools, you can find a UUID mismatch between the superblock and the log. The log always retains the UUID of the original filesystem (the one we tried to replace). Running xfs_repair doesn't truly repair anything; it just reports more corruption. Using xfs_db to force the log to be remade doesn't help either.

We can duplicate this even with a QEMU raw image of 50 megabytes. As far as we can tell, XFS is the only filesystem showing this behavior, or at least the only one reporting a problem.

If we take QEMU out of the picture, create partitions directly on the QEMU raw image file, use kpartx to create devices for the partitions, and run a similar test, the gluster-hosted image behaves as you would expect and XFS reports no problem. We can't duplicate the problem outside of QEMU.

We have observed the issue in Rocky 9.4 and SLES15 SP5 environments (including the matching QEMU versions). We have not tested more distros yet. We originally observed the problem with Gluster 9.3 and reproduced it with Gluster 9.6 and 10.5.

If we switch from QEMU RAW to QCOW2, the problem disappears. The problem also does not reproduce when we take gluster out of the equation (that is, when we point QEMU at a local disk image instead of a gluster-hosted one; that works fine).

The problem can be reproduced this way:
Now start the virtual machine that refers to the above adminvm.img file
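For anyone who wants to look at the superblock UUID directly in the raw image from the host (for example before and after the re-mkfs), here is a minimal Python sketch. It is an illustration only, not part of the test procedure above. It assumes the XFS primary superblock starts with the magic "XFSB" and stores sb_uuid at byte offset 32, which is my understanding of the on-disk layout. The names image_path and partition_offset are hypothetical, and since the image is partitioned you would need to supply the byte offset of the XFS partition yourself. It does not locate or read the log, so comparing against the UUID held in the log would need extra work not shown here.

```python
import uuid

XFS_SB_MAGIC = b"XFSB"    # assumed magic at byte 0 of the primary superblock
XFS_SB_UUID_OFFSET = 32   # assumed offset of sb_uuid within the superblock
XFS_SB_UUID_LEN = 16


def xfs_superblock_uuid(sector: bytes) -> uuid.UUID:
    """Return the filesystem UUID stored in an XFS primary superblock buffer."""
    if len(sector) < XFS_SB_UUID_OFFSET + XFS_SB_UUID_LEN:
        raise ValueError("buffer too short to hold an XFS superblock UUID")
    if sector[:4] != XFS_SB_MAGIC:
        raise ValueError("no XFS superblock magic at start of buffer")
    end = XFS_SB_UUID_OFFSET + XFS_SB_UUID_LEN
    return uuid.UUID(bytes=sector[XFS_SB_UUID_OFFSET:end])


def read_superblock_uuid(image_path: str, partition_offset: int) -> uuid.UUID:
    """Read the superblock UUID of the XFS filesystem that starts at
    partition_offset bytes into the raw image (both arguments are hypothetical)."""
    with open(image_path, "rb") as f:
        f.seek(partition_offset)
        return xfs_superblock_uuid(f.read(512))
```

Calling read_superblock_uuid() on the image through the fuse mount and comparing the result with what the guest reports for the new filesystem would show whether the superblock written inside the VM actually reached the file.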
Here are the volume settings:

# gluster volume info adminvm

Volume Name: adminvm
Type: Replicate
Volume ID: de655913-aad9-4e17-bac4-ff0ad9c28223
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.254.181:/data/brick_adminvm_slot2
Brick2: 172.23.254.182:/data/brick_adminvm_slot2
Brick3: 172.23.254.183:/data/brick_adminvm_slot2
Options Reconfigured:
storage.owner-gid: 107
storage.owner-uid: 107
performance.io-thread-count: 32
network.frame-timeout: 10800
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 20
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: disable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
cluster.granular-entry-heal: enable
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

Any help or ideas would be appreciated. Let us know if we have a setting incorrect or have made an error.

Thank you all!

Erik
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users