Running fstrim (discard) inside a KVM guest with an RBD disk device corrupts the ext4 filesystem

Hey Ceph users,

we are currently facing some serious problems on our Ceph cluster with libvirt (KVM), RBD devices and fstrim running inside the VMs.

The problem: right after running the fstrim command inside a VM, the ext4 filesystem is corrupted and remounted read-only with the following error messages:

EXT4-fs error (device sda1): ext4_mb_generate_buddy:756: group 136, block bitmap and bg descriptor inconsistent: 32200 vs 32768 free clusters
Aborting journal on device sda1-8
EXT4-fs (sda1): Remounting filesystem read-only
EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected aborted journal
EXT4-fs (sda1): Remounting filesystem read-only
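
For completeness, the trim itself is nothing special; something like the following is enough to trigger it (the mount point is just an example, -v only reports how much was discarded):

# fstrim -v /
# fstrim -av     <- alternatively, trim all mounted filesystems that support discard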

This behavior is reproducible across several VMs with different guest OSes (Ubuntu 14.04, 16.04 and 18.04), so we suspect a bug or a configuration problem related to the RBD devices.

Our setup on the hosts running the VMs looks like:
# lsb_release -d
Description:    Ubuntu 20.04 LTS
# uname -a
Linux XXX 5.4.0-37-generic #41-Ubuntu SMP Wed Jun 3 18:57:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
# ceph --version
ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)

-> I know there is an update to Ceph 15.2.4, but I haven't seen any fstrim/discard-related changes in its changelog. If 15.2.4 fixed the problem I would of course be happy to upgrade...

The libvirt config for the RBD device with fstrim (discard) support is the following:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='directsync' io='native' discard='unmap'/>
      <auth username='libvirt'>
        <secret type='ceph' usage='client.libvirt'/>
      </auth>
      <source protocol='rbd' name='cephstorage/testtrim_system'>
        <host name='XXX' port='6789'/>
        <host name='XXX' port='6789'/>
        <host name='XXX' port='6789'/>
        <host name='XXX' port='6789'/>
        <host name='XXX' port='6789'/>
      </source>
      <target dev='sda' bus='scsi'/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
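
In case it helps with reproducing: inside the guest, the discard parameters exposed by the virtual SCSI disk can be inspected with standard tools, e.g. (device name sda as in the config above):

# lsblk --discard /dev/sda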

The Ceph docs (https://docs.ceph.com/docs/octopus/rbd/qemu-rbd/) gave me some hints about enabling trim/discard, and I also tested 4M as the discard granularity, but I got the same error and a corrupted ext4 filesystem.
Changes made to the libvirt config:
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.scsi0-0-0-0.discard_granularity=4194304'/>
  </qemu:commandline>
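
(To check whether the override actually reaches the guest, the value can be read back from sysfs; if the setting is picked up, the following should report 4194304:)

# cat /sys/block/sda/queue/discard_granularity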

As the RBD devices are thin-provisioned, we really need to call fstrim inside the VMs regularly to free up unused blocks, otherwise our Ceph pool will run out of space.
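
For context, the space actually consumed by an image (vs. its provisioned size) can be checked on the Ceph side with rbd du, which is where the effect of a successful trim would become visible:

# rbd du cephstorage/testtrim_system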

Any ideas what could be wrong with our RBD setup or can somebody else reproduce the problem?
Any hints on how to debug this problem?
Any related/open Ceph issues? (I could not find one.)

Thanks a lot for your help, Georg
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



