Re: Running fstrim (discard) inside KVM machine with RBD as disk device corrupts ext4 filesystem


 



On 03.08.20 14:56, Jason Dillaman wrote:
On Mon, Aug 3, 2020 at 4:11 AM Georg Schönberger
<g.schoenberger@xxxxxxxxxx> wrote:
Hey Ceph users,

we are currently facing some serious problems on our Ceph Cluster with
libvirt (KVM), RBD devices and FSTRIM running inside VMs.

The problem is that right after running the fstrim command inside the VM, the
ext4 filesystem is corrupted and remounted read-only with the following error messages:

EXT4-fs error (device sda1): ext4_mb_generate_buddy:756: group 136,
block bitmap and bg descriptor inconsistent: 32200 vs 32768 free clusters
Aborting journal on device sda1-8
EXT4-fs (sda1): Remounting filesystem read-only
EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected
aborted journal
EXT4-fs (sda1): Remounting filesystem read-only

This behavior is reproducible across several VMs with different OS versions
(Ubuntu 14.04, 16.04 and 18.04), so we suspect a bug or a
configuration problem related to RBD devices.

Our setup on the hosts running the VMs looks like:
# lsb_release -d
Description:    Ubuntu 20.04 LTS
# uname -a
Linux XXX 5.4.0-37-generic #41-Ubuntu SMP Wed Jun 3 18:57:02 UTC 2020
x86_64 x86_64 x86_64 GNU/Linux
# ceph --version
ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus
(stable)

-> I know there's the update to Ceph 15.2.4 but I haven't seen any
fstrim/discard related changes in the changelog. If we could fix the
problem with 15.2.4 I would be happy...

The libvirt config for the RBD device with fstrim (discard) support
is the following:

      <disk type='network' device='disk'>
        <driver name='qemu' type='raw' cache='directsync' io='native' discard='unmap'/>
        <auth username='libvirt'>
          <secret type='ceph' usage='client.libvirt'/>
        </auth>
        <source protocol='rbd' name='cephstorage/testtrim_system'>
          <host name='XXX' port='6789'/>
          <host name='XXX' port='6789'/>
          <host name='XXX' port='6789'/>
          <host name='XXX' port='6789'/>
          <host name='XXX' port='6789'/>
        </source>
        <target dev='sda' bus='scsi'/>
        <boot order='2'/>
        <address type='drive' controller='0' bus='0' target='0' unit='0'/>
      </disk>

The ceph docs (https://docs.ceph.com/docs/octopus/rbd/qemu-rbd/) gave me
some hints about enabling trim/discard, and I tested using 4M as the
discard granularity, but I got the same error resulting in a corrupted
ext4 filesystem.
Changes made to the libvirt config:
    <qemu:commandline>
      <qemu:arg value='-set'/>
      <qemu:arg value='device.scsi0-0-0-0.discard_granularity=4194304'/>
    </qemu:commandline>
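
To double-check that the override actually reaches the guest, the exposed granularity can be read from sysfs inside the VM (a quick check on my side; it assumes the disk shows up as sda, as in the config above):

# cat /sys/block/sda/queue/discard_granularity

This should report 4194304 if the 4M setting took effect.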

As the RBD devices are thin-provisioned we really need to run fstrim
inside the VMs regularly to free up unused blocks, otherwise our Ceph
pool will run out of space.
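
For completeness, this is roughly what we run inside the guests (illustrative; the mount point is just an example):

# fstrim -v /     (trim a single mounted filesystem)
# fstrim -av      (trim all mounted filesystems that support discard)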

Any ideas what could be wrong with our RBD setup or can somebody else
reproduce the problem?
Any hints on how to debug this problem?
Any related/open Ceph issues? (I could not find one)

I haven't heard of any similar issue. I would recommend trying an
older release of librbd1 (i.e. Nautilus), older release of QEMU, or
older guest OS in different combinations to see what the common factor
is.

Thanks a lot for your help, Georg

Digging deeper into the problem, I can tell that fstrim is not the cause but only the trigger that leads to the read-only ext4 filesystem.

Our problem only occurs with VMs that were recently migrated from one Ceph cluster to another one with rbd export and import. This is how it was done (a command sketch follows the list):
1. rbd first snap of running VM
2. rbd export-diff of first snap -> rbd import-diff in new cluster
3. Stop VM
4. rbd second snap of stopped VM
5. rbd export-diff of second snap with --from-snap first snap -> rbd import-diff in new cluster
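
Roughly, the commands look like this (a minimal sketch; the snapshot names are placeholders, the destination image is assumed to already exist in the new cluster, and "--cluster new" just stands in for however the second cluster is reached):

Steps 1 + 2 (VM still running):
  rbd snap create cephstorage/testtrim_system@migrate1
  rbd export-diff cephstorage/testtrim_system@migrate1 - | rbd --cluster new import-diff - cephstorage/testtrim_system

Step 3: stop the VM

Steps 4 + 5 (VM stopped, ship only the delta since the first snap):
  rbd snap create cephstorage/testtrim_system@migrate2
  rbd export-diff --from-snap migrate1 cephstorage/testtrim_system@migrate2 - | rbd --cluster new import-diff - cephstorage/testtrim_system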

In some cases we now got a corrupted ext4 filesystem with this type of migration, but it is not clear to us why, because the VM was stopped in step 3!
Anything wrong with our sequence of commands?

One thing we think could be a possible cause is the rbd cache enabled in our ceph.conf:
[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = false
rbd_cache_size = 536870912
rbd_cache_max_dirty = 134217728
rbd_cache_target_dirty = 33554432
rbd_cache_max_dirty_age = 5
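
(If I understand the rbd config commands correctly, the effective librbd settings for an image can be listed with something like

  rbd config image list cephstorage/testtrim_system

which should show whether the cache values above are actually being picked up.)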

So if step 4 is done directly after step 3, without waiting for the rbd cache to be flushed, could this be the cause of the data corruption?
Any rbd commands to tell Ceph to flush the rbd cache?
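
One workaround we are considering for the snapshot of the running VM in step 1 is to quiesce the guest through the QEMU guest agent before snapshotting, roughly (an untested sketch; the domain name is a placeholder):

  virsh domfsfreeze testvm
  rbd snap create cephstorage/testtrim_system@migrate1
  virsh domfsthaw testvm

Would that be enough, or can dirty data still be sitting in the librbd cache at that point?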

THX, Georg

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



