On Tue, Aug 4, 2020 at 2:12 AM Georg Schönberger <g.schoenberger@xxxxxxxxxx> wrote:
>
> On 03.08.20 14:56, Jason Dillaman wrote:
> > On Mon, Aug 3, 2020 at 4:11 AM Georg Schönberger
> > <g.schoenberger@xxxxxxxxxx> wrote:
> >> Hey Ceph users,
> >>
> >> we are currently facing some serious problems on our Ceph cluster with
> >> libvirt (KVM), RBD devices and fstrim running inside VMs.
> >>
> >> The problem is that right after running the fstrim command inside the
> >> VM, the ext4 filesystem is corrupted and read-only, with the following
> >> error message:
> >>
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:756: group 136,
> >> block bitmap and bg descriptor inconsistent: 32200 vs 32768 free clusters
> >> Aborting journal on device sda1-8
> >> EXT4-fs (sda1): Remounting filesystem read-only
> >> EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected
> >> aborted journal
> >> EXT4-fs (sda1): Remounting filesystem read-only
> >>
> >> This behavior is reproducible across several VMs with different OS
> >> versions (Ubuntu 14.04, 16.04 and 18.04), so we guess it is a bug or a
> >> configuration problem regarding RBD devices.
> >>
> >> Our setup on the hosts running the VMs looks like:
> >> # lsb_release -d
> >> Description:    Ubuntu 20.04 LTS
> >> # uname -a
> >> Linux XXX 5.4.0-37-generic #41-Ubuntu SMP Wed Jun 3 18:57:02 UTC 2020
> >> x86_64 x86_64 x86_64 GNU/Linux
> >> # ceph --version
> >> ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus
> >> (stable)
> >>
> >> -> I know there's the update to Ceph 15.2.4, but I haven't seen any
> >> fstrim/discard-related changes in the changelog. If we could fix the
> >> problem with 15.2.4 I would be happy...
> >>
> >> The libvirt config for the RBD device with fstrim (discard) support
> >> is the following:
> >>
> >> <disk type='network' device='disk'>
> >>   <driver name='qemu' type='raw' cache='directsync' io='native'
> >>           discard='unmap'/>
> >>   <auth username='libvirt'>
> >>     <secret type='ceph' usage='client.libvirt'/>
> >>   </auth>
> >>   <source protocol='rbd' name='cephstorage/testtrim_system'>
> >>     <host name='XXX' port='6789'/>
> >>     <host name='XXX' port='6789'/>
> >>     <host name='XXX' port='6789'/>
> >>     <host name='XXX' port='6789'/>
> >>     <host name='XXX' port='6789'/>
> >>   </source>
> >>   <target dev='sda' bus='scsi'/>
> >>   <boot order='2'/>
> >>   <address type='drive' controller='0' bus='0' target='0' unit='0'/>
> >> </disk>
> >>
> >> The Ceph docs (https://docs.ceph.com/docs/octopus/rbd/qemu-rbd/) gave me
> >> some hints about enabling trim/discard, and I tested using 4M as discard
> >> granularity, but I got the same error resulting in a corrupted ext4
> >> filesystem.
> >> Changes made to the libvirt config:
> >> <qemu:commandline>
> >>   <qemu:arg value='-set'/>
> >>   <qemu:arg value='device.scsi0-0-0-0.discard_granularity=4194304'/>
> >> </qemu:commandline>
> >>
> >> As the RBD devices are thin-provisioned, we really need to call fstrim
> >> inside the VM regularly to free up unused blocks, otherwise our Ceph
> >> pool will run out of space.
> >>
> >> Any ideas what could be wrong with our RBD setup, or can somebody else
> >> reproduce the problem?
> >> Any hints on how to debug this problem?
> >> Any related/open Ceph issues? (I could not find one)
> > I haven't heard of any similar issue. I would recommend trying an
> > older release of librbd1 (i.e. Nautilus), an older release of QEMU, or
> > an older guest OS in different combinations to see what the common
> > factor is.
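A quick way to sanity-check whether the guest's discards actually reach the
RBD image -- a minimal sketch, assuming the disk shows up as /dev/sda in the
guest (per the <target dev='sda'/> above) and using the image name from the
config -- is to compare fstrim's output with the image's allocated size
before and after trimming:

# lsblk --discard /dev/sda                (in the guest: discard granularity/limits the disk exposes)
# fstrim -v /                             (in the guest: trim and report how many bytes were discarded)
# rbd du cephstorage/testtrim_system      (on a Ceph client host: allocated vs. provisioned size)

If the used size reported by rbd du shrinks after fstrim, the unmap requests
are reaching the RBD image; if it does not, the discards are likely being
dropped in the QEMU/SCSI layer rather than in librbd.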
> >
> >> Thanks a lot for your help, Georg
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>
> Digging deeper into the problem, I can tell that fstrim is not the cause
> but the trigger that leads to a read-only ext4 filesystem.
>
> Our problem only occurs with VMs that were recently migrated from one
> Ceph cluster to another one with rbd export and import. This is how it
> was done:
> 1. rbd first snap of running VM
> 2. rbd export-diff of first snap -> rbd import-diff in new cluster
> 3. Stop VM
> 4. rbd second snap of stopped VM
> 5. rbd export-diff of second snap with --from-snap first snap -> rbd
>    import-diff in new cluster

Ah, well that might be this issue [1]. Can you re-test using the latest
available Octopus dev release of librbd1 / ceph-common (for the 'rbd'
CLI) [2]?

> In some cases we now got a corrupted ext4 filesystem with this type of
> migration, but it is not clear to us why, because the VM was stopped in
> step 3!
> Anything wrong with our sequence of commands?
>
> One thing we think could be a possible cause is the enabled rbd cache in
> our ceph.conf:
> [client]
> rbd_cache = true
> rbd_cache_writethrough_until_flush = false
> rbd_cache_size = 536870912
> rbd_cache_max_dirty = 134217728
> rbd_cache_target_dirty = 33554432
> rbd_cache_max_dirty_age = 5
>
> So if step 4 is done directly after step 3 without waiting for the rbd
> cache to be flushed, could this be the cause of data corruption?
> Any rbd commands to tell Ceph to flush the rbd cache?
>
> THX, Georg
>

[1] https://tracker.ceph.com/issues/46674
[2] https://shaman.ceph.com/repos/ceph/octopus/

--
Jason
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
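For anyone trying to reproduce the migration sequence described above, a
minimal sketch of steps 1-5 -- the snapshot names migrate1/migrate2 and the
destination cluster name "newcluster" are placeholders, and the destination
image must already exist (e.g. from an initial rbd create or a full
rbd export | rbd import) before import-diff can apply a diff to it:

1. rbd snap create cephstorage/testtrim_system@migrate1
2. rbd export-diff cephstorage/testtrim_system@migrate1 - | \
       rbd --cluster newcluster import-diff - cephstorage/testtrim_system
3. (stop the VM)
4. rbd snap create cephstorage/testtrim_system@migrate2
5. rbd export-diff --from-snap migrate1 cephstorage/testtrim_system@migrate2 - | \
       rbd --cluster newcluster import-diff - cephstorage/testtrim_system

Step 5 transfers only the delta between the two snapshots, which is why the
second snapshot is taken after the VM has been stopped.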