Dear Ceph users,
I have a Ceph cluster (version 16.2.7) running on Proxmox 7 (Debian
11), with a Debian 11 (11.3) guest.
On this specific guest, roughly every 3-4 weeks, the file system is
remounted read-only. I then need to reboot the system and manually
check the file system, after which it works again.
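In case it matters: by "manually check" I mean roughly the following,
from a rescue shell (the affected device is /dev/vdc, as seen in the
log below; the mount point is just a placeholder for my real one):
------------ snip ------------------
# from a rescue shell, with /dev/vdc unmounted:
fsck.ext4 -f /dev/vdc        # force a full check, confirm the repairs
mount /dev/vdc /mnt/data     # remount; works again until the next incident
------------ snip ------------------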
In the syslog I find the following entries:
------------ snip ------------------
[Do Jul 7 13:11:43 2022] device-mapper: uevent: version 1.0.3
[Do Jul 7 13:11:43 2022] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@xxxxxxxxxx
[Do Jul 7 13:11:45 2022] SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled
[Do Jul 7 13:11:45 2022] JFS: nTxBlock = 8192, nTxLock = 65536
[Do Jul 7 13:11:45 2022] QNX4 filesystem 0.2.3 registered.
[Do Jul 7 13:11:46 2022] raid6: sse2x4 gen() 8327 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x4 xor() 3454 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x2 gen() 6684 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x2 xor() 7149 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x1 gen() 6194 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x1 xor() 5981 MB/s
[Do Jul 7 13:11:46 2022] raid6: using algorithm sse2x4 gen() 8327 MB/s
[Do Jul 7 13:11:46 2022] raid6: .... xor() 3454 MB/s, rmw enabled
[Do Jul 7 13:11:46 2022] raid6: using intx1 recovery algorithm
[Do Jul 7 13:11:46 2022] xor: measuring software checksum speed
[Do Jul 7 13:11:46 2022] prefetch64-sse : 12793 MB/sec
[Do Jul 7 13:11:46 2022] generic_sse : 9978 MB/sec
[Do Jul 7 13:11:46 2022] xor: using function: prefetch64-sse (12793 MB/sec)
[Do Jul 7 13:11:46 2022] Btrfs loaded, crc32c=crc32c-generic
[Fr Jul 8 14:11:35 2022] EXT4-fs (vdc): error count since last fsck: 7
[Fr Jul 8 14:11:35 2022] EXT4-fs (vdc): initial error at time 1654815634: ext4_check_bdev_write_error:215
[Fr Jul 8 14:11:35 2022] EXT4-fs (vdc): last error at time 1657148419: ext4_journal_check_start:83
[Sa Jul 9 15:49:30 2022] EXT4-fs (vdc): error count since last fsck: 7
[Sa Jul 9 15:49:30 2022] EXT4-fs (vdc): initial error at time 1654815634: ext4_check_bdev_write_error:215
[Sa Jul 9 15:49:30 2022] EXT4-fs (vdc): last error at time 1657148419: ext4_journal_check_start:83
[So Jul 10 17:27:26 2022] EXT4-fs (vdc): error count since last fsck: 7
[So Jul 10 17:27:26 2022] EXT4-fs (vdc): initial error at time 1654815634: ext4_check_bdev_write_error:215
[So Jul 10 17:27:26 2022] EXT4-fs (vdc): last error at time 1657148419: ext4_journal_check_start:83
------------ snip ------------------
First of all, I don't understand why there are any messages about
raid6 at all, since I don't use RAID on this guest.
Anyway, what troubles me are the EXT4 errors above. Why are these file
system errors happening? It seems that they pile up, and once there
are too many, the file system switches to read-only.
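As a data point, the two epoch timestamps decode to dates that span
almost exactly one of those 3-4 week intervals, and the same counters
can also be read from the superblock directly (vdc as above):
------------ snip ------------------
$ date -u -d @1654815634              # -> Thu Jun  9 2022, ~23:00 UTC (initial error)
$ date -u -d @1657148419              # -> Wed Jul  6 2022, ~23:00 UTC (last error)
$ tune2fs -l /dev/vdc | grep -i error # FS error count, first/last error time
------------ snip ------------------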
The strange thing is that I have other guests running on this cluster
(and on the same node) that show no such errors at all.
What I can think of is the following:
- The underlying storage consists of 4 nodes with 2 * 8 TB Toshiba
hard disks each. None of the disks shows any SMART errors, by the way.
- However, the storage can be VERY slow from time to time.
- Maybe there is some timeout value in Debian 11, so that when the storage
is temporarily slow, this "ext4_check_bdev_write_error" is triggered?
Is this a reasonable explanation, and if so, is there some way to
increase this timeout?
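As far as I understand, the switch to read-only itself is not a
timeout but ext4's configured error behavior (errors=remount-ro).
I can at least check what is set, again for /dev/vdc:
------------ snip ------------------
$ tune2fs -l /dev/vdc | grep -i behavior  # "Continue" or "Remount read-only"
$ grep vdc /proc/mounts                   # shows errors=... among the mount options
------------ snip ------------------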
Maybe it makes sense to switch from VirtIO to VirtIO SCSI, as such
errors might be handled better there?
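If someone can confirm that, I would try something like the following
on the Proxmox node (VM ID 100, storage "ceph-pool" and the disk name
are placeholders for my setup; vdc should correspond to virtio2):
------------ snip ------------------
# with the guest shut down:
qm set 100 --scsihw virtio-scsi-pci         # use the virtio-scsi controller
qm set 100 --delete virtio2                 # detach the disk (becomes "unused")
qm set 100 --scsi2 ceph-pool:vm-100-disk-2  # reattach the same volume as scsi2
# inside the guest, the disk then shows up as sdX with a tunable timeout:
cat /sys/block/sda/device/timeout           # seconds; default 30
------------ snip ------------------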
Or could it be that there is some kind of silent data loss in my Ceph
setup, and if so, what can I do?
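In that case, would the following be enough to rule out
inconsistencies on the Ceph side (with <pool> and <pgid> filled in
from my setup)?
------------ snip ------------------
ceph health detail                         # SLOW_OPS / PG_DAMAGED / scrub errors?
ceph pg dump pgs_brief | grep -v 'active+clean'
rados list-inconsistent-pg <pool>          # PGs with scrub inconsistencies, per pool
ceph pg deep-scrub <pgid>                  # force a deep scrub of a suspect PG
------------ snip ------------------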
Best Regards,
Hermann
--
hermann@xxxxxxx
PGP/GPG: 299893C7 (on keyservers)