Dear Ceph users,
I have a Ceph cluster (version 16.2.7) running on Proxmox 7 (Debian
11), with a Debian 11 (11.3) guest.
On this specific guest, roughly every 3-4 weeks, the file system is
remounted read-only. I then need to reboot the system and manually
check the file system, after which it works again.
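In case it matters: by "manually check" I mean roughly the following,
from a rescue shell (the affected device is /dev/vdc, as seen in the
log below; the mount point is just a placeholder for my real one):
------------ snip ------------------
# from a rescue shell, with /dev/vdc unmounted:
fsck.ext4 -f /dev/vdc        # force a full check, confirm the repairs
mount /dev/vdc /mnt/data     # remount; works again until the next incident
------------ snip ------------------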
In the syslog I find the following entries:
------------ snip ------------------
[Do Jul 7 13:11:43 2022] device-mapper: uevent: version 1.0.3
[Do Jul 7 13:11:43 2022] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@xxxxxxxxxx
[Do Jul 7 13:11:45 2022] SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled
[Do Jul 7 13:11:45 2022] JFS: nTxBlock = 8192, nTxLock = 65536
[Do Jul 7 13:11:45 2022] QNX4 filesystem 0.2.3 registered.
[Do Jul 7 13:11:46 2022] raid6: sse2x4 gen() 8327 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x4 xor() 3454 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x2 gen() 6684 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x2 xor() 7149 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x1 gen() 6194 MB/s
[Do Jul 7 13:11:46 2022] raid6: sse2x1 xor() 5981 MB/s
[Do Jul 7 13:11:46 2022] raid6: using algorithm sse2x4 gen() 8327 MB/s
[Do Jul 7 13:11:46 2022] raid6: .... xor() 3454 MB/s, rmw enabled
[Do Jul 7 13:11:46 2022] raid6: using intx1 recovery algorithm
[Do Jul 7 13:11:46 2022] xor: measuring software checksum speed
[Do Jul 7 13:11:46 2022] prefetch64-sse : 12793 MB/sec
[Do Jul 7 13:11:46 2022] generic_sse : 9978 MB/sec
[Do Jul 7 13:11:46 2022] xor: using function: prefetch64-sse (12793 MB/sec)
[Do Jul 7 13:11:46 2022] Btrfs loaded, crc32c=crc32c-generic
[Fr Jul 8 14:11:35 2022] EXT4-fs (vdc): error count since last fsck: 7
[Fr Jul 8 14:11:35 2022] EXT4-fs (vdc): initial error at time 1654815634: ext4_check_bdev_write_error:215
[Fr Jul 8 14:11:35 2022] EXT4-fs (vdc): last error at time 1657148419: ext4_journal_check_start:83
[Sa Jul 9 15:49:30 2022] EXT4-fs (vdc): error count since last fsck: 7
[Sa Jul 9 15:49:30 2022] EXT4-fs (vdc): initial error at time 1654815634: ext4_check_bdev_write_error:215
[Sa Jul 9 15:49:30 2022] EXT4-fs (vdc): last error at time 1657148419: ext4_journal_check_start:83
[So Jul 10 17:27:26 2022] EXT4-fs (vdc): error count since last fsck: 7
[So Jul 10 17:27:26 2022] EXT4-fs (vdc): initial error at time 1654815634: ext4_check_bdev_write_error:215
[So Jul 10 17:27:26 2022] EXT4-fs (vdc): last error at time 1657148419: ext4_journal_check_start:83
------------ snip ------------------
First of all, I don't understand why there are any messages about
raid6 at all, since I don't use RAID on this guest.
Anyway, what troubles me are the EXT4 errors above. Why are these file
system errors happening? It seems that they pile up, and once there
are too many, the file system switches to read-only.
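As a data point, the two epoch timestamps decode to dates that span
almost exactly one of those 3-4 week intervals, and the same counters
can also be read from the superblock directly (vdc as above):
------------ snip ------------------
$ date -u -d @1654815634              # -> Thu Jun  9 2022, ~23:00 UTC (initial error)
$ date -u -d @1657148419              # -> Wed Jul  6 2022, ~23:00 UTC (last error)
$ tune2fs -l /dev/vdc | grep -i error # FS error count, first/last error time
------------ snip ------------------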
The strange thing is that I have other guests running on this cluster
(and on the same node) that show no such errors at all.
What I can think of is the following:
- The underlying storage consists of 4 nodes with 2 * 8 TB Toshiba
hard disks each. None of the disks shows any SMART errors, by the way.
- However, the storage can be VERY slow from time to time.
- Maybe there is some timeout value in Debian 11, so that when the storage
is temporarily slow, this "ext4_check_bdev_write_error" is triggered?
Is this a reasonable explanation, and if so, is there some way to
increase this timeout?
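As far as I understand, the switch to read-only itself is not a
timeout but ext4's configured error behavior (errors=remount-ro).
I can at least check what is set, again for /dev/vdc:
------------ snip ------------------
$ tune2fs -l /dev/vdc | grep -i behavior  # "Continue" or "Remount read-only"
$ grep vdc /proc/mounts                   # shows errors=... among the mount options
------------ snip ------------------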
Maybe it makes sense to switch from VirtIO to VirtIO SCSI, as such
errors might be handled better there?
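If someone can confirm that, I would try something like the following
on the Proxmox node (VM ID 100, storage "ceph-pool" and the disk name
are placeholders for my setup; vdc should correspond to virtio2):
------------ snip ------------------
# with the guest shut down:
qm set 100 --scsihw virtio-scsi-pci         # use the virtio-scsi controller
qm set 100 --delete virtio2                 # detach the disk (becomes "unused")
qm set 100 --scsi2 ceph-pool:vm-100-disk-2  # reattach the same volume as scsi2
# inside the guest, the disk then shows up as sdX with a tunable timeout:
cat /sys/block/sda/device/timeout           # seconds; default 30
------------ snip ------------------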
Or could it be that there is some kind of silent data loss in my Ceph
setup, and if so, what can I do?
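In that case, would the following be enough to rule out
inconsistencies on the Ceph side (with <pool> and <pgid> filled in
from my setup)?
------------ snip ------------------
ceph health detail                         # SLOW_OPS / PG_DAMAGED / scrub errors?
ceph pg dump pgs_brief | grep -v 'active+clean'
rados list-inconsistent-pg <pool>          # PGs with scrub inconsistencies, per pool
ceph pg deep-scrub <pgid>                  # force a deep scrub of a suspect PG
------------ snip ------------------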
Best Regards,
Hermann
--
hermann@xxxxxxx
PGP/GPG: 299893C7 (on keyservers)