Hi Alvaro,
> Can you post the entire Ceph status output?
Pasting it here since it is short:
  cluster:
    id:     d9000ec0-93c2-479f-bd5d-94ae9673e347
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 3 daemons, quorum node-4,node-5,node-6 (age 52m)
    mgr: node-5(active, since 7d), standbys: node-6, node-4
    mds: 1/1 daemons up, 2 standby
    osd: 36 osds: 36 up (since 5d), 36 in (since 6d)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 832 pgs
    objects: 506.83M objects, 67 TiB
    usage:   207 TiB used, 232 TiB / 439 TiB avail
    pgs:     826 active+clean
             5   active+clean+scrubbing+deep
             1   active+clean+inconsistent

  io:
    client: 18 MiB/s wr, 0 op/s rd, 5 op/s wr
> sometimes list-inconsistent-obj throws that error if a scrub job is still running.
That would surprise me, because I already replaced the broken OSD "2" disk 7 days ago, and "list-inconsistent-obj" has not worked at any point since then.
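For reference, the invocation I mean is along these lines (2.87 being the PG that the deep-scrub flagged below; the output-format flag is just what I would normally add):

rados list-inconsistent-obj 2.87 --format=json-pretty

Meanwhile, this is what OSD "33" has logged about that PG: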
grep -Hn 'ERR' /var/log/ceph/ceph-osd.33.log
/var/log/ceph/ceph-osd.33.log:8005229:2023-06-16T16:29:57.704+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 shard 2 soid 2:e18c2025:::1001c78d046.00000000:head : candidate had a read error
/var/log/ceph/ceph-osd.33.log:8018716:2023-06-16T20:03:26.923+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 0 missing, 1 inconsistent objects
/var/log/ceph/ceph-osd.33.log:8018717:2023-06-16T20:03:26.923+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 1 errors
The time "2023-06-16T16:29:57" above is the time at which the disk that carried OSD "2" broke, its logs around the time are:
/var/log/ceph/ceph-osd.2.log:7855741:2023-06-16T16:29:57.690+0000 7fbae3cf7640 -1 bdev(0x7fbaeef6c400 /var/lib/ceph/osd/ceph-2/block) _aio_thread got r=-5 ((5) Input/output error)
/var/log/ceph/ceph-osd.2.log:7855743:2023-06-16T16:29:57.690+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8df449f9:::10016e7a962.00000000:head, will try copies on 19,32
/var/log/ceph/ceph-osd.2.log:7855747:2023-06-16T16:29:57.691+0000 7fba63064640 -1 log_channel(cluster) log [ERR] : 2.a6 missing primary copy of 2:65bd8cda:::10016ea4e67.00000000:head, will try copies on 17,28
-- note time jump by 3 days --
/var/log/ceph/ceph-osd.2.log:8096330:2023-06-19T06:42:48.712+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
/var/log/ceph/ceph-osd.2.log:8108684: -1867> 2023-06-19T06:42:48.712+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
/var/log/ceph/ceph-osd.2.log:8108766: -1785> 2023-06-19T06:42:49.035+0000 7fba6d879640 10 log_client will send 2023-06-19T06:42:48.713712+0000 osd.2 (osd.2) 179 : cluster [ERR] 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
/var/log/ceph/ceph-osd.2.log:8108770: -1781> 2023-06-19T06:42:49.525+0000 7fba7787f640 10 log_client logged 2023-06-19T06:42:48.713712+0000 osd.2 (osd.2) 179 : cluster [ERR] 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
/var/log/ceph/ceph-osd.2.log:8111339:2023-06-19T06:51:13.940+0000 7fb1518126c0 -1 ** ERROR: osd init failed: (5) Input/output error
Does "candidate had a read error" on OSD "33" mean that a BlueStore checksum error was detected on OSD "33" at the same time as the OSD "2" disk failed?
If yes, maybe that is the explanation:
* pg 2.87 is backed by OSDs [33,2,20]; OSD 2's hardware broke during the scrub, OSD 33 detected a checksum error during the same scrub, and thus we are left with 2 OSDs (33 and 20) whose checksums disagree.
I am just guessing this, though (the PG-to-OSD mapping itself can be double-checked as sketched below).
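That is, the [33,2,20] set comes from something along the lines of:

ceph pg map 2.87

which prints the up and acting OSD sets for the PG ("ceph pg 2.87 query" shows the same in more detail, including the peering state).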
Also, if this is correct, the next question would be: what about OSD 20?
Since no error at all is reported for OSD 20, I assume that its checksum agrees with its own data.
Now, can I find out whether OSD 20's copy (and its checksum) agrees with OSD 33's data?
(Side note: The disk of OSD 33 looks fine in smartctl.)
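One idea I had, though I am not sure it is the right approach: once the currently running scrubs have finished, re-trigger a deep scrub of just this PG and then look at the per-shard error fields, i.e. something like

ceph pg deep-scrub 2.87
# wait for the deep scrub of 2.87 to complete, then:
rados list-inconsistent-obj 2.87 --format=json-pretty

assuming "list-inconsistent-obj" starts returning data again. Would that output tell me whether the copy on OSD 20 matches the one on OSD 33, or is there a better way to compare the two remaining copies?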
Thanks,
Niklas