Hi Niklas,

please don't do any of the recovery steps yet! Your problem is almost certainly a non-issue.

I had a failed disk with 3 scrub errors, leading to the same "candidate had a read error" messages you have. ceph status/df/pool stats/health detail at 00:00:06:

  cluster:
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent

After rebuilding the data, it still looked like this:

  cluster:
    health: HEALTH_ERR
            2 scrub errors
            Possible data damage: 2 pgs inconsistent

What's the issue here? The PGs have not been deep-scrubbed after the rebuild. The reply "no scrub data available" from list-inconsistent-obj is the clue. The right response is not to attempt a manual repair but to issue a deep-scrub. Unfortunately, "ceph pg deep-scrub ..." does not really work here; the deep-scrub reservation almost always gets cancelled very quickly. I got a script that forces the repair/deep-scrub (I don't remember who sent it to me), and it gets the job done:

=====================
#!/bin/bash

# Load the Ceph shell environment if present.
[[ -r "/etc/profile.d/ceph.sh" ]] && source "/etc/profile.d/ceph.sh"

# Walk over all PGs currently flagged inconsistent.
for PG in $(ceph pg ls inconsistent -f json | jq -r '.pg_stats[].pgid')
do
    echo "Checking inconsistent PG $PG"
    if ceph pg ls repair | grep -wq "${PG}"
    then
        echo "PG $PG is already repairing, skipping"
        continue
    fi

    # Disable other scrubs so the repair reservation does not get cancelled.
    ceph osd set nodeep-scrub
    ceph osd set noscrub

    # Bump up osd_max_scrubs on the acting OSDs of this PG.
    ACTING=$(ceph pg "$PG" query | jq -r '.acting[]')
    for OSD in $ACTING
    do
        cmd=( ceph tell osd.${OSD} injectargs -- --osd_max_scrubs=3 --osd_scrub_during_recovery=true )
        echo "executing: ${cmd[@]}"
        "${cmd[@]}"
    done

    # Kick off the repair (which runs a deep-scrub of the PG).
    ceph pg repair "$PG"
    sleep 10

    # Restore the default scrub settings on the acting OSDs.
    for OSD in $ACTING
    do
        cmd=( ceph tell osd.${OSD} injectargs -- --osd_max_scrubs=1 --osd_scrub_during_recovery=false )
        echo "executing: ${cmd[@]}"
        "${cmd[@]}"
    done

    # Re-enable regular scrubs.
    ceph osd unset nodeep-scrub
    ceph osd unset noscrub
done
=====================

You can also just wait for the regular deep-scrub to happen.
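If you want to sanity-check just your one PG before running the whole script, a rough manual equivalent would be something like this (using the PG id 2.87 from your logs; treat the exact JSON field names as an assumption, they can differ slightly between Ceph releases):

# When was 2.87 last deep-scrubbed? A stamp from before your disk
# replacement means the "inconsistent" state is simply stale.
ceph pg 2.87 query | jq -r '.info.stats.last_deep_scrub_stamp'

# Request a fresh deep-scrub; if the reservation keeps getting
# cancelled, fall back to the script above.
ceph pg deep-scrub 2.87

# Once the deep-scrub has finished, this should show real per-shard
# errors instead of "no scrub data available".
rados list-inconsistent-obj 2.87 --format=json-pretty

If the deep-scrub comes back clean, the error count should drop on its own; if it still reports errors, "ceph pg repair 2.87" (or the script) is the next step.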
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Alexander E. Patrakov <patrakov@xxxxxxxxx>
Sent: Wednesday, June 28, 2023 5:24 AM
To: Niklas Hambüchen
Cc: ceph-users@xxxxxxx
Subject: Re: 1 pg inconsistent and does not recover

Hello Niklas,

The explanation looks plausible. What you can do is try extracting the PG from the dead OSD disk (please make absolutely sure that the OSD daemon is stopped!!!) and reinjecting it into some other OSD (again, stop the daemon during this procedure). This extra copy should act as an arbiter. The relevant commands are:

systemctl stop ceph-osd@2
systemctl stop ceph-osd@3    # or whatever other OSD exists on the same host
systemctl mask ceph-osd@2
systemctl mask ceph-osd@3
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 2.87 --op export --file /some/local/storage/pg-2.87.exp
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3/ --type bluestore --pgid 2.87 --op import --file /some/local/storage/pg-2.87.exp
systemctl unmask ceph-osd@3
systemctl start ceph-osd@3
systemctl unmask ceph-osd@2
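Before the export, it may be worth confirming that the failed disk can still serve up that PG at all; a quick check along these lines (with the OSD daemon still stopped and the same data path as above; I have not verified this on your exact release) should tell you:

# List the PGs present on the stopped OSD's object store; 2.87 should appear.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --op list-pgs | grep '^2\.87$'

If 2.87 is listed, the export has a good chance of succeeding; if the tool cannot even open the store, the arbiter-copy approach is not available and you are left with deep-scrub/repair on the two surviving replicas.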
On Wed, Jun 28, 2023 at 8:31 AM Niklas Hambüchen <mail@xxxxxx> wrote:
>
> Hi Alvaro,
>
> > Can you post the entire Ceph status output?
>
> Pasting here since it is short:
>
>   cluster:
>     id:     d9000ec0-93c2-479f-bd5d-94ae9673e347
>     health: HEALTH_ERR
>             1 scrub errors
>             Possible data damage: 1 pg inconsistent
>
>   services:
>     mon: 3 daemons, quorum node-4,node-5,node-6 (age 52m)
>     mgr: node-5(active, since 7d), standbys: node-6, node-4
>     mds: 1/1 daemons up, 2 standby
>     osd: 36 osds: 36 up (since 5d), 36 in (since 6d)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   3 pools, 832 pgs
>     objects: 506.83M objects, 67 TiB
>     usage:   207 TiB used, 232 TiB / 439 TiB avail
>     pgs:     826 active+clean
>              5   active+clean+scrubbing+deep
>              1   active+clean+inconsistent
>
>   io:
>     client:   18 MiB/s wr, 0 op/s rd, 5 op/s wr
>
> > sometimes list-inconsistent-obj throws that error if a scrub job is still running.
>
> This would be surprising to me, because I did the disk replacement of the broken OSD "2" already 7 days ago, and "list-inconsistent-obj" has not worked at any time since then.
>
> > grep -Hn 'ERR' /var/log/ceph/ceph-osd.33.log
>
> /var/log/ceph/ceph-osd.33.log:8005229:2023-06-16T16:29:57.704+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 shard 2 soid 2:e18c2025:::1001c78d046.00000000:head : candidate had a read error
> /var/log/ceph/ceph-osd.33.log:8018716:2023-06-16T20:03:26.923+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 0 missing, 1 inconsistent objects
> /var/log/ceph/ceph-osd.33.log:8018717:2023-06-16T20:03:26.923+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 1 errors
>
> The time "2023-06-16T16:29:57" above is the time at which the disk that carried OSD "2" broke; its logs around that time are:
>
> /var/log/ceph/ceph-osd.2.log:7855741:2023-06-16T16:29:57.690+0000 7fbae3cf7640 -1 bdev(0x7fbaeef6c400 /var/lib/ceph/osd/ceph-2/block) _aio_thread got r=-5 ((5) Input/output error)
> /var/log/ceph/ceph-osd.2.log:7855743:2023-06-16T16:29:57.690+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8df449f9:::10016e7a962.00000000:head, will try copies on 19,32
> /var/log/ceph/ceph-osd.2.log:7855747:2023-06-16T16:29:57.691+0000 7fba63064640 -1 log_channel(cluster) log [ERR] : 2.a6 missing primary copy of 2:65bd8cda:::10016ea4e67.00000000:head, will try copies on 17,28
> -- note time jump by 3 days --
> /var/log/ceph/ceph-osd.2.log:8096330:2023-06-19T06:42:48.712+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
> /var/log/ceph/ceph-osd.2.log:8108684: -1867> 2023-06-19T06:42:48.712+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
> /var/log/ceph/ceph-osd.2.log:8108766: -1785> 2023-06-19T06:42:49.035+0000 7fba6d879640 10 log_client will send 2023-06-19T06:42:48.713712+0000 osd.2 (osd.2) 179 : cluster [ERR] 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
> /var/log/ceph/ceph-osd.2.log:8108770: -1781> 2023-06-19T06:42:49.525+0000 7fba7787f640 10 log_client logged 2023-06-19T06:42:48.713712+0000 osd.2 (osd.2) 179 : cluster [ERR] 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
> /var/log/ceph/ceph-osd.2.log:8111339:2023-06-19T06:51:13.940+0000 7fb1518126c0 -1 ** ERROR: osd init failed: (5) Input/output error
>
> Does "candidate had a read error" on OSD "33" mean that a BlueStore checksum error was detected on OSD "33" at the same time as the OSD "2" disk failed?
> If yes, maybe that is the explanation:
>
> * pg 2.87 is backed by OSDs [33,2,20]; OSD 2's hardware broke during the scrub, OSD 33 detected a checksum error during the scrub, and thus we have 2 OSDs left (33 and 20) whose checksums disagree.
>
> I am just guessing this, though.
> Also, if this is correct, the next question would be: what about OSD 20?
> Since there is no error reported at all for OSD 20, I assume that its checksum agrees with its data.
> Now, can I find out whether OSD 20's checksum agrees with OSD 33's data?
>
> (Side note: The disk of OSD 33 looks fine in smartctl.)
>
> Thanks,
> Niklas
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx


--
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx