Re: 1 pg inconsistent and does not recover

Just for reference for everybody, the original source is

https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh

maintained by Dan van der Ster. The repo itself is a rich source of good tools in general and is worth a look before attempting anything that would take more than 30 minutes of searching and reading documentation.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, June 28, 2023 9:41 AM
To: Alexander E. Patrakov; Niklas Hambüchen
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: 1 pg inconsistent and does not recover

Hi Niklas,

please don't do any of the recovery steps yet! Your problem is almost certainly a non-issue. I had a failed disk with 3 scrub errors, leading to the same "candidate had a read error" messages you have:

ceph status/df/pool stats/health detail at 00:00:06:
  cluster:
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent

After rebuilding the data, it still looked like:

  cluster:
    health: HEALTH_ERR
            2 scrub errors
            Possible data damage: 2 pgs inconsistent

What's the issue here? The issue is that the PGs have not been deep-scrubbed after the rebuild. The reply "no scrub data available" from list-inconsistent-obj is the clue. The right response is not to try a manual repair but to issue a deep-scrub.
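
In its simplest form that would look something like this, using PG 2.87 from further down in this thread (substitute your own PG id):

# returns "no scrub data available" until a fresh deep-scrub has run
rados list-inconsistent-obj 2.87 --format=json-pretty
# request a new deep-scrub of the PG
ceph pg deep-scrub 2.87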

Unfortunately, the command "ceph pg deep-scrub ..." does not really work on its own; the deep-scrub reservation almost always gets cancelled very quickly. I got a script to force a repair/deep-scrub (I don't remember who sent it to me), and it gets the job done:

=====================
#!/bin/bash

[[ -r "/etc/profile.d/ceph.sh" ]] && source "/etc/profile.d/ceph.sh"

for PG in $(ceph pg ls inconsistent -f json | jq -r '.pg_stats[].pgid')
do
   echo "Checking inconsistent PG $PG"
   if ceph pg ls repair | grep -wq "${PG}"
   then
      echo PG $PG is already repairing, skipping
      continue
   fi

   # disable other scrubs
   ceph osd set nodeep-scrub
   ceph osd set noscrub

   # find the acting OSDs for this PG and bump up osd_max_scrubs on them
   ACTING=$(ceph pg $PG query | jq -r '.acting[]')
   for OSD in $ACTING
   do
      cmd=( ceph tell osd.${OSD} injectargs -- --osd_max_scrubs=3 --osd_scrub_during_recovery=true )
      echo "executing: ${cmd[@]}"
      "${cmd[@]}"
   done

   ceph pg repair $PG

   # give the repair reservation time to get started
   sleep 10

   # restore the default scrub settings on the acting OSDs
   for OSD in $ACTING
   do
      cmd=( ceph tell osd.${OSD} injectargs -- --osd_max_scrubs=1 --osd_scrub_during_recovery=false )
      echo "executing: ${cmd[@]}"
      "${cmd[@]}"
   done

   # re-enable other scrubs
   ceph osd unset nodeep-scrub
   ceph osd unset noscrub
done
===================

You can also just wait for the regular deep-scrub to happen.
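
To see whether that has happened yet, you can check the PG's last deep-scrub timestamp, roughly like this (the exact JSON field names may differ a bit between releases):

ceph pg 2.87 query | jq -r '.info.stats.last_deep_scrub_stamp'
# the scrub errors should disappear from here once the deep-scrub is clean
ceph health detail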

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Alexander E. Patrakov <patrakov@xxxxxxxxx>
Sent: Wednesday, June 28, 2023 5:24 AM
To: Niklas Hambüchen
Cc: ceph-users@xxxxxxx
Subject:  Re: 1 pg inconsistent and does not recover

Hello Niklas,

The explanation looks plausible.

What you can do is try extracting the PG from the dead OSD disk
(please make absolutely sure that the OSD daemon is stopped!!!) and
reinjecting it into some other OSD (again, stop the daemon during this
procedure). This extra copy should act as an arbiter.

The relevant commands are:

systemctl stop ceph-osd@2
systemctl stop ceph-osd@3  # or whatever other OSD exists on the same host
systemctl mask ceph-osd@2
systemctl mask ceph-osd@3
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ \
    --pgid 2.87 --op export --file /some/local/storage/pg-2.87.exp
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3/ --type bluestore \
    --pgid 2.87 --op import --file /some/local/storage/pg-2.87.exp
systemctl unmask ceph-osd@3
systemctl start ceph-osd@3
systemctl unmask ceph-osd@2

On Wed, Jun 28, 2023 at 8:31 AM Niklas Hambüchen <mail@xxxxxx> wrote:
>
> Hi Alvaro,
>
> > Can you post the entire Ceph status output?
>
> Pasting here since it is short
>
>      cluster:
>        id:     d9000ec0-93c2-479f-bd5d-94ae9673e347
>        health: HEALTH_ERR
>                1 scrub errors
>                Possible data damage: 1 pg inconsistent
>
>      services:
>        mon: 3 daemons, quorum node-4,node-5,node-6 (age 52m)
>        mgr: node-5(active, since 7d), standbys: node-6, node-4
>        mds: 1/1 daemons up, 2 standby
>        osd: 36 osds: 36 up (since 5d), 36 in (since 6d)
>
>      data:
>        volumes: 1/1 healthy
>        pools:   3 pools, 832 pgs
>        objects: 506.83M objects, 67 TiB
>        usage:   207 TiB used, 232 TiB / 439 TiB avail
>        pgs:     826 active+clean
>                5   active+clean+scrubbing+deep
>                1   active+clean+inconsistent
>
>      io:
>        client:   18 MiB/s wr, 0 op/s rd, 5 op/s wr
>
>
> > sometimes list-inconsistent-obj throws that error if a scrub job is still running.
>
> This would be surprising to me, because I already replaced the disk of the broken OSD "2" 7 days ago, and "list-inconsistent-obj" has not worked at any time since then.
>
> > grep -Hn 'ERR' /var/log/ceph/ceph-osd.33.log
>
>      /var/log/ceph/ceph-osd.33.log:8005229:2023-06-16T16:29:57.704+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 shard 2 soid 2:e18c2025:::1001c78d046.00000000:head : candidate had a read error
>      /var/log/ceph/ceph-osd.33.log:8018716:2023-06-16T20:03:26.923+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 0 missing, 1 inconsistent objects
>      /var/log/ceph/ceph-osd.33.log:8018717:2023-06-16T20:03:26.923+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 1 errors
>
> The time "2023-06-16T16:29:57" above is the time at which the disk that carried OSD "2" broke, its logs around the time are:
>
>      /var/log/ceph/ceph-osd.2.log:7855741:2023-06-16T16:29:57.690+0000 7fbae3cf7640 -1 bdev(0x7fbaeef6c400 /var/lib/ceph/osd/ceph-2/block) _aio_thread got r=-5 ((5) Input/output error)
>      /var/log/ceph/ceph-osd.2.log:7855743:2023-06-16T16:29:57.690+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8df449f9:::10016e7a962.00000000:head, will try copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:7855747:2023-06-16T16:29:57.691+0000 7fba63064640 -1 log_channel(cluster) log [ERR] : 2.a6 missing primary copy of 2:65bd8cda:::10016ea4e67.00000000:head, will try copies on 17,28
>      -- note time jump by 3 days --
>      /var/log/ceph/ceph-osd.2.log:8096330:2023-06-19T06:42:48.712+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:8108684: -1867> 2023-06-19T06:42:48.712+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:8108766: -1785> 2023-06-19T06:42:49.035+0000 7fba6d879640 10 log_client  will send 2023-06-19T06:42:48.713712+0000 osd.2 (osd.2) 179 : cluster [ERR] 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:8108770: -1781> 2023-06-19T06:42:49.525+0000 7fba7787f640 10 log_client  logged 2023-06-19T06:42:48.713712+0000 osd.2 (osd.2) 179 : cluster [ERR] 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:8111339:2023-06-19T06:51:13.940+0000 7fb1518126c0 -1  ** ERROR: osd init failed: (5) Input/output error
>
> Does "candidate had a read error" on OSD "33" mean that a BlueStore checksum error was detected on OSD "33" at the same time as the OSD "2" disk failed?
> If yes, maybe that is the explanation:
>
> * pg 2.87 is backed by OSDs [33,2,20]; OSD 2's hardware broke during the scrub, OSD 33 detected a checksum error during the scrub, and thus we have 2 OSDs left (33 and 20) whose checksums disagree.
>
> I am just guessing this, though.
> Also, if this is correct, the next question would be: what about OSD 20?
> Since there is no error reported at all for OSD 20, I assume that its checksum agrees with its data.
> Now, can I find out whether OSD 20's checksum agrees with OSD 33's data?
>
> (Side note: The disk of OSD 33 looks fine in smartctl.)
>
> Thanks,
> Niklas
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
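
Regarding the last question about OSD 20 vs OSD 33: if you can afford to stop those OSDs briefly (one at a time), one way to compare the surviving copies yourself is to dump the object's bytes with ceph-objectstore-tool from each OSD and checksum them. A rough sketch only, using the object name from your scrub error; repeat the same on osd.20 and compare the checksums:

systemctl stop ceph-osd@33
# print the object's JSON spec as the tool expects it
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33/ --pgid 2.87 \
    --op list 1001c78d046.00000000
# dump the object's bytes using that JSON spec, then restart the OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33/ --pgid 2.87 \
    '<JSON from the previous command>' get-bytes /tmp/obj-osd33
systemctl start ceph-osd@33
sha256sum /tmp/obj-osd33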



--
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



