Re: 1 pg inconsistent and does not recover

Hi Stefan,

We run Octopus. The deep-scrub request is cancelled (almost immediately) if the PG/OSD is already part of another (deep-)scrub or if some peering happens. As far as I understand, the commands osd/pg deep-scrub and pg repair do not create persistent reservations. When you issue such a command, when does the PG actually start scrubbing? As soon as another scrub finishes, or only when its natural turn comes around? Do you monitor the scrub order to confirm that it was the manual command that initiated the scrub?
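For illustration, this is roughly how I check whether a manual request takes effect (a minimal sketch only; the PG id 2.1a is a placeholder and grepping the state column is just one way to watch it):

    # request a deep-scrub on one PG (placeholder PG id)
    ceph pg deep-scrub 2.1a

    # watch whether the PG actually enters a scrubbing+deep state,
    # or whether the request is silently dropped again
    watch -n 5 "ceph pg dump pgs_brief 2>/dev/null | grep '^2\.1a '"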

What I see is that pg repair and pg deep-scrub requests are almost immediately forgotten on our cluster. This is most prominent with the repair command, which can be really hard to get going and to complete. Only an osd deep-scrub seems to have some effect. On the other hand, when I run the script that stops all operations conflicting with the manual reservation, the repair/deep-scrub actually starts on request.
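To give an idea of the approach (not the script verbatim, just a minimal sketch; the PG id and sleep times are placeholders, and whether an operator-initiated repair is still honoured while the flags are set differs between releases):

    #!/bin/bash
    # Sketch: stop scheduled scrubs from competing for reservations,
    # then issue the manual repair and wait for it to actually start.
    PG=2.1a

    ceph osd set noscrub        # block new scheduled shallow scrubs
    ceph osd set nodeep-scrub   # block new scheduled deep scrubs
    sleep 600                   # give in-flight scrubs time to drain

    ceph pg repair "$PG"

    # wait until the PG state shows repair/deep-scrub activity
    until ceph pg dump pgs_brief 2>/dev/null | grep "^$PG " | grep -q 'repair\|deep'; do
        sleep 10
    done

    ceph osd unset nodeep-scrub
    ceph osd unset noscrub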

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: Wednesday, June 28, 2023 9:54 AM
To: Frank Schilder; Alexander E. Patrakov; Niklas Hambüchen
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: 1 pg inconsistent and does not recover

On 6/28/23 09:41, Frank Schilder wrote:
> Hi Niklas,
>
> please don't do any of the recovery steps yet! Your problem is almost certainly a non-issue. I had a failed disk with 3 scrub errors, leading to the candidate read error messages you see:
>
> ceph status/df/pool stats/health detail at 00:00:06:
>    cluster:
>      health: HEALTH_ERR
>              3 scrub errors
>              Possible data damage: 3 pgs inconsistent
>
> After rebuilding the data, it still looked like:
>
>    cluster:
>      health: HEALTH_ERR
>              2 scrub errors
>              Possible data damage: 2 pgs inconsistent
>
> What's the issue here? The issue is that the PGs have not been deep-scrubbed after the rebuild. The reply "no scrub data available" from list-inconsistent is the clue. The response to that is not to attempt a manual repair but to issue a deep-scrub.
>
> Unfortunately, the command "ceph pg deep-scrub ..." does not really work; the deep-scrub reservation almost always gets cancelled very quickly.

On which Ceph version do you have this issue? We use this command every
day, hundreds of times, and it always works.

Or is this an issue when you have a degraded cluster?

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



