Re: cannot repair a handful of damaged pg's

On 06/10/2023 16:09, Simon Oosthoek wrote:
Hi

we're still in HEALTH_ERR state with our cluster; this is the top of the output of `ceph health detail`:

HEALTH_ERR 1/846829349 objects unfound (0.000%); 248 scrub errors; Possible data damage: 1 pg recovery_unfound, 2 pgs inconsistent; Degraded data redundancy: 6/7118781559 objects degraded (0.000%), 1 pg degraded, 1 pg undersized; 63 pgs not deep-scrubbed in time; 657 pgs not scrubbed in time
[WRN] OBJECT_UNFOUND: 1/846829349 objects unfound (0.000%)
     pg 26.323 has 1 unfound objects
[ERR] OSD_SCRUB_ERRORS: 248 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 2 pgs inconsistent
     pg 26.323 is active+recovery_unfound+degraded+remapped, acting [92,109,116,70,158,128,243,189,256], 1 unfound
     pg 26.337 is active+clean+inconsistent, acting [139,137,48,126,165,89,237,199,189]
     pg 26.3e2 is active+clean+inconsistent, acting [12,27,24,234,195,173,98,32,35]
[WRN] PG_DEGRADED: Degraded data redundancy: 6/7118781559 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
     pg 13.3a5 is stuck undersized for 4m, current state active+undersized+remapped+backfilling, last acting [2,45,32,62,2147483647,55,116,25,225,202,240]
     pg 26.323 is active+recovery_unfound+degraded+remapped, acting [92,109,116,70,158,128,243,189,256], 1 unfound


For the PG_DAMAGED pgs I try the usual `ceph pg repair 26.323` etc.; however, the inconsistencies never get resolved.
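Before retrying, I was going to look at what the scrubs actually flagged, along these lines (just a sketch of the standard inconsistency listing, using the pg ids from the health output above):

    # list the objects that deep scrub flagged as inconsistent, per pg
    rados list-inconsistent-obj 26.337 --format=json-pretty
    rados list-inconsistent-obj 26.3e2 --format=json-pretty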

osd.116 is already marked out and is starting to drain. I've tried restarting the OSD processes of the first OSD listed for each PG, but that doesn't get it resolved either.

I guess we should have enough redundancy to get the correct data back, but how can I tell ceph to fix it in order to get back to a healthy state?
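For the unfound object in pg 26.323 I've been looking at the standard commands before doing anything destructive (a sketch; the might_have_unfound section of the query output should show which OSDs were probed):

    # show the unfound object(s) in the pg
    ceph pg 26.323 list_unfound
    # check which OSDs were probed / might still hold the object
    ceph pg 26.323 query | grep -A 20 might_have_unfound

I'm aware of `ceph pg 26.323 mark_unfound_lost revert|delete` as a last resort, but I'd rather not use it while backfilling is still going on.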

I guess this could be related to the number of scrubs going on; I read somewhere that they may interfere with the repair request. I would expect a repair to have priority over scrubs...
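If scrubs really do get in the way, I suppose I could pause scrubbing cluster-wide while the repairs run, along these lines (a sketch; the "not scrubbed in time" warnings will of course keep growing while the flags are set):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph pg repair 26.337
    ceph pg repair 26.3e2
    # once the repairs have finished:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub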

BTW, we're running Pacific for now; we want to upgrade once the cluster is healthy again.

Cheers

/Simon




