Hi Eugen, thank you for the reply.
The OSDs were drained over the weekend, so OSD 223 and 269 now hold only
the problematic PG 404.bc.
I don't think moving the PG would help, since I don't have any empty
OSDs to move it to, and a move would not fix the hash mismatch.
The reason I want only the problematic PG left on those OSDs is to
reduce recovery time.
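For reference, that only 404.bc is left on them should be verifiable with
something like
$ ceph pg ls-by-osd 223
$ ceph pg ls-by-osd 269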
I would need to set min_size to 4 on the EC 4+2 pool, and stop them both at
the same time to force a rebuild of the corrupted parts of the PG on
osd 223 and 269, since repair doesn't fix it.
I'm debating with myself whether I should
1. stop both OSD 223 and 269, or
2. stop just one of them.
If I stop them both, I'm guaranteed that the parts of the PG on 223 and 269
are rebuilt from the 4 others, 297, 276, 136 and 197, which don't have any
errors.
OSD 223 is the primary in the EC PG, pg 404.bc acting
[223,297,269,276,136,197].
So maybe I should stop just that one, wait for recovery, and then run a
deep-scrub to check if things look better.
But would it then use the corrupted data on osd 269 to rebuild?
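Roughly the sequence I have in mind for option 2, assuming cephadm-managed
OSDs (untested, just a sketch):

$ ceph osd pool set default.rgw.buckets.data min_size 4  # allow I/O with 4 of 6 shards
$ ceph orch daemon stop osd.223                          # stop only the primary
  ... wait for recovery of 404.bc to finish ...
$ ceph orch daemon start osd.223
$ ceph pg deep-scrub 404.bc                              # check if the errors are gone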
-
Kai Stian Olstad
On 26.02.2024 10:19, Eugen Block wrote:
Hi,
I think your approach makes sense. But I'm wondering if moving only
the problematic PGs to different OSDs could have an effect as well. I
assume that moving the 2 PGs is much quicker than moving all BUT those
2 PGs. If that doesn't work you could still fall back to draining the
entire OSDs (except for the problematic PG).
Regards,
Eugen
Quoting Kai Stian Olstad <ceph+list@xxxxxxxxxx>:
Hi,
Does no one have any comments at all?
I'm not picky, so any speculation or guessing, "I would", "I wouldn't",
"should work" and so on would be highly appreciated.
Since 4 out of 6 shards in the EC 4+2 are OK and ceph pg repair doesn't
solve it, I think the following might work.
pg 404.bc acting [223,297,269,276,136,197]
- Use pgremapper to move all PGs on OSD 223 and 269, except 404.bc, to
other OSDs.
- Set min_size to 4: ceph osd pool set default.rgw.buckets.data
min_size 4
- Stop osd 223 and 269
What I hope will happen is that Ceph then recreates the 404.bc shards
s0(osd.223) and s2(osd.269), since they are now down, from the
remaining shards
s1(osd.297), s3(osd.276), s4(osd.136) and s5(osd.197).
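If that works, where the rebuilt shards end up should be visible with
something like (jq only for readability):

$ ceph pg map 404.bc
$ ceph pg 404.bc query | jq '.state, .up, .acting'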
_Any_ comment is highly appreciated.
-
Kai Stian Olstad
On 21.02.2024 13:27, Kai Stian Olstad wrote:
Hi,
Short summary
PG 404.bc is an EC 4+2 where s0 and s2 report a hash mismatch for 698
objects.
Ceph pg repair doesn't fix it, because if you run deep-scrub on the
PG after the repair has finished, it still reports scrub errors.
Why can't ceph pg repair fix this? With 4 out of 6 shards intact it
should be able to reconstruct the corrupted shards.
Is there a way to fix this? Like deleting the s0 and s2 shards of the
objects so Ceph is forced to recreate them?
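If deleting the shards by hand is the way, I guess ceph-objectstore-tool on
the stopped OSD might be an option, roughly like below, but the data path
and object spec are placeholders and the exact syntax would need to be
double-checked:

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-223 --pgid 404.bcs0 --op list
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-223 --pgid 404.bcs0 '<object-json-from-list>' remove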
Long detailed summary
A short backstory.
* This is the aftermath of problems with mclock, post "17.2.7:
Backfilling deadlock / stall / stuck / standstill" [1].
- 4 OSDs had a few bad sectors; all 4 were set out and the cluster stopped.
- The solution was to switch from mclock to wpq and restart all OSDs.
- When all backfilling was finished, all 4 OSDs were replaced.
- osd.223 and osd.269 were 2 of the 4 OSDs that were replaced.
PG / pool 404 is EC 4+2 default.rgw.buckets.data.
9 days after osd.223 and osd.269 were replaced, a deep-scrub was run
and reported errors
ceph status
-----------
HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg
inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+inconsistent, acting
[223,297,269,276,136,197]
I then ran a repair
ceph pg repair 404.bc
And ceph status showed this
ceph status
-----------
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
But osd.223 and osd.269 are new disks, and the disks have no SMART
errors or any I/O errors in the OS logs.
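(The kind of checks meant here, device path being a placeholder:
$ ceph device ls-by-daemon osd.223
$ smartctl -a /dev/sdX
$ journalctl -k | grep -i 'i/o error')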
So I tried to run deep-scrub again on the PG.
ceph pg deep-scrub 404.bc
And got this result.
ceph status
-----------
HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs;
Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+scrubbing+deep+inconsistent+repair,
acting [223,297,269,276,136,197]
698 + 698 = 1396, so the same number of errors.
I ran repair again on 404.bc and ceph status is
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 1396 reads repaired
osd.269 had 1396 reads repaired
So even when the repair finishes it doesn't fix the problem, since the
errors reappear after a deep-scrub.
The logs for osd.223 and osd.269 contain "got incorrect hash on read"
and "candidate had an ec hash mismatch" for 698 unique objects.
But I only show the logs for 1 of the 698 objects; the logs are the
same for the other 697 objects.
osd.223 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)
-----------
Feb 20 10:31:00 ceph-hd-003 ceph-osd[3665432]: osd.223 pg_epoch:
231235 pg[404.bcs0( v 231235'1636919 (231078'1632435,231235'1636919]
local-lis/les=226263/226264 n=296580 ec=36041/27862
lis/c=226263/226263 les/c/f=226264/230954/0 sis=226263)
[223,297,269,276,136,197]p223(0) r=0 lpr=226263 crt=231235'1636919
lcod 231235'1636918 mlcod 231235'1636918
active+clean+scrubbing+deep+inconsistent+repair [ 404.bcs0:
REQ_SCRUB ] MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned
REQ_SCRUB] _scan_list
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
got incorrect hash on read 0xc5d1dd1b != expected 0x7c2f86d7
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]:
log_channel(cluster) log [ERR] : 404.bc shard 223(0) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]:
log_channel(cluster) log [ERR] : 404.bc shard 269(2) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003
ceph-b321e76e-da3a-11eb-b75c-4f948441dcd0-osd-223[3665427]:
2024-02-20T10:31:01.117+0000 7f128a88d700 -1 log_channel(cluster)
log [ERR] : 404.bc shard 223(0) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003
ceph-b321e76e-da3a-11eb-b75c-4f948441dcd0-osd-223[3665427]:
2024-02-20T10:31:01.117+0000 7f128a88d700 -1 log_channel(cluster)
log [ERR] : 404.bc shard 269(2) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
osd.269 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)
-----------
Feb 20 10:31:00 ceph-hd-001 ceph-osd[3656897]: osd.269 pg_epoch:
231235 pg[404.bcs2( v 231235'1636919 (231078'1632435,231235'1636919]
local-lis/les=226263/226264 n=296580 ec=36041/27862
lis/c=226263/226263 les/c/f=226264/230954/0 sis=226263)
[223,297,269,276,136,197]p223(0) r=2 lpr=226263 luod=0'0
crt=231235'1636919 mlcod 231235'1636919 active mbc={}] _scan_list
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
got incorrect hash on read 0x7c0871dc != expected 0xcf6f4c58
The logs for the other OSDs in the PG, osd.297, osd.276, osd.136 and
osd.197, don't show any errors.
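The same per-object detail, without grepping the OSD logs, should also be
available from the last deep-scrub with something like:

$ rados list-inconsistent-obj 404.bc --format=json-pretty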
If I try to get the object it fails
$ s3cmd s3://benchfiles/2021-11-08T19:43:50,145489260+00:00
download: 's3://benchfiles/2021-11-08T19:43:50,145489260+00:00'
-> './2021-11-08T19:43:50,145489260+00:00' [1 of 1]
ERROR: Download of './2021-11-08T19:43:50,145489260+00:00' failed
(Reason: 500 (UnknownError))
ERROR: S3 error: 500 (UnknownError)
And the RGW log shows this
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== starting new
request req=0x7f94b744d660 =====
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: WARNING:
set_req_state_err err_no=5 resorting to 500
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== starting new
request req=0x7f94b6e41660 =====
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== req done
req=0x7f94b744d660 op status=-5 http_status=500 latency=0.020000568s
======
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: beast: 0x7f94b744d660:
110.2.0.46 - test1 [21/Feb/2024:08:27:06.021 +0000] "GET
/benchfiles/2021-11-08T19%3A43%3A50%2C145489260%2B00%3A00 HTTP/1.1"
500 226 - - - latency=0.020000568s
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#OE3QXLAJIY6NU7PNMGHP47UK2CBZJPUG
--
Kai Stian Olstad
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx