Hi Eugen, thank you for the reply.
The OSDs were drained over the weekend, so OSD 223 and 269 now hold only
the problematic PG 404.bc.
I don't think moving the PG would help, since I don't have any empty
OSDs to move it to, and a move would not fix the hash mismatch.
The reason I want only the problematic PG left on those OSDs is to
reduce recovery time.
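For reference, that only 404.bc is left on them should be verifiable with
something like
$ ceph pg ls-by-osd 223
$ ceph pg ls-by-osd 269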
I would need to set min_size to 4 on the EC 4+2 pool, and stop them both at
the same time to force a rebuild of the corrupted parts of the PG on
osd 223 and 269, since repair doesn't fix it.
I'm debating with myself whether I should
1. stop both OSD 223 and 269, or
2. stop just one of them.
If I stop them both, I'm guaranteed that the parts of the PG on 223 and 269
are rebuilt from the 4 others, 297, 276, 136 and 197, which don't have any
errors.
OSD 223 is the primary in the EC PG, pg 404.bc acting
[223,297,269,276,136,197].
So maybe I should stop just that one, wait for recovery, and then run a
deep-scrub to check if things look better.
But would it then use the corrupted data on osd 269 to rebuild?
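Roughly the sequence I have in mind for option 2, assuming cephadm-managed
OSDs (untested, just a sketch):

$ ceph osd pool set default.rgw.buckets.data min_size 4  # allow I/O with 4 of 6 shards
$ ceph orch daemon stop osd.223                          # stop only the primary
  ... wait for recovery of 404.bc to finish ...
$ ceph orch daemon start osd.223
$ ceph pg deep-scrub 404.bc                              # check if the errors are gone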
-
Kai Stian Olstad
On 26.02.2024 10:19, Eugen Block wrote:
Hi,
I think your approach makes sense. But I'm wondering if moving only
the problematic PGs to different OSDs could have an effect as well. I
assume that moving the 2 PGs is much quicker than moving all BUT those
2 PGs. If that doesn't work you could still fall back to draining the
entire OSDs (except for the problematic PG).
Regards,
Eugen
Quoting Kai Stian Olstad <ceph+list@xxxxxxxxxx>:
Hi,
Does no one have any comments at all?
I'm not picky, so any speculation or guessing, "I would", "I wouldn't",
"should work" and so on would be highly appreciated.
Since 4 out of 6 shards in the EC 4+2 are OK and ceph pg repair doesn't
solve it, I think the following might work.
pg 404.bc acting [223,297,269,276,136,197]
- Use pgremapper to move all PGs on OSD 223 and 269, except 404.bc, to
other OSDs.
- Set min_size to 4: ceph osd pool set default.rgw.buckets.data
min_size 4
- Stop osd 223 and 269
What I hope will happen is that Ceph then recreates the 404.bc shards
s0(osd.223) and s2(osd.269), since they are now down, from the
remaining shards
s1(osd.297), s3(osd.276), s4(osd.136) and s5(osd.197).
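If that works, where the rebuilt shards end up should be visible with
something like (jq only for readability):

$ ceph pg map 404.bc
$ ceph pg 404.bc query | jq '.state, .up, .acting'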
_Any_ comment is highly appreciated.
-
Kai Stian Olstad
On 21.02.2024 13:27, Kai Stian Olstad wrote:
Hi,
Short summary
PG 404.bc is an EC 4+2 where s0 and s2 report a hash mismatch for 698
objects.
Ceph pg repair doesn't fix it, because if you run deep-scrub on the
PG after the repair has finished, it still reports scrub errors.
Why can't ceph pg repair fix this? With 4 out of 6 shards intact it
should be able to reconstruct the corrupted shards.
Is there a way to fix this? Like deleting the s0 and s2 shards of the
objects so Ceph is forced to recreate them?
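If deleting the shards by hand is the way, I guess ceph-objectstore-tool on
the stopped OSD might be an option, roughly like below, but the data path
and object spec are placeholders and the exact syntax would need to be
double-checked:

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-223 --pgid 404.bcs0 --op list
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-223 --pgid 404.bcs0 '<object-json-from-list>' remove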
Long detailed summary
A short backstory.
* This is the aftermath of problems with mclock, post "17.2.7:
Backfilling deadlock / stall / stuck / standstill" [1].
- 4 OSDs had a few bad sectors; all 4 were set out and the cluster stopped.
- The solution was to switch from mclock to wpq and restart all OSDs.
- When all backfilling was finished, all 4 OSDs were replaced.
- osd.223 and osd.269 were 2 of the 4 OSDs that were replaced.
PG / pool 404 is EC 4+2 default.rgw.buckets.data.
9 days after osd.223 and osd.269 were replaced, a deep-scrub was run
and reported errors
ceph status
-----------
HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg
inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+inconsistent, acting
[223,297,269,276,136,197]
I then ran a repair
ceph pg repair 404.bc
And ceph status showed this
ceph status
-----------
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
But osd.223 and osd.269 are new disks, and the disks have no SMART
errors or any I/O errors in the OS logs.
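(The kind of checks meant here, device path being a placeholder:
$ ceph device ls-by-daemon osd.223
$ smartctl -a /dev/sdX
$ journalctl -k | grep -i 'i/o error')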
So I tried to run deep-scrub again on the PG.
ceph pg deep-scrub 404.bc
And got this result.
ceph status
-----------
HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs;
Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+scrubbing+deep+inconsistent+repair,
acting [223,297,269,276,136,197]
698 + 698 = 1396, so the same number of errors.
I ran repair again on 404.bc and ceph status is
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 1396 reads repaired
osd.269 had 1396 reads repaired
So even when the repair finishes it doesn't fix the problem, since the
errors reappear after a deep-scrub.
The logs for osd.223 and osd.269 contain "got incorrect hash on read"
and "candidate had an ec hash mismatch" for 698 unique objects.
But I only show the logs for 1 of the 698 objects; the logs are the
same for the other 697 objects.
osd.223 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)
-----------
Feb 20 10:31:00 ceph-hd-003 ceph-osd[3665432]: osd.223 pg_epoch:
231235 pg[404.bcs0( v 231235'1636919 (231078'1632435,231235'1636919]
local-lis/les=226263/226264 n=296580 ec=36041/27862
lis/c=226263/226263 les/c/f=226264/230954/0 sis=226263)
[223,297,269,276,136,197]p223(0) r=0 lpr=226263 crt=231235'1636919
lcod 231235'1636918 mlcod 231235'1636918
active+clean+scrubbing+deep+inconsistent+repair [ 404.bcs0:
REQ_SCRUB ] MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned
REQ_SCRUB] _scan_list
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
got incorrect hash on read 0xc5d1dd1b != expected 0x7c2f86d7
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]:
log_channel(cluster) log [ERR] : 404.bc shard 223(0) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]:
log_channel(cluster) log [ERR] : 404.bc shard 269(2) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003
ceph-b321e76e-da3a-11eb-b75c-4f948441dcd0-osd-223[3665427]:
2024-02-20T10:31:01.117+0000 7f128a88d700 -1 log_channel(cluster)
log [ERR] : 404.bc shard 223(0) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003
ceph-b321e76e-da3a-11eb-b75c-4f948441dcd0-osd-223[3665427]:
2024-02-20T10:31:01.117+0000 7f128a88d700 -1 log_channel(cluster)
log [ERR] : 404.bc shard 269(2) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
osd.269 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)
-----------
Feb 20 10:31:00 ceph-hd-001 ceph-osd[3656897]: osd.269 pg_epoch:
231235 pg[404.bcs2( v 231235'1636919 (231078'1632435,231235'1636919]
local-lis/les=226263/226264 n=296580 ec=36041/27862
lis/c=226263/226263 les/c/f=226264/230954/0 sis=226263)
[223,297,269,276,136,197]p223(0) r=2 lpr=226263 luod=0'0
crt=231235'1636919 mlcod 231235'1636919 active mbc={}] _scan_list
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
got incorrect hash on read 0x7c0871dc != expected 0xcf6f4c58
The logs for the other OSDs in the PG, osd.297, osd.276, osd.136 and
osd.197, don't show any errors.
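The same per-object detail, without grepping the OSD logs, should also be
available from the last deep-scrub with something like:

$ rados list-inconsistent-obj 404.bc --format=json-pretty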
If I try to get the object it fails
$ s3cmd s3://benchfiles/2021-11-08T19:43:50,145489260+00:00
download: 's3://benchfiles/2021-11-08T19:43:50,145489260+00:00'
-> './2021-11-08T19:43:50,145489260+00:00' [1 of 1]
ERROR: Download of './2021-11-08T19:43:50,145489260+00:00' failed
(Reason: 500 (UnknownError))
ERROR: S3 error: 500 (UnknownError)
And the RGW log shows this
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== starting new
request req=0x7f94b744d660 =====
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: WARNING:
set_req_state_err err_no=5 resorting to 500
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== starting new
request req=0x7f94b6e41660 =====
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== req done
req=0x7f94b744d660 op status=-5 http_status=500 latency=0.020000568s
======
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: beast: 0x7f94b744d660:
110.2.0.46 - test1 [21/Feb/2024:08:27:06.021 +0000] "GET
/benchfiles/2021-11-08T19%3A43%3A50%2C145489260%2B00%3A00 HTTP/1.1"
500 226 - - - latency=0.020000568s
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#OE3QXLAJIY6NU7PNMGHP47UK2CBZJPUG
--
Kai Stian Olstad
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx