default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
# rados -p .rgw.buckets get default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d testfile
error getting .rgw.buckets/default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d: (2) No such file or directory
# rados -p .rgw.buckets rm default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
# rados -p .rgw.buckets ls | grep -i "sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
Last time I had an inconsistent PG that could not be repaired using the repair command, I looked at which OSDs hosted the PG, then restarted them one by one (usually stopping, waiting a few seconds, then starting them back up). You could also stop them, flush the journal, then start them back up.
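As a rough sketch of that restart cycle (a minimal example, not a recipe: the PG id is the one from this thread, the OSD number is illustrative, and the systemd unit names assume a systemd-managed Jewel/filestore install):

```shell
# Find which OSDs host the PG (its acting set), e.g. for PG 26.c3f:
ceph pg map 26.c3f

# Restart each acting OSD one at a time, letting the cluster settle in between
systemctl stop ceph-osd@36
# Optional: flush the journal while the OSD is down (filestore only)
ceph-osd -i 36 --flush-journal
sleep 10
systemctl start ceph-osd@36

# Wait for the PG to go back to active+clean before touching the next OSD
ceph health detail
```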
If that didn't work, it meant there was data loss and I had to use ceph-objectstore-tool to export the objects from a location that had the latest data and import them into the one that had no data. ceph-objectstore-tool is not a simple thing, though, and should not be used lightly. When I say data loss, I mean that Ceph thinks the last place written has the data, that place being the OSD that doesn't actually have the data (meaning it failed to write there).
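For reference, the export/import dance looks roughly like this (a minimal sketch under assumptions, not the full how-to: OSD ids, paths, and the PG id below are illustrative for a filestore deployment, and both OSDs must be stopped first):

```shell
# Stop both OSDs that host the PG before touching their object stores
systemctl stop ceph-osd@30 ceph-osd@36

# Export the PG from the OSD believed to hold the good data
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
  --journal-path /var/lib/ceph/osd/ceph-30/journal \
  --pgid 26.c3f --op export --file /root/pg26.c3f.export

# On the OSD with the bad/missing data: remove its copy, then import the export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
  --journal-path /var/lib/ceph/osd/ceph-36/journal \
  --pgid 26.c3f --op remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
  --journal-path /var/lib/ceph/osd/ceph-36/journal \
  --op import --file /root/pg26.c3f.export

systemctl start ceph-osd@30 ceph-osd@36
```

Keep the export file until the cluster is healthy again; as noted above, this is a last resort.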
If you want to go that route, let me know; I wrote a how-to on it. It should be the last resort, though. I also don't know your setup, so I would hate to recommend something so drastic.
-Brent
From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Arvydas Opulskis
Sent: Monday, August 6, 2018 4:12 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Inconsistent PG could not be repaired
Hi again,
after two weeks I've got another inconsistent PG in the same cluster. The OSDs are different from the first PG, and the object cannot be GET either:
# rados list-inconsistent-obj 26.821 --format=json-pretty
{
"epoch": 178472,
"inconsistents": [
{
"object": {
"name": "default.122888368.52__shadow_
.3ubGZwLcz0oQ55-LTb7PCOTwKkv- nQf_7", "nspace": "",
"locator": "",
"snap": "head",
"version": 118920
},
"errors": [],
"union_shard_errors": [
"data_digest_mismatch_oi"
],
"selected_object_info": "26:8411bae4:::default.
122888368.52__shadow_. 3ubGZwLcz0oQ55-LTb7PCOTwKkv- nQf_7:head(126495'118920 client.142609570.0:41412640 dirty|data_digest|omap_digest s 4194304 uv 118920 dd cd142aaa od ffffffff alloc_hint [0 0])", "shards": [
{
"osd": 20,
"errors": [
"data_digest_mismatch_oi"
],
"size": 4194304,
"omap_digest": "0xffffffff",
"data_digest": "0x6b102e59"
},
{
"osd": 44,
"errors": [
"data_digest_mismatch_oi"
],
"size": 4194304,
"omap_digest": "0xffffffff",
"data_digest": "0x6b102e59"
}
]
}
]
}
# rados -p .rgw.buckets get default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7 test_2pg.file
error getting .rgw.buckets/default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7: (5) Input/output error
Still struggling with how to solve it. Any ideas, guys?
Thank you
On Tue, Jul 24, 2018 at 10:27 AM, Arvydas Opulskis <zebediejus@xxxxxxxxx> wrote:
Hello, Cephers,
after trying different repair approaches, I am out of ideas on how to repair an inconsistent PG. I hope someone's sharp eye will notice what I overlooked.
Some info about cluster:
Centos 7.4
Jewel 10.2.10
Pool size 2 (yes, I know it's a very bad choice)
Pool with inconsistent PG: .rgw.buckets
After a routine deep-scrub I found PG 26.c3f in inconsistent status. While running the "ceph pg repair 26.c3f" command and monitoring the "ceph -w" log, I noticed these errors:
2018-07-24 08:28:06.517042 osd.36 [ERR] 26.c3f shard 30: soid 26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head data_digest 0x540e4f8b != data_digest 0x49a34c1f from auth oi 26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 49a34c1f od ffffffff alloc_hint [0 0])
2018-07-24 08:28:06.517118 osd.36 [ERR] 26.c3f shard 36: soid 26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head data_digest 0x540e4f8b != data_digest 0x49a34c1f from auth oi 26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 49a34c1f od ffffffff alloc_hint [0 0])
2018-07-24 08:28:06.517122 osd.36 [ERR] 26.c3f soid 26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head: failed to pick suitable auth object
...and the same errors about another object in the same PG.
Repair failed, so I checked the inconsistencies with "rados list-inconsistent-obj 26.c3f --format=json-pretty":
{
"epoch": 178403,
"inconsistents": [
{
"object": {
"name": "default.142609570.87_
20180203.020047\/repositories\ /docker-local\/yyy\/company. yyy.api.assets\/1.2.4\/sha256_ _ ce41e5246ead8bddd2a2b5bbb863db 250f328be9dc5c3041481d778a32f8 130d", "nspace": "",
"locator": "",
"snap": "head",
"version": 217749
},
"errors": [],
"union_shard_errors": [
"data_digest_mismatch_oi"
],
"selected_object_info": "26:f4ce1748:::default.
168602061.10_team-xxx.xxx- jobs.H6.HADOOP.data- segmentation.application.131. xxx-jvm.cpu.load%2f2018-05- 08T03%3a45%3a15+00%3a00.sha1: head(167944'217749 client.177936559.0:1884719302 dirty|data_digest|omap_digest s 40 uv 217749 dd 422f251b od ffffffff alloc_hint [0 0])", "shards": [
{
"osd": 30,
"errors": [
"data_digest_mismatch_oi"
],
"size": 40,
"omap_digest": "0xffffffff",
"data_digest": "0x551c282f"
},
{
"osd": 36,
"errors": [
"data_digest_mismatch_oi"
],
"size": 40,
"omap_digest": "0xffffffff",
"data_digest": "0x551c282f"
}
]
},
{
"object": {
"name": "default.142609570.87_
20180206.093111\/repositories\ /nuget-local\/Application\/ Company.Application.Api\/ Company.Application.Api.1.1.1. nupkg.artifactory-metadata\/ properties.xml", "nspace": "",
"locator": "",
"snap": "head",
"version": 216051
},
"errors": [],
"union_shard_errors": [
"data_digest_mismatch_oi"
],
"selected_object_info": "26:e261561a:::default.
168602061.10_team-xxx.xxx- jobs.H6.HADOOP.data- segmentation.application.131. xxx-jvm.cpu.load%2f2018-05- 05T03%3a51%3a39+00%3a00.sha1: head(167828'216051 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 49a34c1f od ffffffff alloc_hint [0 0])", "shards": [
{
"osd": 30,
"errors": [
"data_digest_mismatch_oi"
],
"size": 40,
"omap_digest": "0xffffffff",
"data_digest": "0x540e4f8b"
},
{
"osd": 36,
"errors": [
"data_digest_mismatch_oi"
],
"size": 40,
"omap_digest": "0xffffffff",
"data_digest": "0x540e4f8b"
}
]
}
]
}
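For long listings like the one above, the interesting fields can be pulled out with jq (assuming jq is installed on the admin node; the PG id is the one from this thread):

```shell
# Summarize each inconsistent object: name, then per-shard size/digest/errors
rados list-inconsistent-obj 26.c3f --format=json | jq -r '
  .inconsistents[] |
  .object.name,
  (.shards[] | "  osd.\(.osd): size=\(.size) dd=\(.data_digest) errors=\(.errors | join(","))")'
```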
After some reading, I understood I needed the rados get/put trick to solve this problem. I couldn't do "rados get", because I was getting a "no such file" error even though the objects were listed by the "rados ls" command, so I got them directly from an OSD. After putting them back to rados (the rados commands didn't return any errors) and running a deep-scrub on the same PG, the problem still existed. The only thing that changed: when I try to get the object via rados, I now get "(5) Input/output error".
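The get-directly-from-OSD trick described above can be sketched as follows (a hypothetical illustration for a filestore OSD: paths and the OSD id are assumptions, and on disk the object name is hashed/escaped, so it is easiest to search by a unique fragment):

```shell
# Find the on-disk file for the object inside the PG's head directory
find /var/lib/ceph/osd/ceph-36/current/26.c3f_head/ -name '*properties.xml*'

# Copy the found file out of the OSD, then overwrite the rados object with it
cp '<path printed by find>' /tmp/recovered.obj
rados -p .rgw.buckets put \
  'default.142609570.87_20180206.093111/repositories/nuget-local/Application/Company.Application.Api/Company.Application.Api.1.1.1.nupkg.artifactory-metadata/properties.xml' \
  /tmp/recovered.obj
```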
I tried to force the object size to 40 (the real size of both objects) by adding the "-o 40" option to the "rados put" command, but with no luck.
Guys, maybe you have other ideas of what to try? Why doesn't overwriting the object solve this problem?
Thanks a lot!
Arvydas
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com