I also have a question: among the 9 OSDs, some were never restarted. In theory, those OSDs should still retain the object's information (metadata, pg_log, etc.), even if the object itself cannot be recovered. I have sorted out the boot logs of the OSDs where the object should be located, together with the PG peering process:
Original Email
Sent: Wednesday, 31 July 2024 05:49
To: ceph-users
Subject: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool
1. The incident occurred at 04:55. At 05:01 we set the noout, nobackfill, and norecover flags, and at 06:22 we executed `ceph osd pause`. By 07:23 all OSDs were up and in, and we then executed `ceph osd unpause` (the exact commands are sketched after this list).
2. We randomly selected a problematic file and attempted to download it via the S3 API. The RGW returned "No such key".
3. The RGW log for the download showed op status=-2, http_status=200. We also checked the upload logs, which showed that at 2024-07-13 04:19:20.052 the upload completed with op status=0, http_status=200.
4. We set debug_rgw=20 and attempted to download the file again. We found that one 4M chunk (the file is 64M) failed to be retrieved.
5. Using rados get for this chunk returned: "No such file or directory".
6. Setting debug_osd=20, we observed get_object_context: obc NOT found in cache.
7. Setting debug_bluestore=20, we saw get_onode oid xxx, key xxx != '0xfffffffffffffffeffffffffffffffff'o'.
8. We stopped the primary OSD and tried to get the file again, but the result was the same. The object’s corresponding PG state was active+recovery_wait+degraded.
9. Using `ceph-objectstore-tool` with `--op list` and `--op log`, we could not find any information about the object. Dumping keys with `ceph-kvstore-tool rocksdb` did not reveal anything new either (see the sketch after this list).
10. If an OSD had lost data, the PG state should have been unfound or inconsistent.
11. We started reanalyzing the startup logs of the OSDs related to the PG. The pool uses erasure coding 6+3, so the PG spans 9 OSDs. Six of these OSDs had restarted, and after peering the PG state became active.
12. We sorted the lost files by upload time; all of them were uploaded before the failure occurred. The earliest upload was around 1 a.m., and the successful upload records can be found in the RGW logs.
13. We have submitted an issue on the Ceph tracker (https://tracker.ceph.com/issues/66942), which includes the original logs needed for troubleshooting. However, four days have passed without any response, so in desperation we are sending this email, hoping that someone from the Ceph team can guide us as soon as possible.
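For reference, the commands behind steps 1 and 4-7 were along the following lines. This is only a sketch: the OSD ids, the pool name, and the object name are placeholders, not the actual values from our cluster.

  # step 1: cluster flags set right after the incident, reverted afterwards
  ceph osd set noout; ceph osd set nobackfill; ceph osd set norecover
  ceph osd pause                                    # reverted later with: ceph osd unpause
  # step 4: raise RGW debug logging, then retry the S3 download
  ceph config set client.rgw debug_rgw 20
  # step 5: try to fetch the missing 4M chunk directly from RADOS
  rados -p <data-pool> get <chunk-object-name> /tmp/chunk.out
  # steps 6-7: raise OSD and BlueStore debug logging on the acting set
  ceph tell osd.<id> config set debug_osd 20
  ceph tell osd.<id> config set debug_bluestore 20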
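The offline checks in steps 8 and 9 looked roughly like this; again only a sketch, with the OSD id, data path, and PG id as placeholders, and the kvstore path depending on the deployment.

  # step 8: check the PG state, then stop the primary OSD
  ceph pg <pgid> query
  systemctl stop ceph-osd@<id>
  # step 9: look for the object and its pg_log entries on the stopped OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> --op list
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> --op log
  ceph-kvstore-tool rocksdb <path-to-db> list       # or: bluestore-kv <osd-data-path> list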
We are currently in a difficult situation and hope you can provide guidance. Thank you.
Best regards.
wu_chulin@xxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx