Hi,

The first thing that comes to mind with data unavailability or inconsistencies after a power outage is that some dirty data was lost along the I/O path before it reached persistent storage. This can happen, for example, with non-enterprise-grade SSDs that use a non-persistent cache, or with HDDs whose disk write buffer was left enabled.

That said, have you tried to deep-scrub the PG from which you can't retrieve the data? What is the status of that PG now? Did it recover?

Regards,
Frédéric.

________________________________
From: wu_chulin@xxxxxx
Sent: Wednesday, July 31, 2024 05:49
To: ceph-users
Subject: Please guide us in identifying the cause of the data loss in the EC pool

Dear Ceph team:

On July 13th at 4:55 AM, our Ceph cluster suffered a major power outage in the data center, causing a large number of OSDs to power off and restart (total: 1172, down: 821). Approximately two hours later, all OSDs had started successfully and the cluster resumed service. However, around 6 PM the business department reported that some files which had been written successfully (via the RGW service) could no longer be downloaded, and the number of such files was significant. Consequently, we began a series of investigations:

1. The incident occurred at 04:55. At 05:01 we set the noout, nobackfill, and norecover flags. At 06:22 we executed `ceph osd pause`. By 07:23 all OSDs were UP & IN, and we then executed `ceph osd unpause`.
2. We randomly selected a problematic file and attempted to download it via the S3 API. RGW returned "No such key".
3. The RGW download logs showed op status=-2, http status=200. We also checked the upload logs, which showed 2024-07-13 04:19:20.052, op status=0, http_status=200.
4. We set debug_rgw=20 and attempted to download the file again. We found that one 4M chunk (the file is 64M) could not be fetched.
5. Using `rados get` on that chunk returned "No such file or directory".
6. With debug_osd=20, we observed "get_object_context: obc NOT found in cache".
7. With debug_bluestore=20, we saw "get_onode oid xxx, key xxx != '0xfffffffffffffffeffffffffffffffff'o'".
8. We stopped the primary OSD and tried to get the object again, but the result was the same. The object's PG state was active+recovery_wait+degraded.
9. Using `ceph-objectstore-tool --op list` and `--op log`, we could not find any information about the object. The `ceph-kvstore-tool rocksdb` commands also did not reveal anything new.
10. If an OSD had lost data, we would expect the PG state to be unfound or inconsistent.
11. We started reanalyzing the startup logs of the OSDs belonging to this PG. The pool uses an erasure-code 6+3 profile (9 OSDs per PG). Six of these OSDs had restarted, and after peering the PG state became active.
12. We went through the lost files: all of them had been uploaded before the failure occurred. The earliest upload time was around 1 AM, and the successful upload records can be found in the RGW log.
13. We have submitted an issue on the Ceph issue tracker, https://tracker.ceph.com/issues/66942, which includes the original logs needed for troubleshooting. However, four days have passed without any response.

In desperation, we are sending this email, hoping that someone from the Ceph team can guide us as soon as possible. We are currently in a difficult situation and hope you can provide guidance. Thank you.

Best regards,
wu_chulin@xxxxxx
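P.S. In case it helps, here is a rough sketch of the checks I had in mind. <pgid>, <pool>, <object>, and the OSD id are placeholders for your actual values; adjust the data path to your deployment, and note that the ceph-objectstore-tool step requires the OSD to be stopped:

  # Cluster-wide view and current state of the suspect PG
  ceph health detail
  ceph pg <pgid> query

  # Deep-scrub the PG, then look for inconsistent or unfound objects
  ceph pg deep-scrub <pgid>
  rados list-inconsistent-obj <pgid> --format=json-pretty
  ceph pg <pgid> list_unfound

  # Re-check the missing chunk from the client side
  rados -p <pool> stat <object>
  rados -p <pool> get <object> /tmp/chunk.out

  # Offline check of the PG contents on one of its OSDs (stop the OSD first)
  systemctl stop ceph-osd@<id>
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> --op list
  systemctl start ceph-osd@<id>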
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx