Please guide us in identifying the cause of missing data in an EC pool

Dear Ceph team,

On July 13th at 4:55 AM, the data center hosting our Ceph cluster suffered a major power outage, causing a large number of OSDs to power off and restart (821 of 1172 OSDs were down). Approximately two hours later, all OSDs had started successfully and the cluster resumed service. However, around 6 PM the business department reported that a significant number of files that had previously been written successfully (via the RGW service) could no longer be downloaded. We therefore began the following investigation:


1. The incident occurred at 04:55. At 05:01, we set the noout, nobackfill, and norecover flags. At 06:22, we executed `ceph osd pause`. By 07:23, all OSDs were up and in, and we then executed `ceph osd unpause`.
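
   The flag operations were along the following lines (a sketch; the timings are as described above):

       # 05:01 - keep OSDs in the map and stop automatic backfill/recovery
       ceph osd set noout
       ceph osd set nobackfill
       ceph osd set norecover
       # 06:22 - pause client I/O while OSDs were still coming back
       ceph osd pause
       # 07:23 - all OSDs up and in, resume client I/O
       ceph osd unpause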


2. We randomly selected a problematic file and attempted to download it via the S3 API. The RGW returned "No such key".
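
   The download attempt looked roughly like this (the endpoint, bucket, and key below are placeholders, not the real names):

       # placeholder endpoint/bucket/key for one of the affected files
       aws s3api get-object \
           --endpoint-url http://rgw.example.local:7480 \
           --bucket example-bucket \
           --key path/to/problem-file \
           /tmp/problem-file.out
       # -> An error occurred (NoSuchKey) when calling the GetObject operation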


3. The RGW logs for the failed download showed op status=-2, http_status=200. We also checked the upload logs for the same object, which showed a successful request at 2024-07-13 04:19:20.052 with op status=0, http_status=200.
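
   We located these entries by grepping the RGW log for the object name, roughly as below (the object name is a placeholder and the log path depends on the deployment, so it is only an assumption here):

       # failed download attempt
       grep 'path/to/problem-file' /var/log/ceph/ceph-client.rgw.*.log | grep 'op status'
       # original upload on 2024-07-13, around 04:19
       grep 'path/to/problem-file' /var/log/ceph/ceph-client.rgw.*.log* | grep PUT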


4. We set debug_rgw=20 and attempted to download the file again. We found that one 4 MB chunk of the 64 MB object could not be read.
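
   Raising the log level and dumping the object's manifest to identify which RADOS chunk was missing was done roughly as follows (the RGW daemon name, bucket, and object name are placeholders):

       # raise RGW verbosity via the admin socket on the gateway host
       ceph daemon client.rgw.gateway1 config set debug_rgw 20
       # dump the manifest to see how the 64 MB object is striped into 4 MB tail chunks
       radosgw-admin object stat --bucket=example-bucket --object=path/to/problem-file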


5. Using `rados get` on this chunk returned "No such file or directory".
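
   The direct RADOS check was along these lines (the pool name and RADOS object name are placeholders taken from the manifest above):

       # find the PG and acting set the chunk maps to
       ceph osd map default.rgw.buckets.data <rados-object-name>
       # try to read the chunk directly
       rados -p default.rgw.buckets.data stat <rados-object-name>
       rados -p default.rgw.buckets.data get <rados-object-name> /tmp/chunk.out
       # -> error: No such file or directory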


6. With debug_osd=20 set, the OSD log showed get_object_context: obc NOT found in cache.


7. With debug_bluestore=20 set, we saw get_onode oid xxx, key xxx != '0xfffffffffffffffeffffffffffffffff'o'.
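
   The OSD-side debug output for steps 6 and 7 was gathered roughly like this (the OSD id and log path are placeholders):

       # raise OSD and BlueStore verbosity on the acting primary
       ceph tell osd.123 injectargs '--debug_osd 20 --debug_bluestore 20'
       # reproduce the failed read, then search the OSD log
       grep 'get_object_context' /var/log/ceph/ceph-osd.123.log
       grep 'get_onode' /var/log/ceph/ceph-osd.123.log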


8. We stopped the primary OSD and tried to get the file again, but the result was the same. The object’s corresponding PG state was active+recovery_wait+degraded.
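
   The primary was identified and the PG state checked roughly as follows (pool, object name, OSD id, and PG id are placeholders):

       # identify the acting set and the primary OSD for the chunk
       ceph osd map default.rgw.buckets.data <rados-object-name>
       # stop the primary and retry the read against the new acting primary
       systemctl stop ceph-osd@123
       # inspect the PG state afterwards
       ceph pg ls | grep <pgid>
       ceph pg <pgid> query | less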


9. Using ceph-objectstore-tool with --op list and --op log, we could not find any information about the object. Inspecting the RocksDB with ceph-kvstore-tool also did not reveal anything new.
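
   The offline inspection was along these lines (OSD id and PG id are placeholders; the OSD must be stopped while the tool runs):

       systemctl stop ceph-osd@123
       # list the objects this OSD holds for the PG
       ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 --pgid <pgid> --op list
       # dump the PG log
       ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 --pgid <pgid> --op log
       # browse BlueStore's RocksDB directly
       ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-123 list
       systemctl start ceph-osd@123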


10. If an OSD had simply lost data, we would have expected the PG to report unfound objects or become inconsistent.
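
   For reference, these are the checks we would normally use to confirm lost data (the PG id is a placeholder):

       ceph health detail
       # objects the PG knows about but cannot find on any OSD
       ceph pg <pgid> list_unfound
       # a deep scrub should mark the PG inconsistent if shards are missing or damaged
       ceph pg deep-scrub <pgid>
       rados list-inconsistent-obj <pgid> --format=json-pretty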


11. We then reanalyzed the startup logs of the OSDs hosting this PG. The pool uses an erasure-code profile of k=6, m=3, so each PG spans 9 OSDs. Six of these OSDs had restarted, and after peering the PG state became active.
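
   The peering history can be pulled from the PG itself and from the OSD startup logs, roughly like this (PG id, OSD id, and log path are placeholders; jq is assumed to be available):

       # peering/recovery history as seen by the current primary
       ceph pg <pgid> query | jq '.recovery_state, .info.history'
       # what a restarted OSD logged while peering this PG after the power loss
       grep 'pg\[<pgid>' /var/log/ceph/ceph-osd.123.log | grep -i peering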


12. We grouped the lost files by upload time: all of them were uploaded before the failure occurred, the earliest around 1 AM, and successful upload records for them can be found in the RGW log.


13. We have submitted an issue on the Ceph issue tracker: https://tracker.ceph.com/issues/66942; it includes the original logs needed for troubleshooting. However, four days have passed without any response, so we are sending this email in the hope that someone from the Ceph team can guide us as soon as possible.


We are currently in a difficult situation and hope you can provide guidance. Thank you.



Best regards.





wu_chulin@xxxxxx



