Hi,

Every now and then, sectors die on disks. When this happens on my bluestore (kraken) OSDs, I get 1 PG that becomes degraded.

The exact status is:

    HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    pg 12.127 is active+clean+inconsistent, acting [141,67,85]

If I do a

    # rados list-inconsistent-obj 12.127 --format=json-pretty

I get:

    (…)
    "osd": 112,
    "errors": [
        "read_error"
    ],
    "size": 4194304

When this happens, I’m forced to manually run “ceph pg repair” on the inconsistent PGs after making sure this was a read error. I feel this should not be a manual process.

If I go on the machine and look at the syslogs, I do see that a sector read error happened once or twice. But if I try to read the sector manually, I can, presumably because it was reallocated on the disk. Last time this happened, I ran badblocks on the disk and it found no issue…

My questions therefore are:

- Why doesn’t bluestore retry reading the sector (in case of transient errors)? (Maybe it does.)
- Why isn’t the PG automatically fixed when a read error is detected?
- What will happen when the disks get old and reach up to 2048 bad sectors before the controllers/SMART declare them as “failure predicted”? I can’t imagine manually fixing up to Nx2048 PGs in an infrastructure of N disks, where N could reach the sky…
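To make the manual procedure concrete, here is a minimal Python sketch of how the check-then-repair step might be scripted. It is only an illustration under assumptions: the function names are hypothetical, and it assumes the JSON from `rados list-inconsistent-obj` has a top-level "inconsistents" list whose entries each carry a "shards" list with the "osd" and "errors" fields shown in the excerpt above. It only calls the two commands already mentioned in this mail, and it defaults to a dry run.

```python
import json
import subprocess


def pg_has_only_read_errors(report):
    """Return True if every shard error in the report is "read_error".

    `report` is the parsed JSON from `rados list-inconsistent-obj`.
    ASSUMPTION: top-level key "inconsistents", each entry holding a
    "shards" list whose items have an "errors" list.
    """
    seen_read_error = False
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            for err in shard.get("errors", []):
                if err != "read_error":
                    # Any other error type needs a human to look at it.
                    return False
                seen_read_error = True
    # An empty report is not a reason to repair.
    return seen_read_error


def repair_if_read_error(pgid, dry_run=True):
    """Run `ceph pg repair` on pgid only if all shard errors are read errors.

    Returns True if the PG qualified for repair, False otherwise.
    With dry_run=True (the default) nothing is repaired.
    """
    out = subprocess.run(
        ["rados", "list-inconsistent-obj", pgid, "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    if not pg_has_only_read_errors(json.loads(out)):
        return False
    if not dry_run:
        subprocess.run(["ceph", "pg", "repair", pgid], check=True)
    return True
```

Something like `repair_if_read_error("12.127")` would report whether the PG qualifies without touching it. This only covers the decision step; it does not address whether an automatic repair is safe in general, which is really the question.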
Ideas?

Thanks && regards