bluestore behavior on disk sector read errors

SCHAER Frederic <frederic.schaer@xxxxxx> · Tue, 27 Jun 2017 09:17:49 +0000

Hi,

Every now and then, sectors die on disks.
When this happens on my bluestore (kraken) OSDs, I get one PG that becomes degraded.
The exact status is:

HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 12.127 is active+clean+inconsistent, acting [141,67,85]

If I run
# rados list-inconsistent-obj 12.127 --format=json-pretty
I get:
(…)
                    "osd": 112,
                    "errors": [
                        "read_error"
                    ],
                    "size": 4194304

When this happens, I am forced to run “ceph pg repair” by hand on the inconsistent PGs, after making sure that the cause really was a read error. I feel this should not be a manual process.
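To illustrate the manual check I describe above, here is a minimal Python sketch. It assumes the JSON printed by "rados list-inconsistent-obj" has a top-level "inconsistents" list whose entries each carry a "shards" list with per-OSD "errors", as in the excerpt. The function names (only_read_errors, repair_command, decide) are hypothetical helpers, not part of Ceph. The sketch only decides whether a repair looks safe and builds the command; it does not run anything.

```python
import json


def only_read_errors(report):
    """Return True if at least one shard error exists and all are 'read_error'.

    Assumes the layout {"inconsistents": [{"shards": [{"errors": [...]}]}]}.
    """
    found = False
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            for err in shard.get("errors", []):
                if err != "read_error":
                    return False  # some other kind of damage: leave it to a human
                found = True
    return found


def repair_command(pgid):
    """Build the repair command for one PG as an argument list."""
    return ["ceph", "pg", "repair", pgid]


def decide(pgid, report_text):
    """Return the repair command if the report shows only read errors, else None."""
    report = json.loads(report_text)
    if only_read_errors(report):
        return repair_command(pgid)
    return None
```

One possible use would be to feed decide() the output of "rados list-inconsistent-obj 12.127 --format=json" and hand the returned list to subprocess.run() only when it is not None. Anything other than a pure read error is deliberately left alone.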

If I log in to the machine and look at the syslogs, I do see that a sector read error happened once or twice.
But if I then try to read the sector manually, the read succeeds, presumably because the disk has reallocated the sector.
The last time this happened, I ran badblocks on the disk and it found no issue…

My questions therefore are:

Why doesn’t bluestore retry reading the sector, in case the error is transient? (Maybe it does.)
Why isn’t the PG automatically repaired when a read error is detected?
What will happen when the disks get old and accumulate up to 2048 bad sectors each before the controllers/SMART declare them as “failure predicted”?
I can’t imagine manually fixing up to N x 2048 PGs in an infrastructure of N disks, where N could reach the sky…

Ideas ?

Thanks && regards

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com