Read errors on OSD

Oliver Humpage <oliver@xxxxxxxxxxxxxxx> · Thu, 1 Jun 2017 10:38:21 +0100

Hello,

We have a small cluster of 44 OSDs across 4 servers.

A few times a week, ceph health reports that a PG is inconsistent. The relevant OSD's log always says "head candidate had a read error" and gives no other information: the digest isn't reported as wrong, the read simply fails with an I/O error. It's usually a different OSD each time, so it doesn't look like a specific disk, controller or server.

Manually running a deep scrub on the pg succeeds, and ceph health goes back to normal.
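
For reference, the manual step is just "ceph pg deep-scrub" with the PG id. Here is a rough Python sketch of how one might wrap it. The helper names deep_scrub_cmd and deep_scrub are made up for illustration, and the id check assumes the usual pool.hex form (e.g. 2.1f).

```python
import re
import subprocess

# Assumption: PG ids look like "<pool number>.<hex>", e.g. "2.1f".
PGID_RE = re.compile(r"^\d+\.[0-9a-f]+$")


def deep_scrub_cmd(pgid):
    """Build the argv for a manual deep scrub of one PG (hypothetical helper)."""
    if not PGID_RE.match(pgid):
        raise ValueError("unexpected PG id: %r" % (pgid,))
    return ["ceph", "pg", "deep-scrub", pgid]


def deep_scrub(pgid):
    """Ask the cluster to deep-scrub the PG; raises if the ceph CLI exits non-zero."""
    subprocess.run(deep_scrub_cmd(pgid), check=True)
```

Note that the command only schedules the scrub, so ceph health needs to be checked again once it has finished.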

As a test today, before scrubbing the pg, I found the relevant file in /var/lib/ceph/osd/… and ran cat(1) on it. The first attempt failed with an Input/output error. The second attempt, however, worked fine.
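
In case anyone wants to repeat that test more systematically, here is a minimal Python sketch that reads a file several times and records which attempts fail. The function name probe_reads and its parameters are my own invention, not anything from Ceph. One caveat: once a read has succeeded, later reads may be served from the page cache, so only the failures tell you much about the disk.

```python
import errno
import time


def probe_reads(path, attempts=5, delay=0.5, chunk=1 << 20):
    """Read `path` end to end `attempts` times.

    Returns one entry per attempt: None if the whole file was read,
    otherwise the errno name of the failure (e.g. "EIO").
    """
    results = []
    for _ in range(attempts):
        try:
            # Unbuffered binary read, one chunk at a time, discarding the data.
            with open(path, "rb", buffering=0) as f:
                while f.read(chunk):
                    pass
            results.append(None)
        except OSError as e:
            results.append(errno.errorcode.get(e.errno, str(e.errno)))
        time.sleep(delay)
    return results
```

A result that starts with "EIO" followed by None entries would match what I saw with cat(1).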

These read errors are all on Samsung 850 Pro 2TB disks (journals are on separate enterprise SSDs). The SMART status is similar on all of them and shows nothing out of the ordinary.

Has anyone else experienced anything similar? Is this just a curse of non-enterprise SSDs, or do you think there might be something else going on, e.g. could it be an XFS issue? Any suggestions as to what to look at would be welcome.

Many thanks,

Oliver.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com