Re: Read errors on OSD

I've seen similar issues in the past with 4U Supermicro servers populated with spinning disks. In my case it turned out to be a specific firmware+BIOS combination on the disk controller card that was buggy. I fixed it by updating the firmware and BIOS on the card to the latest versions.

I saw this on several servers, and it took a while to track down as you can imagine. Same symptoms you're reporting.

There was a data corruption problem a while back with the Linux kernel and Samsung 850 Pro drives, but your problem doesn't sound like data corruption. Still, I'd check to make sure the kernel version you're running has the fix.
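To compare against known-bad combinations like those, it helps to first collect the versions in play. A minimal sketch (the device path is a placeholder, `smartctl` comes from smartmontools, and the vendor tools named in the comments are examples, not confirmed for your hardware):

```shell
# Sketch: gather the kernel, drive, and firmware versions so they can be
# checked against known-bad combinations. DEV is a placeholder - substitute
# the device backing the affected OSD.
DEV=/dev/sdX

# Running kernel version (compare against the changelog for the fix):
uname -r

# Drive model and firmware revision (requires smartmontools; commented out
# here because it needs a real device and root):
# smartctl -i "$DEV" | grep -Ei 'model|firmware'

# Controller firmware/BIOS versions come from the vendor's own tool, e.g.
# storcli or sas3flash for LSI/Broadcom HBAs (vendor-specific).
```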


Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |





On Thu, 2017-06-01 at 13:40 +0100, Oliver Humpage wrote:

> On 1 Jun 2017, at 11:55, Matthew Vernon <mv3@xxxxxxxxxxxx> wrote:
>
>> You don't say what's in kern.log - we've had (rotating) disks that were
>> throwing read errors but still saying they were OK on SMART.
>
> Fair point. There was nothing correlating to the time that ceph logged an
> error this morning, which is why I didn't mention it, but looking harder I
> see yesterday there was:
>
> May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 Sense Key : Hardware Error [current]
> May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 Add. Sense: Internal target failure
> May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 CDB: Read(10) 28 00 77 51 42 d8 00 02 00 00
> May 31 07:20:13 osd1 kernel: blk_update_request: critical target error, dev sdi, sector 2001814232
>
> sdi was the disk with the OSD affected today. Guess it's flaky SSDs then.
> Weird that just re-reading the file makes everything OK though - wondering
> how much it's worth worrying about that, or if there's a way of making
> ceph retry reads automatically?
>
> Oliver.
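The blk_update_request line reports the failing LBA in 512-byte sectors, so one way to test the "works on re-read" behaviour is to read just that sector back directly. A hedged sketch (device name and sector number are taken from the kernel log above; the dd and smartctl lines are commented out since they need the real disk and root):

```shell
# Hypothetical values, copied from the kernel log in this thread.
DEV=/dev/sdi
SECTOR=2001814232        # from "blk_update_request: ... sector 2001814232"
BS=512                   # the kernel reports sectors in 512-byte units

# Byte offset of the failing sector on the raw device.
OFFSET=$((SECTOR * BS))
echo "sector $SECTOR -> byte offset $OFFSET"
# prints: sector 2001814232 -> byte offset 1024928886784

# Re-read just that sector; a second read succeeding would match the
# "re-reading the file makes everything OK" behaviour described above.
# dd if="$DEV" bs="$BS" skip="$SECTOR" count=1 of=/dev/null

# Check SMART health and error counters on the drive (smartmontools):
# smartctl -a "$DEV"
```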
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
