I have a pretty big Ceph cluster running jewel 10.2.7 (2 PB), and IO errors are common. I notice two different behaviours of the OSDs when they hit an IO error.

In JBOD mode, the OSD gets the IO error and then aborts... The PGs peer on other OSDs, the OSD restarts, takes back its PGs, and everything goes on as usual. Ceph health is OK and the degraded object is not reported to the cluster. So if another read hits the same object, the whole cycle starts again (abort, peering, restart, peering). Only a deep-scrub will tell the cluster that the PG has a degraded object, and a repair command will then fix the PG (only a write operation can force the disk to reallocate a bad sector).

Why is the degraded object not reported to the cluster when the first read hits the IO error? If we have IO errors on several different OSDs, read operations will make them flap until a deep-scrub runs... The abort does seem to be the intended behaviour according to the code (see the rough sketch at the end of this mail)! And how does the rbd client deal with this: does it get an IO error back, or does it just retry until the PG has re-peered?

OSD log, IO error and abort:

2017-09-27 00:55:52.100678 7faceba7b700 -1 filestore(/var/lib/ceph/osd/ceph-276) FileStore::read(9.103_head/#9:c086eeb2:::rbd_data.6592c12eb141f2.0000000000058795:head#) pread error: (5) Input/output error
2017-09-27 00:55:52.128147 7faceba7b700 -1 os/filestore/FileStore.cc: In function 'virtual int FileStore::read(const coll_t&, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7faceba7b700 time 2017-09-27 00:55:52.101208
os/filestore/FileStore.cc: 3016: FAILED assert(0 == "eio on pread")
 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fad141219f5]
 2: (FileStore::read(coll_t const&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xd91) [0x7fad13df0661]
 ...

then restart:

2017-09-27 00:56:26.859560 7fd53b8b0800 0 osd.276 341224 load_pgs
2017-09-27 00:56:35.881155 7fd53b8b0800 0 osd.276 341224 load_pgs opened 167 pgs
2017-09-27 00:56:35.881531 7fd53b8b0800 0 osd.276 341224 using 0 op queue with priority op cut off at 64.
2017-09-27 00:56:35.883165 7fd53b8b0800 -1 osd.276 341224 log_to_monitors {default=true}
2017-09-27 00:56:36.875207 7fd53b8b0800 0 osd.276 341224 done with init, starting boot process
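For completeness, the deep-scrub and repair shown below were triggered by hand, along the lines of the commands here (the list-inconsistent-obj step is just an optional way to look at what the scrub flagged; I believe it is available since jewel):

  ceph pg deep-scrub 9.103
  rados list-inconsistent-obj 9.103 --format=json-pretty   # optional: inspect the flagged object
  ceph pg repair 9.103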
then deep-scrub (manually):

2017-09-27 10:34:22.481133 7fd50b196700 0 log_channel(cluster) log [INF] : 9.103 deep-scrub starts
2017-09-27 10:44:05.819995 7fd50b196700 -1 log_channel(cluster) log [ERR] : 9.103 shard 276: soid 9:c086eeb2:::rbd_data.6592c12eb141f2.0000000000058795:head candidate had a read error
2017-09-27 10:55:22.272655 7fd508991700 -1 log_channel(cluster) log [ERR] : 9.103 deep-scrub 0 missing, 1 inconsistent objects
2017-09-27 10:55:22.272663 7fd508991700 -1 log_channel(cluster) log [ERR] : 9.103 deep-scrub 1 errors

then repair:

2017-09-27 11:01:10.648292 7fd50b196700 0 log_channel(cluster) log [INF] : 9.103 repair starts
2017-09-27 11:08:57.329571 7fd50b196700 -1 log_channel(cluster) log [ERR] : 9.103 shard 276: soid 9:c086eeb2:::rbd_data.6592c12eb141f2.0000000000058795:head candidate had a read error
2017-09-27 11:18:42.498541 7fd50b196700 -1 log_channel(cluster) log [ERR] : 9.103 repair 0 missing, 1 inconsistent objects
2017-09-27 11:18:42.498730 7fd50b196700 -1 log_channel(cluster) log [ERR] : 9.103 repair 1 errors, 1 fixed

The second behaviour is more of a hardware issue, and I just want to point out a drawback of using RAID0 disks. On our cluster we have OSD hosts in JBOD mode and some older ones using RAID0 disks. On the RAID0 hosts, we noticed that when an IO error occurs the OSD never sees it (it only shows up in dmesg). In the OSD log we only see threads timing out and blocked requests. What happens next depends on the load: sometimes the OSD survives, sometimes not, but from the client side I think it ends up as an IO error for sure. This is probably specific to my hardware (SmartArray controllers on HP SL4540), but maybe not...
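To come back to "the current behaviour according to the code": as far as I can tell, the abort comes from the EIO handling in FileStore::read() in os/filestore/FileStore.cc (the assert shown in the log above). Here is a minimal standalone sketch of the logic as I understand it -- a paraphrase, not the jewel source; handle_pread_result() is made up for the example, and filestore_fail_eio defaulting to true is my reading of the config:

#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Stand-in for the filestore_fail_eio option (defaults to true, as far as I know).
static const bool filestore_fail_eio = true;

// Sketch of what seems to happen when the pread() under FileStore::read() fails:
// unless EIO is explicitly tolerated, the OSD asserts and the whole daemon aborts,
// so the error never reaches the PG layer and nothing is marked inconsistent.
int handle_pread_result(int got, bool allow_eio)
{
  if (got >= 0)
    return got;                              // normal case: number of bytes read
  fprintf(stderr, "pread error: (%d) %s\n", -got, strerror(-got));
  if (got == -EIO && filestore_fail_eio && !allow_eio)
    assert(0 == "eio on pread");             // <- the "FAILED assert" seen in the OSD log
  return got;                                // otherwise the error would just bubble up
}

int main()
{
  handle_pread_result(-EIO, /*allow_eio=*/false);  // aborts, like osd.276 did
}

If that reading is right, a read that hits EIO never returns to the PG layer at all (the daemon dies first), which would explain why nothing gets marked inconsistent until the next deep-scrub.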