Hi guys,
Our cluster keeps losing OSDs due to medium errors on the underlying disks. Our current action plan is simply to replace the defective disk drive, but I was wondering whether Ceph is being too sensitive in taking the OSD down, or whether our action plan is too simple and crude. Any advice on this issue would be appreciated.
- Medium error from dmesg:
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Sense Key : Medium Error [current]
[Sun Nov 20 15:52:10 2016] Info fld=0x235f23e0
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Add. Sense: Unrecovered read error
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm] CDB:
[Sun Nov 20 15:52:10 2016] Read(10): 28 00 23 5f 23 60 00 02 30 00
[Sun Nov 20 15:52:10 2016] end_request: critical medium error, dev sdm, sector 593437664
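For completeness, I assume the drive itself can be cross-checked with smartctl (this assumes smartmontools is installed; /dev/sdm is the device reported by dmesg, and the MegaRAID device id 15 comes from the MegaCli output below):

smartctl -H /dev/sdm                 # overall health self-assessment
smartctl -A /dev/sdm                 # SMART attributes / defect counters
smartctl -d megaraid,15 -a /dev/sdm  # may be needed if the disk has to be addressed through the controller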
- The OSD log always shows the OSD catching a read error during a deep-scrub:
-3> 2016-11-20 16:54:39.740795 7f71f7e75700 0 log_channel(cluster) log [INF] : 13.7e9 deep-scrub starts
-2> 2016-11-20 16:54:41.958706 7f71f7e75700 0 log_channel(cluster) log [INF] : 13.7e9 deep-scrub ok
-1> 2016-11-20 16:54:48.740180 7f71f7e75700 0 log_channel(cluster) log [INF] : 13.5c9 deep-scrub starts
0> 2016-11-20 16:55:00.704106 7f71f7e75700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f71f7e75700 time 2016-11-20 16:55:00.699763
os/FileStore.cc: 2850: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f7228bad78b]
2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc58) [0x7f722898b718]
3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0x7f7228a17279]
4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x7f72289510a8]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f7228869eea]
6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x480) [0x7f7228870100]
7: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f72288717ee]
8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x7f7228756069]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x7f7228b9e376]
10: (ThreadPool::WorkThread::entry()+0x10) [0x7f7228b9f420]
11: (()+0x8182) [0x7f72279ab182]
12: (clone()+0x6d) [0x7f7225f1647d]
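My reading of the assert is that FileStore::read() got EIO (-5) back from the disk and, with filestore_fail_eio left at its default of true, the OSD aborts on purpose rather than return possibly-bad data, so the crash itself looks intentional. To confirm the setting on a running OSD, something like the following should work (osd.0 is just a placeholder id here):

ceph daemon osd.0 config show | grep fail_eio
# or, for the single option:
ceph daemon osd.0 config get filestore_fail_eio

I guess the option could be set to false to keep the OSD up on read errors, but that would presumably just hide the bad sectors rather than fix anything.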
- MegaCli shows a non-zero media error count for this drive:
Enclosure Device ID: 32
Slot Number: 15
Device Id: 15
Sequence Number: 2
Media Error Count: 9
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.090 TB [0x8bba0cb0 Sectors]
Non Coerced Size: 1.090 TB [0x8baa0cb0 Sectors]
Coerced Size: 1.090 TB [0x8ba80000 Sectors]
Firmware state: JBOD
SAS Address(0): 0x5000c50084f2971d
SAS Address(1): 0x0
Connected Port Number: 0(path0)
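Our current "replace the disk" plan in command form would be roughly the following (osd.<N> stands for the OSD backed by /dev/sdm; the OSD id is a placeholder and the exact stop command depends on the init system):

ceph osd out <N>
# wait until recovery/backfill finishes and all PGs are active+clean again
stop ceph-osd id=<N>            # or: systemctl stop ceph-osd@<N>
ceph osd crush remove osd.<N>
ceph auth del osd.<N>
ceph osd rm <N>
# physically replace the drive in slot 15, then re-create the OSD on the new disk,
# e.g. with ceph-disk prepare / ceph-disk activate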