Hi guys,
Our cluster keeps losing OSDs due to medium errors on the underlying disks. Our current action plan is simply to replace the defective disk drive, but I was wondering whether Ceph is being too sensitive in taking the OSD down, or whether our action plan is too simple and crude. Any advice on this issue would be appreciated.
- Medium error from dmesg:
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Sense Key : Medium Error [current]
[Sun Nov 20 15:52:10 2016] Info fld=0x235f23e0
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Add. Sense: Unrecovered read error
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm] CDB:
[Sun Nov 20 15:52:10 2016] Read(10): 28 00 23 5f 23 60 00 02 30 00
[Sun Nov 20 15:52:10 2016] end_request: critical medium error, dev sdm, sector 593437664
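For completeness, I assume the drive itself can be cross-checked with smartctl (this assumes smartmontools is installed; /dev/sdm is the device reported by dmesg, and the MegaRAID device id 15 comes from the MegaCli output below):

smartctl -H /dev/sdm                 # overall health self-assessment
smartctl -A /dev/sdm                 # SMART attributes / defect counters
smartctl -d megaraid,15 -a /dev/sdm  # may be needed if the disk has to be addressed through the controller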
- The OSD log always shows the OSD catching a read error during a deep-scrub:
-3> 2016-11-20 16:54:39.740795 7f71f7e75700 0 log_channel(cluster) log [INF] : 13.7e9 deep-scrub starts
-2> 2016-11-20 16:54:41.958706 7f71f7e75700 0 log_channel(cluster) log [INF] : 13.7e9 deep-scrub ok
-1> 2016-11-20 16:54:48.740180 7f71f7e75700 0 log_channel(cluster) log [INF] : 13.5c9 deep-scrub starts
0> 2016-11-20 16:55:00.704106 7f71f7e75700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f71f7e75700 time 2016-11-20 16:55:00.699763
os/FileStore.cc: 2850: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f7228bad78b]
2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc58) [0x7f722898b718]
3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0x7f7228a17279]
4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x7f72289510a8]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f7228869eea]
6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x480) [0x7f7228870100]
7: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f72288717ee]
8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x7f7228756069]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x7f7228b9e376]
10: (ThreadPool::WorkThread::entry()+0x10) [0x7f7228b9f420]
11: (()+0x8182) [0x7f72279ab182]
12: (clone()+0x6d) [0x7f7225f1647d]
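My reading of the assert is that FileStore::read() got EIO (-5) back from the disk and, with filestore_fail_eio left at its default of true, the OSD aborts on purpose rather than return possibly-bad data, so the crash itself looks intentional. To confirm the setting on a running OSD, something like the following should work (osd.0 is just a placeholder id here):

ceph daemon osd.0 config show | grep fail_eio
# or, for the single option:
ceph daemon osd.0 config get filestore_fail_eio

I guess the option could be set to false to keep the OSD up on read errors, but that would presumably just hide the bad sectors rather than fix anything.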
- MegaCli shows a non-zero media error count for this drive:
Enclosure Device ID: 32
Slot Number: 15
Device Id: 15
Sequence Number: 2
Media Error Count: 9
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.090 TB [0x8bba0cb0 Sectors]
Non Coerced Size: 1.090 TB [0x8baa0cb0 Sectors]
Coerced Size: 1.090 TB [0x8ba80000 Sectors]
Firmware state: JBOD
SAS Address(0): 0x5000c50084f2971d
SAS Address(1): 0x0
Connected Port Number: 0(path0)
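Our current "replace the disk" plan in command form would be roughly the following (osd.<N> stands for the OSD backed by /dev/sdm; the OSD id is a placeholder and the exact stop command depends on the init system):

ceph osd out <N>
# wait until recovery/backfill finishes and all PGs are active+clean again
stop ceph-osd id=<N>            # or: systemctl stop ceph-osd@<N>
ceph osd crush remove osd.<N>
ceph auth del osd.<N>
ceph osd rm <N>
# physically replace the drive in slot 15, then re-create the OSD on the new disk,
# e.g. with ceph-disk prepare / ceph-disk activate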