Hello,

On Sat, 30 May 2015 22:23:22 +0100 Nick Fisk wrote:

> Hi All,
>
> I was noticing poor performance on my cluster and when I went to
> investigate I noticed OSD 29 was flapping up and down. On investigation
> it looks like it has 2 pending sectors, kernel log is filled with the
> following
>
> end_request: critical medium error, dev sdk, sector 4483365656
> end_request: critical medium error, dev sdk, sector 4483365872
>
> I can see in the OSD logs that it looked like when the OSD was crashing
> it was trying to scrub the PG, probably failing when the kernel passes
> up the read error.
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> 1: /usr/bin/ceph-osd() [0xacaf4a]
> 2: (()+0x10340) [0x7fdc43032340]
> 3: (gsignal()+0x39) [0x7fdc414d1cc9]
> 4: (abort()+0x148) [0x7fdc414d50d8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fdc41ddc6b5]
> 6: (()+0x5e836) [0x7fdc41dda836]
> 7: (()+0x5e863) [0x7fdc41dda863]
> 8: (()+0x5eaa2) [0x7fdc41ddaaa2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x278) [0xbc2908]
> 10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
> long, ceph::buffer::list&, unsigned int, bool)+0xc98) [0x9168e8]
> 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int,
> ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0xa05bf9]
> 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
> std::allocator<hobject_t> > const&, bool, unsigned int,
> ThreadPool::TPHandle&)+0x2c8) [0x8dab98]
> 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
> unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f099a]
> 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4a2)
> [0x7f1132]
> 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*,
> ThreadPool::TPHandle&)+0xbe) [0x6e583e]
> 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
> 17: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> 18: (()+0x8182) [0x7fdc4302a182]
> 19: (clone()+0x6d) [0x7fdc4159547d]
>
> Few questions:
>
> 1. Is this the expected behaviour, or should Ceph try and do
> something to either keep the OSD down or rewrite the sector to cause a
> sector remap?
>
I guess what you see is what you get, but either of those, especially the
rewrite, would be better.
Alas I suppose it is a bit of work for Ceph to do the right thing there
(fetching a good copy from a replica on another node to rewrite the
object with) AND to be certain that this wasn't the last good replica,
read error or not.

> 2. I am monitoring smart stats, but is there any other way of
> picking this up or getting Ceph to highlight it? Something like a
> flapping OSD notification would be nice.
>
Lots of improvement opportunities in the Ceph status output indeed,
starting with what constitutes which severity level (ERR, WRN, INF).

> 3. I'm assuming at this stage this disk will not be replaceable
> under warranty, am I best to mark it as out, let it drain and then
> re-introduce it again, which should overwrite the sector and cause a
> remap? Or is there a better way?
>
That's the safe, easy way. You might want to add a dd zeroing of the
whole drive (which should trigger the sector remap) and a long SMART
test afterwards for good measure before re-adding it.

A faster way might be to determine which PG and file the bad sectors
fall into and just rewrite that, preferably with a good copy of the
data. After that, run a deep-scrub of that PG, potentially followed by a
manual repair if the affected copy was the acting one. See the sketch
below.
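For what it's worth, a rough sketch of both routes; osd.29, /dev/sdk and
the <pgid> placeholder are taken from your mail, adjust as needed and
double-check before running anything destructive:

  # safe route: mark the OSD out and wait for recovery to finish
  # (watch ceph -s / ceph -w)
  ceph osd out 29

  # once it is drained and the OSD daemon is stopped, zero the drive so
  # the firmware remaps the pending sectors, then run a long SMART test
  # and check the reallocated/pending counters before re-adding it
  dd if=/dev/zero of=/dev/sdk bs=1M
  smartctl -t long /dev/sdk
  smartctl -a /dev/sdk

  # targeted route: after rewriting the affected file with a good copy,
  # verify the PG and repair it if the bad copy was the acting one
  ceph pg deep-scrub <pgid>
  ceph pg repair <pgid>
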
Christian

> Many Thanks,
>
> Nick
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com