Re: Read Errors and OSD Flapping

On Sat, May 30, 2015 at 2:23 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Hi All,
>
>
>
> I was noticing poor performance on my cluster, and when I went to investigate I found OSD 29 flapping up and down. It looks like the underlying disk has 2 pending sectors; the kernel log is filled with the following:
>
>
>
> end_request: critical medium error, dev sdk, sector 4483365656
>
> end_request: critical medium error, dev sdk, sector 4483365872
>
>
>
> From the OSD logs it looks like the OSD was deep-scrubbing a PG each time it crashed, presumably hitting an assert when the kernel passed up the read error:
>
>
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> 1: /usr/bin/ceph-osd() [0xacaf4a]
> 2: (()+0x10340) [0x7fdc43032340]
> 3: (gsignal()+0x39) [0x7fdc414d1cc9]
> 4: (abort()+0x148) [0x7fdc414d50d8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fdc41ddc6b5]
> 6: (()+0x5e836) [0x7fdc41dda836]
> 7: (()+0x5e863) [0x7fdc41dda863]
> 8: (()+0x5eaa2) [0x7fdc41ddaaa2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2908]
> 10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc98) [0x9168e8]
> 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0xa05bf9]
> 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x8dab98]
> 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f099a]
> 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4a2) [0x7f1132]
> 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6e583e]
> 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
> 17: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> 18: (()+0x8182) [0x7fdc4302a182]
> 19: (clone()+0x6d) [0x7fdc4159547d]
>
>
>
> A few questions:
>
> 1. Is this the expected behaviour, or should Ceph do something here, either to keep the OSD down or to rewrite the sector and trigger a remap?

So the OSD is committing suicide and we want it to stay dead. But the
init system is restarting it. We are actually discussing how that
should change right now, but aren't quite sure what the right settings
are: http://tracker.ceph.com/issues/11798

Presuming you still have the logs, how long was the cycle time for it
to suicide, restart, and suicide again?
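
If it helps with pulling that number out, something like the rough,
untested sketch below can estimate the cycle from the OSD's own log.
It assumes the default log path (/var/log/ceph/ceph-osd.29.log here,
adjust for your host), the usual "YYYY-MM-DD HH:MM:SS.ffffff"
timestamp prefix, and that each daemon start logs a "ceph version ..."
banner; the same banner also heads the dumped backtrace, so treat the
gaps as a rough estimate.

#!/usr/bin/env python
# Rough estimate of a flapping OSD's crash/restart cycle: measure the
# gaps between "ceph version" banner lines in its log.
from datetime import datetime

LOG = "/var/log/ceph/ceph-osd.29.log"  # assumed default path for osd.29

starts = []
with open(LOG) as f:
    for line in f:
        if "ceph version" in line:
            stamp = " ".join(line.split()[:2])  # date + time fields
            try:
                starts.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"))
            except ValueError:
                pass  # line without the usual timestamp prefix; ignore it

for prev, cur in zip(starts, starts[1:]):
    print("next banner after %s" % (cur - prev))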

>
> 2. I am monitoring SMART stats, but is there any other way of picking this up, or of getting Ceph to highlight it? Something like a flapping-OSD notification would be nice.
>
> 3. I'm assuming that at this stage the disk will not be replaceable under warranty. Am I best to mark it out, let it drain, and then re-introduce it, which should overwrite the bad sectors and cause a remap? Or is there a better way?

I'm not really sure about these ones. I imagine most users are covering
this via Nagios monitoring of the OSD processes themselves?
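
For what it's worth, a check along those lines doesn't need much. Below
is a minimal, untested sketch of a nagios-style probe that flags OSDs
which are still "in" but not "up" (the state a crash-looping OSD spends
most of its time in). It assumes the monitoring host can run
"ceph osd dump --format json" and that the output carries an "osds"
list with "osd", "up" and "in" fields, as in hammer; a real check would
also want to remember state between runs to catch flapping specifically.

#!/usr/bin/env python
# Nagios-style check: warn about OSDs that are "in" the map but not "up",
# e.g. a daemon repeatedly dying on a bad disk.
import json
import subprocess
import sys

dump = json.loads(
    subprocess.check_output(["ceph", "osd", "dump", "--format", "json"]).decode())

down_but_in = [o["osd"] for o in dump.get("osds", [])
               if o.get("in") and not o.get("up")]

if down_but_in:
    print("WARNING: OSDs down but still in: %s" % down_but_in)
    sys.exit(1)  # nagios WARNING exit code
print("OK: all in OSDs are up")
sys.exit(0)

On question 3, the sequence you describe is the usual one. Purely as an
illustration of the "mark it out and wait for the drain" step (assuming
the hammer-era CLI, i.e. "ceph osd out <id>" and "ceph health" printing
a line that starts with HEALTH_OK once recovery finishes), something
like:

#!/usr/bin/env python
# Mark the OSD out, then poll cluster health until the data has drained.
import subprocess
import time

OSD_ID = "29"  # the OSD sitting on the failing disk

subprocess.check_call(["ceph", "osd", "out", OSD_ID])

while True:
    health = subprocess.check_output(["ceph", "health"]).decode().strip()
    print(health)
    if health.startswith("HEALTH_OK"):
        break  # PGs recovered elsewhere; safe to stop and rework the OSD
    time.sleep(60)  # recheck every minute while backfill runs

Once it has drained you can stop the OSD, deal with the sectors (a long
SMART self-test or overwriting the affected LBAs should trigger the
remap), and then bring it back in, or just leave it out if you no
longer trust the drive.
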
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




