Re: Read Errors and OSD Flapping

> > Few questions:
> >
> > 1.       Is this the expected behaviour, or should Ceph try and do
> > something to either keep the OSD down or rewrite the sector to cause a
> > sector remap?
> >
> I guess what you see is what you get, but both things, especially the rewrite
> would be better.
> Alas, I suppose it is a bit of work for it to do the right thing there (getting
> the replica rewritten from another node) AND to be certain that this
> wasn't the last good replica, read error or not.

Agreed, it's probably best that Ceph doesn't overwrite anything unless it is 100% sure it has the correct data. But it would be really nice if something could be done.

> 
> > 2.       I am monitoring smart stats, but is there any other way of
> > picking this up or getting Ceph to highlight it? Something like a
> > flapping OSD notification would be nice.
> >
> Lots of improvement opportunities in the Ceph status indeed.
> Starting with what constitutes which level (ERR, WRN, INF).

Or maybe a counter somewhere that tracks read errors. This could help with #1: Ceph could decide that if it has tried 10 times to read with no luck, it should overwrite or delete.
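
As a rough sketch of the counter idea (this is not an existing Ceph feature; the class, function names, and the default threshold of 10 are hypothetical and only illustrate the logic), a rewrite would be suggested only after repeated failed reads and only when a verified good replica exists:

```python
from collections import defaultdict


class ReadErrorCounter:
    """Hypothetical per-object read-error counter; not part of Ceph."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.failures = defaultdict(int)

    def record_failure(self, object_id):
        """Count one failed read; return True once the threshold is reached."""
        self.failures[object_id] += 1
        return self.failures[object_id] >= self.threshold

    def record_success(self, object_id):
        """A successful read clears the count for that object."""
        self.failures.pop(object_id, None)


def should_rewrite(counter, object_id, have_verified_replica):
    """Suggest a rewrite only if reads keep failing AND a good copy exists."""
    too_many = counter.failures.get(object_id, 0) >= counter.threshold
    return too_many and bool(have_verified_replica)
```

The second condition reflects the concern above: never overwrite unless another replica is known to be good.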

> 
> > 3.       I'm assuming at this stage this disk will not be replaceable
> > under warranty, am I best to mark it as out, let it drain and then
> > re-introduce it again, which should overwrite the sector and cause a
> > remap? Or is there a better way?
> >
> That's the safe, easy way. Might want to add a dd zeroing the drive and long
> SMART test afterwards for good measure before re-adding it.
> 
> A faster way might be to determine which PG and file are affected and just
> rewrite that, preferably with a good copy of the data.
> After that a deep-scrub of that PG, potentially doing a manual repair if this
> was the acting one.
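
For reference, the deep-scrub and repair steps could be scripted roughly as below. This is a sketch under assumptions: the PG id "2.1f" is a made-up placeholder, the helper names are hypothetical, and the repair command should only be issued after checking the scrub result and confirming which copy is good.

```python
import subprocess


def pg_check_commands(pg_id):
    """Return ceph CLI invocations, in order: deep-scrub, inspect, repair."""
    return [
        ["ceph", "pg", "deep-scrub", pg_id],  # re-read and compare replicas
        ["ceph", "health", "detail"],         # look for inconsistent PGs
        ["ceph", "pg", "repair", pg_id],      # manual repair, only if needed
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "2.1f" is a placeholder PG id; substitute the affected PG.
    run_commands(pg_check_commands("2.1f"), dry_run=True)
```

With dry_run left at True the script only prints the commands, so they can be reviewed and run by hand one at a time.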

Thanks for the suggestions. I will reintroduce the disk first and see if the SMART stats change from pending sectors to reallocated; if they don't, I will do the dd and SMART test. It will be a good test of what to do in this situation, as I have a feeling this will most likely happen again.
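
To compare the SMART stats before and after, something like the following could parse the attribute table printed by `smartctl -A`. It is only a sketch: it assumes the usual ten-column attribute layout with the raw value in the last column, the function names are hypothetical, and the sample lines in the usage note are illustrative rather than taken from the affected disk.

```python
WANTED = {"Reallocated_Sector_Ct", "Current_Pending_Sector"}


def parse_smart_attributes(text):
    """Extract raw values of the two sector counters from `smartctl -A` text."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED
        # WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WANTED:
            result[fields[1]] = int(fields[9])
    return result


def pending_became_reallocated(before, after):
    """True if pending sectors dropped and reallocated sectors rose."""
    return (
        after["Current_Pending_Sector"] < before["Current_Pending_Sector"]
        and after["Reallocated_Sector_Ct"] > before["Reallocated_Sector_Ct"]
    )
```

Usage would be to save the `smartctl -A` output before marking the OSD out and again after it has been refilled, then pass both parsed results to pending_became_reallocated(). If the pending count drops without the reallocated count rising, the drive may have rewritten the sector in place rather than remapping it.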

> 
> Christian
> >
> >
> > Many Thanks,
> >
> > Nick
> >
> >
> >
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/







