> > A few questions:
> >
> > 1. Is this the expected behaviour, or should Ceph try to do
> > something to either keep the OSD down or rewrite the sector to cause a
> > sector remap?
> >
> I guess what you see is what you get, but both things, especially the rewrite,
> would be better.
> Alas, I suppose it is a bit of work for it to do the right thing there (getting
> the replica rewritten from another node) AND to be certain that this
> wasn't the last good replica, read error or not.

Agreed, it's probably best that Ceph doesn't do anything unless it's 100% sure it has the correct data before overwriting. But it would be really nice if something could be done.

> > 2. I am monitoring SMART stats, but is there any other way of
> > picking this up or getting Ceph to highlight it? Something like a
> > flapping OSD notification would be nice.
> >
> Lots of improvement opportunities in the Ceph status indeed.
> Starting with what constitutes which level (ERR, WRN, INF).

Or maybe a counter somewhere that monitors read errors. This could help with #1: Ceph could say, "if I've tried 10 times to read with no luck, then overwrite/delete".

> > 3. I'm assuming at this stage this disk will not be replaceable
> > under warranty. Am I best to mark it as out, let it drain and then
> > re-introduce it again, which should overwrite the sector and cause a
> > remap? Or is there a better way?
> >
> That's the safe, easy way. Might want to add a dd zeroing the drive and a long
> SMART test afterwards for good measure before re-adding it.
>
> A faster way might be to determine which PG or file is affected and just
> rewrite it, preferably even with a good copy of the data.
> After that, a deep-scrub of that PG, potentially doing a manual repair if this
> was the acting one.

Thanks for the suggestions. I will re-introduce the disk first and see if the SMART stats change from pending sectors to reallocated. If they don't, then I will do the dd and SMART test.
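For what it's worth, the "did pending turn into reallocated" check could be scripted rather than eyeballed. The following is only a minimal sketch, assuming smartmontools is installed and that the drive reports the usual ATA attribute names `Current_Pending_Sector` and `Reallocated_Sector_Ct` in the `smartctl -A` table. The function names and the sample table are made up for illustration and are not taken from the affected disk.

```python
import subprocess

# SMART attributes of interest, named as smartctl prints them (assumption:
# an ATA drive that exposes both attributes under these names).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Made-up sample of `smartctl -A` output, for illustration only.
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""


def parse_smart_attributes(text):
    """Extract raw values of the watched attributes from `smartctl -A` output."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                continue  # unexpected raw format; skip rather than guess
    return found


def read_smart_attributes(device):
    """Run `smartctl -A` on a device (needs root and smartmontools)."""
    # smartctl uses its exit status as a bit mask, so a non-zero code does not
    # necessarily mean the table is missing; only stdout is parsed here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)


def sector_status(before, after):
    """Compare two snapshots taken before and after the sectors were rewritten."""
    pending_before = before.get("Current_Pending_Sector", 0)
    pending_after = after.get("Current_Pending_Sector", 0)
    realloc_before = before.get("Reallocated_Sector_Ct", 0)
    realloc_after = after.get("Reallocated_Sector_Ct", 0)

    if pending_before == 0 and pending_after == 0:
        return "nothing pending"
    if pending_after >= pending_before:
        return "still pending"  # the rewrite has not touched the bad sectors yet
    if realloc_after > realloc_before:
        return "remapped"  # pending count dropped and spare sectors were used
    return "cleared without remap"  # pending count dropped, no reallocation
```

The idea would be to take one snapshot with `read_smart_attributes("/dev/sdX")` before marking the OSD out and another after it has been re-added and backfilled (`/dev/sdX` is a placeholder), then pass both to `sector_status`. A result of "still pending" would be the cue to go for the dd and long SMART test.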
It will be a good test of what to do in this situation, as I have a feeling this will most likely happen again.

> Christian
>
> > Many Thanks,
> >
> > Nick
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> http://www.gol.com/