distinguish administratively down OSDs

Hey folks,

This is either a feature request, or a request for guidance to handle
something that must be common...  =)

I have a cluster with dozens of OSDs, and one started having read
errors (media errors) from its hard disk.  Ceph complained, so I took
it out of service by marking it down and out.  "ceph osd tree" showed
it as down, with a weight of 0 (out).  Perfect.  In the meantime, I
RMA'd the disk.  The replacement is on hand, but we haven't done the
swap-out yet.  Woohoo, rot in place.  =)
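
For reference, this is roughly the sequence I used (osd.12 stands in
for the real ID here, the daemon was already stopped on that host,
and the "ceph osd tree" line below is trimmed -- the exact columns
vary by version):

  $ ceph osd down osd.12
  $ ceph osd out osd.12
  $ ceph osd tree | grep osd.12
  12    0.91    osd.12    down    0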

Fast forward a few days, and we had a server failure.  This took a
bunch of OSDs with it.  We were able to bring the server back online,
but not before normal recovery operations had started.  Once the
failed server came back up, things started to migrate *back*.  All of
this is normal.  However, the load was pretty intense, and I actually
saw a few OSDs on *other* servers fail, seemingly at random.  Only 3
or 4.  Thankfully I was watching for that and restarted them before
they hit the default 5 minute timeout and kicked off *more* recovery.
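
(Side note for anyone else riding out a recovery like this: as far as
I know that 5 minute window is the "mon osd down out interval"
setting, default 300 seconds, so temporarily raising it should buy
more time before a flapping OSD gets marked out and triggers yet more
recovery -- double-check the option name against your release.)

  # in ceph.conf on the monitors:
  [mon]
      mon osd down out interval = 900

  # or injected into running monitors (not persistent across restarts):
  $ ceph tell mon.* injectargs '--mon-osd-down-out-interval 900'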

On to my question...  During this time, while I was watching for
newly down OSDs, I had no way of knowing which OSDs were newly down
(and potentially out), and which was the one I had set down on
purpose.  At least not from the CLI.  I figured it out from some notes
I had taken when I RMA'd the drive, but (sheepishly) not before I
tried restarting the OSD that had the bad hard drive behind it.

So, from the CLI, how could one distinguish the OSDs that are down
*on purpose* (and should be left that way) from the ones that just
failed?

My first thought would be to allow a "note" field to be attached to
an OSD, and have it displayed in the output of "ceph osd tree".  For
anyone familiar with HPC and PBS in particular, this would be similar
to "pbsnodes -ln", which shows the notes an administrator has attached
to compute nodes that are down.  Examples I see on one of our current
compute clusters are "bad RAM", "bad scratch disk", "does not POST",
etc.

Anyone else want to be able to track such a thing?  Is there an
existing way I could achieve this?  As things scale to hundreds of
OSDs or more, it seems like a useful thing to be able to note which
OSDs have failed, and why.
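
The closest thing I have right now is keeping my own list of
intentionally downed OSDs in a text file and diffing it against the
live state.  A rough sketch (the awk assumes the layout of "ceph osd
tree" on my version, where OSD lines end with "<name> <up/down>
<reweight>", and the notes file is purely my own -- Ceph knows nothing
about it):

  # one "osd.N" per line
  $ cat ~/down-on-purpose.txt
  osd.12

  # names of everything the cluster currently sees as down
  $ ceph osd tree | awk '$(NF-1) == "down" {print $(NF-2)}' | sort > /tmp/down-now

  # anything down that is NOT in my list is a surprise
  $ comm -23 /tmp/down-now <(sort ~/down-on-purpose.txt)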

Thanks,

 - Travis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com