Hey folks,

This is either a feature request, or a request for guidance on handling something that must be common... =)

I have a cluster with dozens of OSDs, and one started having read errors (media errors) from the hard disk. Ceph complained, and I took it out of service by marking it down and out. "ceph osd tree" showed it as down, with a weight of 0 (out). Perfect. In the meantime, I RMA'd the disk. The replacement is on hand, but we haven't done the swap-out yet. Woohoo, rot in place. =)

Fast forward a few days, and we had a server failure. This took a bunch of OSDs with it. We were able to bring the server back online, but not before normal recovery operations had started, so once it came back up things started to migrate *back*. All of this is normal. However, the load was pretty intense, and I actually saw a few OSDs on *other* servers fail, seemingly at random. Only 3 or 4. Thankfully I was watching for that, and restarted them before they hit the default 5 minute timeout and kicked off *more* recovery.

On to my question... While I was watching for newly down OSDs, I had no way of knowing, at least from the CLI, which OSDs were newly down (and potentially out) and which was the one I had set down on purpose. I figured it out from some notes I had taken when I RMA'd the drive, but (sheepishly) not before I tried restarting the OSD that had the bad hard drive behind it.

So, from the CLI, how could one distinguish OSDs that are down *on purpose* and should be left that way? My first thought would be to allow a "note" field to be attached to an OSD and have it displayed in the output of "ceph osd tree". For anyone familiar with HPC and specifically PBS, this would be similar to "pbsnodes -ln", which lists down compute nodes along with any notes an administrator has attached to them. Examples from one of our current compute clusters include "bad RAM", "bad scratch disk", "does not POST", etc.

Does anyone else want to be able to track such a thing? Is there an existing method I could use to achieve this? As things scale to hundreds of OSDs or more, it seems like a useful thing to be able to record which OSDs have failed, and why.

Thanks,
- Travis
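
P.S. As a stopgap, I've been sketching something along these lines: just a local notes file plus a bit of shell, nothing built into Ceph. The awk parsing of the "ceph osd tree" output below is an assumption on my part and may need adjusting for your version's output format.

  # Local notes file, one OSD per line, e.g. ~/ceph-notes/intentional-down.txt:
  #   osd.12  media errors, disk RMA'd, awaiting swap

  # OSD names the cluster currently reports as down
  ceph osd tree | awk '/down/ { for (i = 1; i <= NF; i++) if ($i ~ /^osd\./) print $i }' \
      | sort > /tmp/down-now

  # OSD names we took down on purpose (first column of the notes file)
  awk '{ print $1 }' ~/ceph-notes/intentional-down.txt | sort > /tmp/down-intentional

  # Anything down that is *not* in the notes file is unexpectedly down and needs a look
  comm -23 /tmp/down-now /tmp/down-intentional

That still doesn't surface the notes in "ceph osd tree" itself, though, which is what I'd really like.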