making osd removal safer

I've seen a couple of cases in the last two weeks where users have removed 
OSDs from their clusters too soon and either lost data or had to scramble 
to re-add OSDs.  I'm wondering how we can improve the CLI/tools to make 
this mistake harder to make.

In luminous, we added a few new commands:

- ceph osd destroy: zap info about an OSD but keep its ID in place (with 
a 'destroyed' flag) so that it can be recreated with a replacement device.
- ceph osd purge: zap everything about an OSD, including the ID

Once these commands are run it is hard to re-add the OSD device back into 
the cluster if the operator realizes there are PGs stuck peering or 
unfound objects.

There are two other new commands:

- ceph osd ok-to-stop: Checks whether it looks like PGs will remain 
available even if the specified OSD(s) are stopped.
- ceph osd safe-to-destroy: Checks whether it is safe to destroy an OSD.  
This does various checks to ensure there is no data on the OSD(s), no 
unfound objects, stuck peering, and so forth.

One could argue that the users who got into trouble should have run 
'ceph osd safe-to-destroy' before removing the devices.  That would have 
avoided their problem, but we should do what we can to make it hard 
for them to *not* see the safety check.
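
For illustration, here is a rough sketch of what a careful operator script 
could do today: run the safety check first and only destroy the OSD if it 
passes.  It assumes 'ceph osd safe-to-destroy' exits non-zero when the 
check fails; the helper names (run_ceph, destroy_if_safe) are made up for 
this example, and the command runner is injectable so the logic can be 
tried without a cluster.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit status.
Runner = Callable[[Sequence[str]], int]


def run_ceph(argv: Sequence[str]) -> int:
    """Run a ceph CLI command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def destroy_if_safe(osd_id: int, run: Runner = run_ceph) -> bool:
    """Destroy an OSD only if the safety check passes first.

    Assumption: a non-zero exit status from safe-to-destroy means
    the OSD still holds data or PGs are not in a healthy state.
    """
    check = ["ceph", "osd", "safe-to-destroy", str(osd_id)]
    if run(check) != 0:
        # Check failed: leave the OSD alone.
        return False
    destroy = ["ceph", "osd", "destroy", str(osd_id),
               "--yes-i-really-mean-it"]
    return run(destroy) == 0
```

The point is only that the check comes first and the destructive command 
is never reached when it fails.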

So, two ideas:

First, let's get rid of 'ceph osd rm'.  Users should use destroy or purge; 
this command only does part of the job and has an unclear purpose 
now that the others exist.

Second, change destroy and purge to do the safe-to-destroy check.  If the 
check passes, do the removal.  If the check fails, issue a warning (with 
some detail in the error string) and require the --yes-i-really-mean-it 
option.  (Currently, these commands *always* require the force flag 
but don't perform the safety check.)
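
The proposed behavior could be sketched roughly as below.  This is not the 
actual mon code, just the decision logic under the assumptions stated 
above; CheckResult and decide_removal are hypothetical names for this 
example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CheckResult:
    """Outcome of a safe-to-destroy style check (hypothetical type)."""
    safe: bool
    detail: str = ""


def decide_removal(check: CheckResult, force: bool) -> Tuple[bool, str]:
    """Return (proceed, message) for a destroy/purge request.

    - check passes: proceed, no force flag needed
    - check fails, no force flag: refuse and explain why
    - check fails, force flag given: proceed, but say what was overridden
    """
    if check.safe:
        return True, ""
    if force:
        return True, "warning: check failed (%s); forced" % check.detail
    return False, ("error: %s; pass --yes-i-really-mean-it to override"
                   % check.detail)
```

For example, decide_removal(CheckResult(False, "2 pgs stuck peering"), 
force=False) refuses the removal and carries the detail in the message.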

The main downside to this is that current scripts (well, post-luminous 
scripts) may include the force flag (since it used to be unconditionally 
required) and miss the safety check.  On the other hand, they already 
ignore the safety check, so they are no worse off... they just see no 
benefit.

Thoughts?
sage
