On 01/08/17 12:41, Osama Hasebou wrote:
> Hi,
>
> What would be the best and most efficient way to perform maintenance on big Ceph clusters?
>
> Let's say that we have 3 copies of data, and one of the servers needs maintenance, which might take 1-2 days due to unforeseen issues that come up.
>
> Setting noout for the node is a bit of a risk, since only 2 copies will be active. So in that case, what would be the proper way to take the node down, let the cluster rebalance, and then perform maintenance? And when bringing it back online, how can one avoid rebalancing right away, so that the server can first be checked for proper functioning, and rebalancing reintroduced only once all looks good?
>
> Thank you.
>
> Regards,
> Ossi

The recommended practice would be to use "ceph osd crush reweight" to set the crush weight of the OSDs that will be down to 0. The cluster will then rebalance, and once it's HEALTH_OK again, you can take those OSDs offline without losing any redundancy (though you will need to ensure you have enough spare space in what's left of the cluster that you don't push disk usage too high on your other nodes).

When you're ready to bring them online again, make sure that you have "osd_crush_update_on_start = false" set in your ceph.conf so that they don't potentially change their own weights when they come back. They will then be up but still at crush weight 0, so no data will be assigned to them.

When you're happy everything's okay, use "ceph osd crush reweight" again to bring them back to their original weights. Lots of people like to do that in increments of 0.1 weight at a time, so the recovery is staggered and doesn't impact your active I/O too much.

This assumes your crush layout is such that you can still have three replicas with one server missing.

Rich
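
As a rough illustration of the staggered reweight described above, here is a minimal Python sketch that only generates the "ceph osd crush reweight" commands for one OSD, stepping from 0 up to its original weight in increments of 0.1. The function name, the OSD id 12, and the original weight of 0.5 are hypothetical, chosen purely for the example. It does not run anything against the cluster; the assumption is that you would run each command by hand and wait for HEALTH_OK before moving to the next.

```python
from decimal import Decimal


def staged_reweight_commands(osd_id, target_weight, step="0.1"):
    """Build 'ceph osd crush reweight' commands that raise one OSD's
    crush weight from 0 to target_weight in increments of `step`.

    Hypothetical helper for illustration; it only returns strings.
    """
    # Decimal avoids float noise such as 0.30000000000000004 in the output.
    target = Decimal(str(target_weight))
    inc = Decimal(str(step))
    if target <= 0 or inc <= 0:
        raise ValueError("target_weight and step must be positive")

    commands = []
    weight = Decimal("0")
    while weight < target:
        # Never overshoot the OSD's original weight on the last step.
        weight = min(weight + inc, target)
        commands.append(f"ceph osd crush reweight osd.{osd_id} {weight}")
    return commands


if __name__ == "__main__":
    # Assumed example values: osd.12 with an original crush weight of 0.5.
    for cmd in staged_reweight_commands(12, "0.5"):
        print(cmd)
```

The same helper could be called once per OSD on the server being brought back, using the weights you noted before setting them to 0.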
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com