Re: replacing an OSD or crush map sensitivity

On 4/06/2013 9:16 AM, Chen, Xiaoxi wrote:
> my 0.02, you really don't need to wait for health_ok between your
> recovery steps, just go ahead. Every time a new map is generated and
> broadcast, the old map and in-progress recovery will be canceled

Thanks Xiaoxi, that is helpful to know.

It seems to me that there might be a failure mode (or race condition?)
here though: the cluster is now struggling to recover because the
replacement OSD pushed it into backfill_toofull.
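
To make the toofull condition concrete, here is a minimal sketch of the
kind of check involved: an OSD is treated as too full to accept backfill
once its utilisation reaches some ratio. The helper name, the 0.85
threshold, the "at or above" comparison and the usage figures are all my
assumptions for illustration, not values taken from this cluster or from
the Ceph source.

```python
# Hypothetical helper: flag OSDs that are too full to accept backfill.
# The 0.85 threshold is an assumption; check the ratio set in your ceph.conf.
BACKFILL_FULL_RATIO = 0.85


def toofull_osds(usage, ratio=BACKFILL_FULL_RATIO):
    """Return sorted ids of OSDs whose used/total fraction is at or above ratio.

    usage maps an OSD id to a (used_bytes, total_bytes) tuple.
    """
    return sorted(
        osd_id
        for osd_id, (used, total) in usage.items()
        if total > 0 and used / total >= ratio
    )


# Made-up figures, for illustration only.
example = {0: (90, 100), 1: (60, 100), 2: (86, 100)}
print(toofull_osds(example))
```

With the made-up figures above, OSDs 0 and 2 would be flagged. After a
remove/re-add cycle the surviving OSDs carry extra copies, so several can
sit above the threshold at once.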

The failure sequence might be:

1. From HEALTH_OK, crash an OSD
2. Wait for recovery
3. Remove OSD using usual procedures
4. Wait for recovery
5. Add back OSD using usual procedures
6. Wait for recovery
7. Cluster is unable to recover due to toofull conditions

Perhaps this is a needed test case to round-trip a cluster through a
known failure/recovery scenario.
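
A rough sketch of the "wait for recovery" step such a test would need:
poll `ceph health` until it reports HEALTH_OK, and give up after a
timeout so that a stuck recovery (step 7) shows up as a failure instead
of a hang. The function names, timeout and poll interval are
hypothetical; the only real interface assumed is the `ceph health`
command printing a status line that starts with HEALTH_OK when healthy.

```python
import subprocess
import time


def ceph_health():
    """Return the first line of `ceph health` output, e.g. 'HEALTH_OK'."""
    out = subprocess.check_output(["ceph", "health"]).decode().strip()
    return out.splitlines()[0] if out else ""


def wait_for_health_ok(get_health=ceph_health, timeout=3600.0, interval=10.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the health line starts with HEALTH_OK.

    Returns True once healthy, or False if the timeout expires first.
    get_health, sleep and clock are injectable so the loop can be
    exercised without a live cluster.
    """
    deadline = clock() + timeout
    while True:
        if get_health().startswith("HEALTH_OK"):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A round-trip test would call wait_for_health_ok() after each of the
crash, remove and add steps, and treat a False return as the failure
case described above.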

Note this is using a simplistically configured test-cluster with CephFS
in the mix and about 2.5 million files.

Something else I noticed: I restarted the cluster (and set the leveldb
compact option, since I'd run out of space on the roots) and now I see it
is again making progress on the backfill. It seems odd that the cluster
pauses but a restart clears the pause. Is that by design?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




