On 8/28/2014 4:17 PM, Craig Lewis wrote:
> My initial experience was similar to Mike's, causing a similar level of
> paranoia. :-)  I'm dealing with RadosGW though, so I can tolerate
> higher latencies.
>
> I was running my cluster with noout and nodown set for weeks at a time.

I'm sure Craig will agree, but I wanted to add this for other readers:

I find value in the noout flag for temporary, operator-supervised
intervention, but for events that may occur in the future I prefer to set
"mon osd down out interval", which gives an operator time to intervene
before Ceph starts to self-heal.

The nodown flag is another beast altogether, and it tends to be *a bad
thing* when you are trying to provide reliable client IO. For our use
case, we want OSDs to be marked down quickly if they are in fact
unavailable for any reason, so client IO doesn't hang waiting for them.
If OSDs are flapping during recovery (i.e. the "wrongly marked me down"
log messages), I've had far better results tuning the recovery knobs
than permanently setting the nodown flag. (A rough sketch of the
settings discussed here is at the bottom of this message.)

- Mike

> Recovery of a single OSD might cause other OSDs to crash. In the
> primary cluster, I was always able to get it under control before it
> cascaded too wide. In my secondary cluster, it did spiral out to 40% of
> the OSDs, with 2-5 OSDs down at any time.
>
> I traced my problems to a combination of "osd max backfills" being too
> high for my cluster and mkfs.xfs arguments that were causing memory
> starvation issues. I lowered osd max backfills, added SSD journals,
> and reformatted every OSD with better mkfs.xfs arguments. Now both
> clusters are stable, and I don't want to break them.
>
> I only have 45 OSDs, so the risk that comes with a 24-48 hour recovery
> time is acceptable to me. It will be a problem as I scale up, but
> scaling up will also help with the latency problems.
>
>
> On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson <mike.dawson at cloudapt.com
> <mailto:mike.dawson at cloudapt.com>> wrote:
>
> We use 3x replication and have drives that have relatively high
> steady-state IOPS. Therefore, we tend to prioritize client-side IO
> over quickly repairing the reduction from 3 copies to 2 during the
> loss of one disk. The disruption to client IO during recovery is so
> great on our cluster that we don't want it to enter a recovery state
> without operator supervision.
>
> Letting OSDs get marked out without operator intervention was a
> disaster in the early going of our cluster. For example, an OSD
> daemon crash would trigger automatic recovery where it was not
> needed. Ironically, that unneeded recovery would often trigger
> additional daemons to crash, making a bad situation worse. During
> the recovery, rbd client IO would often drop to 0.
>
> To deal with this issue, we set "mon osd down out interval = 14400",
> so as operators we have 4 hours to intervene before Ceph attempts to
> self-heal. When hardware is at fault, we remove the osd, replace the
> drive, re-add the osd, then allow backfill to begin, thereby
> completely skipping step B in your timeline above.
>
> - Mike
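
P.S. For anyone who wants the concrete knobs in one place, here is a
rough sketch of the settings discussed above. The 4-hour
"mon osd down out interval" is the value we actually use; the
backfill/recovery values and the injectargs line are only illustrative
starting points, so tune and test them on your own cluster before
relying on them.

  # Temporary, operator-supervised intervention: stop OSDs from being
  # marked out while you work, then clear the flag when you are done.
  ceph osd set noout
  # ... do your maintenance ...
  ceph osd unset noout

  # ceph.conf ([mon] or [global]): give operators a window before Ceph
  # tries to self-heal (14400 seconds = 4 hours).
  mon osd down out interval = 14400

  # ceph.conf ([osd]): recovery/backfill knobs that help with the
  # "wrongly marked me down" flapping during recovery. Values shown are
  # illustrative, not a recommendation for every cluster.
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1

  # Or inject the same settings at runtime without restarting OSDs:
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'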