Best practice K/M-parameters EC pool

My initial experience was similar to Mike's, causing a similar level of
paranoia.  :-)  I'm dealing with RadosGW though, so I can tolerate higher
latencies.

I was running my cluster with noout and nodown set for weeks at a time,
because recovery of a single OSD could cause other OSDs to crash.  In the
primary cluster, I was always able to get it under control before it
cascaded too wide.  In my secondary cluster, it did spiral out to 40% of
the OSDs, with 2-5 OSDs down at any given time.
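
In case it helps anyone, the flags are the cluster-wide ones you set and
clear by hand; roughly:

    # keep OSDs from being marked out (which triggers re-replication) or
    # down (which triggers peering churn) while babysitting the cluster
    ceph osd set noout
    ceph osd set nodown

    # clear them again once things are healthy
    ceph osd unset noout
    ceph osd unset nodown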

I traced my problems to a combination of osd max backfills being set too
high for my cluster and mkfs.xfs arguments that were causing memory
starvation issues.  I lowered osd max backfills, added SSD journals, and
reformatted every OSD with better mkfs.xfs arguments.  Now both clusters
are stable, and I don't want to break them.
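
Roughly what that change looked like for me (the values below are
illustrative rather than a recommendation, and the mkfs.xfs line is only an
example of the sort of arguments I mean, not necessarily what I settled on):

    # ceph.conf on the OSD hosts
    [osd]
        osd max backfills = 1

    # or apply at runtime without restarting the daemons
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # example mkfs.xfs invocation with a larger inode size (illustrative)
    mkfs.xfs -f -i size=2048 /dev/sdX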

I only have 45 OSDs, so the risk of a 24-48 hour recovery time is
acceptable to me.  It will be a problem as I scale up, but scaling up will
also help with the latency problems.




On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson <mike.dawson at cloudapt.com>
wrote:

>
> We use 3x replication and have drives that have relatively high
> steady-state IOPS. Therefore, we tend to prioritize client-side IO over
> quickly restoring the third copy during the loss of one disk. The
> disruption to client IO is so great on our cluster that we don't want the
> cluster to be in a recovery state without operator supervision.
>
> Letting OSDs get marked out without operator intervention was a disaster
> in the early going of our cluster. For example, an OSD daemon crash would
> trigger automatic recovery where none was needed. Ironically, the unneeded
> recovery would often trigger additional daemons to crash, making a bad
> situation worse. During the recovery, rbd client IO would often drop to 0.
>
> To deal with this issue, we set "mon osd down out interval = 14400", so as
> operators we have 4 hours to intervene before Ceph attempts to self-heal.
> When hardware is at fault, we remove the osd, replace the drive, re-add the
> osd, then allow backfill to begin, thereby completely skipping step B in
> your timeline above.
>
> - Mike
>
>
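
For reference, Mike's setting lives in ceph.conf on the monitors, and the
drive-replacement dance he describes is roughly the usual one (the osd id
12 below is just a placeholder):

    [mon]
        # give operators 4 hours (14400 s) before Ceph marks a down OSD
        # out and starts recovering on its own
        mon osd down out interval = 14400

    # replace a failed drive by hand, then let backfill run
    ceph osd out 12                # if it is not already out
    # stop ceph-osd id=12, swap the drive, then remove the old OSD entry
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12
    # re-create the OSD on the new drive (ceph-disk prepare/activate);
    # backfill starts once it is up and in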