Best practice K/M-parameters EC pool

chibi@xxxxxxx (Christian Balzer) · Thu, 28 Aug 2014 13:23:36 +0900

On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:

> 
> 
> On 27/08/2014 04:34, Christian Balzer wrote:
> > 
> > Hello,
> > 
> > On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
> > 
> >> Hi Craig,
> >>
> >> I assume the reason for the 48 hours recovery time is to keep the cost
> >> of the cluster low ? I wrote "1h recovery time" because it is roughly
> >> the time it would take to move 4TB over a 10Gb/s link. Could you
> >> upgrade your hardware to reduce the recovery time to less than two
> >> hours ? Or are there factors other than cost that prevent this ?
> >>
> > 
> > I doubt Craig is operating on a shoestring budget.
> > And even if his network were to be just GbE, that would still make it
> > only 10 hours according to your wishful thinking formula.
> > 
> > He probably has set the max_backfills to 1 because that is the level of
> > I/O his OSDs can handle w/o degrading cluster performance too much.
> > The network is unlikely to be the limiting factor.
> > 
> > The way I see it most Ceph clusters are in sort of steady state when
> > operating normally, i.e. a few hundred VM RBD images ticking over, most
> > actual OSD disk ops are writes, as nearly all hot objects that are
> > being read are in the page cache of the storage nodes.
> > Easy peasy.
> > 
> > Until something happens that breaks this routine, like a deep scrub,
> > all those VMs rebooting at the same time or a backfill caused by a
> > failed OSD. Now all of a sudden client ops compete with the backfill
> > ops, page caches are no longer hot, the spinners are seeking left and
> > right. Pandemonium.
> > 
> > I doubt very much that even with an SSD backed cluster you would get
> > away with less than 2 hours for 4TB.
> > 
> > To give you some real life numbers, I currently am building a new
> > cluster but for the time being have only one storage node to play with.
> > It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
> > actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
> > 
> > So I took out one OSD (reweight 0 first, then the usual removal steps)
> > because the actual disk was wonky. Replaced the disk and re-added the
> > OSD. Both operations took about the same time, 4 minutes for
> > evacuating the OSD (having 7 write targets clearly helped) for measly
> > 12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
> > OSD. And that is on one node (thus no network latency) that has the
> > default parameters (so a max_backfills of 10) which was otherwise
> > totally idle. 
> > 
> > In other words, in this pretty ideal case it would have taken 22 hours
> > to re-distribute 4TB.
> 
> That makes sense to me :-) 
> 
> When I wrote 1h, I thought about what happens when an OSD becomes
> unavailable with no planning in advance. In the scenario you describe
> the risk of a data loss does not increase since the objects are evicted
> gradually from the disk being decommissioned and the number of replicas
> stays the same at all times. There is not a sudden drop in the number of
> replicas, which is what I had in mind.
> 
That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.

> If the lost OSD was part of 100 PGs, the other disks (let's say 50 of them)
> will start transferring a new replica of the objects they have to the
> new OSD in their PG. The replacement will not be a single OSD although
> nothing prevents the same OSD from being used in more than one PG as a
> replacement for the lost one. If the cluster network is connected at
> 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
> duplicates do not originate from a single OSD but from at least dozens
> of them and since they target more than one OSD, I assume we can expect
> an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
> account for the fact that the cluster network is never idle.
> 
> Am I being too optimistic ? 
Vastly.

> Do you see another blocking factor that
> would significantly slow down recovery ?
> 
As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.
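
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. It only uses figures already quoted in this thread (4TB to move,
10Gb/s and 1Gb/s links, the "50% busy" assumption, and the roughly 50MB/s
seen when evacuating one OSD on the single-node setup). The function names
and the decimal unit conversions are assumptions made for illustration.

```python
# Back-of-the-envelope: hours needed to move a given amount of data
# at a given sustained rate. Decimal units (1 TB = 10**12 bytes) assumed.

TB = 10**12


def hours_to_move(num_bytes, bytes_per_sec):
    """Hours to move num_bytes at a sustained rate of bytes_per_sec."""
    return num_bytes / bytes_per_sec / 3600.0


def gbit(n):
    """Convert a link speed in Gb/s to bytes per second."""
    return n * 10**9 / 8


data = 4 * TB

estimates = {
    "10Gb/s link, idle": hours_to_move(data, gbit(10)),
    "10Gb/s link, 50% busy": hours_to_move(data, gbit(5)),
    "1Gb/s link, idle": hours_to_move(data, gbit(1)),
    "50MB/s disk-bound backfill": hours_to_move(data, 50 * 10**6),
}

for label, hours in estimates.items():
    print(f"{label}: {hours:.1f} h")
```

The network-only estimates all come out well under ten hours, while the
disk-bound one lands at about 22 hours, which is the gap being argued
about here.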

Another example if you please:
My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs. 
1 GbE links for client and cluster respectively.
---
#ceph -s
    cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
     health HEALTH_OK
     monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
     osdmap e1206: 4 osds: 4 up, 4 in
      pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
            141 GB used, 2323 GB / 2464 GB avail
                 256 active+clean
---
Replication size is 2 and it can do about 60MB/s writes with rados bench
from a client.

Setting one OSD out (the data distribution is nearly uniform) it took 12
minutes to recover on a completely idle (no clients connected) cluster.
The disk utilization was 70-90%, the cluster network hovered around 20%,
never exceeding 35% on the 3 "surviving" nodes. CPU was never an issue.
Given the ceph log numbers and the data size, I make this a recovery speed
of about 40MB/s or 13MB/s per OSD.
Better than I expected, but a far cry from what the OSDs could do
individually if they were not flooded with concurrent read and write
requests by the backfilling operation. 
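
For anyone who wants to redo that arithmetic, a small Python sketch
follows. The figures above came from the ceph log; the sketch instead
assumes the OSD that was set out held an even quarter of the replicated
data (62140 MB x 2 / 4), so it lands slightly above those numbers. The
variable names are made up for illustration.

```python
# Rough recovery-rate estimate for the 4-OSD test cluster above.
# Assumption: data is spread evenly, so the OSD that was set out held
# 1/4 of all replicated data. The figures in the mail came from the log.

data_mb = 62140          # "62140 MB data" from ceph -s
replication = 2          # pool replication size
num_osds = 4
recovery_minutes = 12    # observed time to recover

moved_mb = data_mb * replication / num_osds
surviving_osds = num_osds - 1

total_rate = moved_mb / (recovery_minutes * 60)   # MB/s, cluster-wide
per_osd_rate = total_rate / surviving_osds        # MB/s per surviving OSD

print(f"moved about {moved_mb:.0f} MB")
print(f"aggregate: {total_rate:.0f} MB/s, per OSD: {per_osd_rate:.0f} MB/s")
```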

Now, more disks will help, but I very much doubt that this will scale
linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong please).

And this was an IDLE cluster.

Doing this on a cluster with just about 10 client IOPS per OSD would be
far worse. Never mind that people don't like their client IO to stall for
more than a few seconds.

Something that might improve this both in terms of speed and impact on
the clients would be something akin to the MD (Linux software RAID)
recovery logic. 
As in, only one backfill operation per OSD (read or write, not both!) at
the same time.

Regards,

Christian
> Cheers
> 
> > More in another reply.
> > 
> >> Cheers
> >>
> >> On 26/08/2014 19:37, Craig Lewis wrote:
> >>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
> >>> max backfills = 1).   I believe that increases my risk of failure by
> >>> 48^2 .  Since your numbers are failure rate per hour per disk, I need
> >>> to consider the risk for the whole time for each disk.  So more
> >>> formally, rebuild time to the power of (replicas -1).
> >>>
> >>> So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a
> >>> much higher risk than 1 / 10^8.
> >>>
> >>>
> >>> A risk of 1/43,000 means that I'm more likely to lose data due to
> >>> human error than disk failure.  Still, I can put a small bit of
> >>> effort in to optimize recovery speed, and lower this number.
> >>> Managing human error is much harder.
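
The scaling described above can be checked with a few lines of Python.
This is only a sketch of the rule as stated in the quoted mail (risk for a
one-hour recovery multiplied by recovery hours to the power of replicas
minus one); the function and parameter names are hypothetical, and the
1 / 10^8 base figure is taken as given.

```python
# Scaling of the data-loss risk with recovery time, as described in
# the quoted mail. base_risk is the 1 / 10^8 figure for a one-hour
# recovery; it is taken as given here, not derived.


def scaled_risk(base_risk, recovery_hours, replicas):
    """Scale a one-hour risk by recovery_hours ** (replicas - 1)."""
    return base_risk * recovery_hours ** (replicas - 1)


risk = scaled_risk(base_risk=1e-8, recovery_hours=48, replicas=3)
print(f"risk is about 1 in {1 / risk:,.0f}")
```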
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org> wrote:
> >>>
> >>>     Using percentages instead of numbers led me to calculation
> >>> errors. Here it is again using 1/100 instead of % for clarity ;-)
> >>>
> >>>     Assuming that:
> >>>
> >>>     * The pool is configured for three replicas (size = 3 which is
> >>> the default)
> >>>     * It takes one hour for Ceph to recover from the loss of a single
> >>> OSD
> >>>     * Any other disk has a 1/100,000 chance to fail within the hour
> >>> following the failure of the first disk (assuming AFR
> >>> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk
> >>> is 8%, divided by the number of hours during a year == (0.08 / 8760)
> >>> ~= 1/100,000)
> >>>     * A given disk does not participate in more than 100 PG
> >>>
> >>
> > 
> > 
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/