Best practice K/M-parameters EC pool

chibi@xxxxxxx (Christian Balzer) · Fri, 29 Aug 2014 00:23:40 +0900

On Thu, 28 Aug 2014 10:29:20 -0400 Mike Dawson wrote:

> On 8/28/2014 12:23 AM, Christian Balzer wrote:
> > On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
> >
> >>
> >>
> >> On 27/08/2014 04:34, Christian Balzer wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
> >>>
> >>>> Hi Craig,
> >>>>
> >>>> I assume the reason for the 48 hours recovery time is to keep the
> >>>> cost of the cluster low ? I wrote "1h recovery time" because it is
> >>>> roughly the time it would take to move 4TB over a 10Gb/s link.
> >>>> Could you upgrade your hardware to reduce the recovery time to less
> >>>> than two hours ? Or are there factors other than cost that prevent
> >>>> this ?
> >>>>
> >>>
> >>> I doubt Craig is operating on a shoestring budget.
> >>> And even if his network were to be just GbE, that would still make it
> >>> only 10 hours according to your wishful thinking formula.
> >>>
> >>> He probably has set the max_backfills to 1 because that is the level
> >>> of I/O his OSDs can handle w/o degrading cluster performance too
> >>> much. The network is unlikely to be the limiting factor.
> >>>
> >>> The way I see it most Ceph clusters are in sort of steady state when
> >>> operating normally, i.e. a few hundred VM RBD images ticking over,
> >>> most actual OSD disk ops are writes, as nearly all hot objects that
> >>> are being read are in the page cache of the storage nodes.
> >>> Easy peasy.
> >>>
> >>> Until something happens that breaks this routine, like a deep scrub,
> >>> all those VMs rebooting at the same time or a backfill caused by a
> >>> failed OSD. Now all of a sudden client ops compete with the backfill
> >>> ops, page caches are no longer hot, the spinners are seeking left and
> >>> right. Pandemonium.
> >>>
> >>> I doubt very much that even with a SSD backed cluster you would get
> >>> away with less than 2 hours for 4TB.
> >>>
> >>> To give you some real life numbers, I currently am building a new
> >>> cluster but for the time being have only one storage node to play
> >>> with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs
> >>> and 8 actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
> >>>
> >>> So I took out one OSD (reweight 0 first, then the usual removal
> >>> steps) because the actual disk was wonky. Replaced the disk and
> >>> re-added the OSD. Both operations took about the same time, 4
> >>> minutes for evacuating the OSD (having 7 write targets clearly
> >>> helped) for a measly 12GB or about 50MB/s, and 5 minutes or about
> >>> 35MB/s for refilling the OSD. And that is on one node (thus no network
> >>> latency) that has the default parameters (so a max_backfill of 10)
> >>> which was otherwise totally idle.
> >>>
> >>> In other words, in this pretty ideal case it would have taken 22
> >>> hours to re-distribute 4TB.
> >>
> >> That makes sense to me :-)
> >>
> >> When I wrote 1h, I thought about what happens when an OSD becomes
> >> unavailable with no planning in advance. In the scenario you describe
> >> the risk of a data loss does not increase since the objects are
> >> evicted gradually from the disk being decommissioned and the number
> >> of replicas stays the same at all times. There is not a sudden drop in
> >> the number of replicas, which is what I had in mind.
> >>
> > That may be, but I'm rather certain that there is no difference in
> > speed and priority of a rebalancing caused by an OSD set to weight 0
> > or one being set out.
> >
> >> If the lost OSD was part of 100 PGs, the other disks (let's say 50 of
> >> them) will start transferring a new replica of the objects they have
> >> to the new OSD in their PG. The replacement will not be a single OSD
> >> although nothing prevents the same OSD from being used in more than
> >> one PG as a replacement for the lost one. If the cluster network is
> >> connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s.
> >> Since the new duplicates do not originate from a single OSD but from
> >> at least dozens of them and since they target more than one OSD, I
> >> assume we can expect an actual throughput of 5Gb/s. I should have
> >> written 2h instead of 1h to account for the fact that the cluster
> >> network is never idle.
> >>
> >> Am I being too optimistic ?
> > Vastly.
> >
> >> Do you see another blocking factor that
> >> would significantly slow down recovery ?
> >>
> > As Craig and I keep telling you, the network is not the limiting
> > factor. Concurrent disk IO is, as I pointed out in the other thread.
> 
> Completely agree.
> 
Thank you for that voice of reason, backed up by numbers from a sizable
real-life cluster. ^o^

> On a production cluster with OSDs backed by spindles, even with OSD 
> journals on SSDs, it is insufficient to calculate single-disk 
> replacement backfill time based solely on network throughput. IOPS will 
> likely be the limiting factor when backfilling a single failed spinner 
> in a production cluster.
> 
> Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
> cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio 
> of 3:1), with dual 1GbE bonded NICs.
> 
You're generous with your SSDs. ^o^

> Using only the throughput math, backfill could have theoretically 
> completed in a bit over 2.5 hours, but it actually took 15 hours. I've 
> done this a few times with similar results.
> 
And that makes it about 40MB/s. Similar to what Craig is seeing with
increased backfills and what I speculated from my tests. ^o^
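
As a rough sanity check on those figures, here is a minimal Python sketch.
It uses decimal units, and the 250MB/s figure for the bonded pair of 1GbE
links is an assumption that ignores protocol overhead; the function names
are just for illustration.

```python
# Back-of-the-envelope check of the backfill numbers quoted above.
TB = 10**12
MB = 10**6

def effective_rate_mb_s(bytes_moved, hours):
    """Average throughput in MB/s for a backfill that took `hours`."""
    return bytes_moved / (hours * 3600) / MB

def hours_at_rate(bytes_moved, rate_mb_s):
    """Hours needed to move `bytes_moved` at a sustained `rate_mb_s`."""
    return bytes_moved / (rate_mb_s * MB) / 3600

data = 0.75 * 3 * TB                       # 3TB drive, ~75% full
observed = effective_rate_mb_s(data, 15)   # the backfill took 15 hours
# Assumption: dual bonded 1GbE behaves like 2 x 125MB/s, no overhead.
network_only = hours_at_rate(data, 250)

print(f"observed rate:         {observed:.1f} MB/s")
print(f"network-only estimate: {network_only:.1f} h")
```

This gives roughly 41.7MB/s for the observed backfill and 2.5 hours for the
network-only estimate, in line with the numbers above.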

> Why? Spindle contention on the replacement drive. Graph the '%util' 
> metric from something like 'iostat -xt 2' during a single disk backfill 
> to get a very clear view that spindle contention is the true limiting 
> factor. It'll be pegged at or near 100% if spindle contention is the
> issue.
>
Precisely. 
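
For anyone who wants to log that rather than eyeball it, a minimal Python
sketch that pulls '%util' out of captured 'iostat -x' text. The sample
text, the device names and the 90% threshold are made up for illustration,
and the column is looked up by header name because the set of columns
differs between sysstat versions.

```python
def util_by_device(iostat_text):
    """Return {device: [%util samples]} from `iostat -x` style text."""
    samples = {}
    util_idx = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            util_idx = None              # a blank line ends a device table
            continue
        if fields[0].startswith("Device"):
            # Locate %util by name instead of assuming a column position.
            util_idx = fields.index("%util") if "%util" in fields else None
            continue
        if util_idx is not None and len(fields) > util_idx:
            try:
                value = float(fields[util_idx])
            except ValueError:
                continue                 # skip anything that is not a number
            samples.setdefault(fields[0], []).append(value)
    return samples

# Hypothetical capture: two reports, sdb is the disk being backfilled.
SAMPLE = """\
Device: r/s w/s rkB/s wkB/s await %util
sda 2.00 3.00 40.00 60.00 1.20 4.10
sdb 80.00 95.00 9000.00 30000.00 70.10 99.60

Device: r/s w/s rkB/s wkB/s await %util
sda 1.00 2.00 20.00 30.00 1.10 3.20
sdb 85.00 90.00 9500.00 29000.00 68.40 98.90
"""

utils = util_by_device(SAMPLE)
# Assumed threshold: call a disk "pegged" if every sample is >= 90%.
pegged = [dev for dev, vals in utils.items() if min(vals) >= 90.0]
print(pegged)
```

With the sample above only sdb is reported as pegged.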

Along those lines I give you:
http://www.engadget.com/2014/08/26/seagate-8tb-hard-drive/

Which also makes me smirk, because the people telling me that a RAID
backed OSD is bad often cite the size of it and that it will take ages to
backfill (if it were to fail in the first place and if one hadn't set the
configuration that such a failure would result in an automatic
re-balancing).
Because nothing I have deployed in that fashion, or would consider deploying
that way, is more than 3 times the size of that single disk.

Christian

> - Mike
> 
> 
> >
> > Another example if you please:
> > My shitty test cluster, 4 nodes, one OSD each, journal on disk, no
> > SSDs. 1 GbE links for client and cluster respectively.
> > ---
> > #ceph -s
> >      cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
> >       health HEALTH_OK
> >       monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
> >       osdmap e1206: 4 osds: 4 up, 4 in
> >        pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
> >              141 GB used, 2323 GB / 2464 GB avail
> >                   256 active+clean
> > ---
> > Replication size is 2, and it can do about 60MB/s writes with rados
> > bench from a client.
> >
> > Setting one OSD out (the data distribution is nearly uniform) it took
> > 12 minutes to recover on a completely idle (no clients connected)
> > cluster. The disk utilization was 70-90%, the cluster network hovered
> > around 20%, never exceeding 35% on the 3 "surviving" nodes. CPU was
> > never an issue. Given the ceph log numbers and the data size, I make
> > this a recovery speed of about 40MB/s or 13MB/s per OSD.
> > Better than I expected, but a far cry from what the OSDs could do
> > individually if they were not flooded with concurrent read and write
> > requests by the backfilling operation.
> >
> > Now, more disks will help, but I very much doubt that this will scale
> > linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong
> > please).
> >
> > And this was an IDLE cluster.
> >
> > Doing this on a cluster with just about 10 client IOPS per OSD would be
> > far worse. Never mind that people don't like their client IO to stall
> > for more than a few seconds.
> >
> > Something that might improve this both in terms of speed and impact on
> > the clients would be something akin to the MD (Linux software RAID)
> > recovery logic.
> > As in, only one backfill operation per OSD (read or write, not both!)
> > at the same time.
> >
> > Regards,
> >
> > Christian
> >> Cheers
> >>
> >>> More in another reply.
> >>>
> >>>> Cheers
> >>>>
> >>>> On 26/08/2014 19:37, Craig Lewis wrote:
> >>>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full,
> >>>>> osd max backfills = 1).   I believe that increases my risk of
> >>>>> failure by 48^2 .  Since your numbers are failure rate per hour
> >>>>> per disk, I need to consider the risk for the whole time for each
> >>>>> disk.  So more formally, rebuild time to the power of (replicas
> >>>>> -1).
> >>>>>
> >>>>> So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a
> >>>>> much higher risk than 1 / 10^8.
> >>>>>
> >>>>>
> >>>>> A risk of 1/43,000 means that I'm more likely to lose data due to
> >>>>> human error than disk failure.  Still, I can put a small bit of
> >>>>> effort in to optimize recovery speed, and lower this number.
> >>>>> Managing human error is much harder.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org>
> >>>>> wrote:
> >>>>>
> >>>>>      Using percentages instead of numbers led me to calculation
> >>>>> errors. Here it is again using 1/100 instead of % for clarity ;-)
> >>>>>
> >>>>>      Assuming that:
> >>>>>
> >>>>>      * The pool is configured for three replicas (size = 3 which is
> >>>>> the default)
> >>>>>      * It takes one hour for Ceph to recover from the loss of a
> >>>>> single OSD
> >>>>>      * Any other disk has a 1/100,000 chance to fail within the
> >>>>> hour following the failure of the first disk (assuming AFR
> >>>>> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk
> >>>>> is 8%, divided by the number of hours during a year == (0.08 /
> >>>>> 8760) ~= 1/100,000)
> >>>>>      * A given disk does not participate in more than 100 PGs
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/