On 28/08/2014 16:29, Mike Dawson wrote:
> On 8/28/2014 12:23 AM, Christian Balzer wrote:
>> On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
>>
>>>
>>> On 27/08/2014 04:34, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
>>>>
>>>>> Hi Craig,
>>>>>
>>>>> I assume the reason for the 48 hours recovery time is to keep the
>>>>> cost of the cluster low? I wrote "1h recovery time" because it is
>>>>> roughly the time it would take to move 4TB over a 10Gb/s link.
>>>>> Could you upgrade your hardware to reduce the recovery time to
>>>>> less than two hours? Or are there factors other than cost that
>>>>> prevent this?
>>>>>
>>>>
>>>> I doubt Craig is operating on a shoestring budget.
>>>> And even if his network were to be just GbE, that would still make
>>>> it only 10 hours according to your wishful thinking formula.
>>>>
>>>> He probably has set max_backfills to 1 because that is the level of
>>>> I/O his OSDs can handle w/o degrading cluster performance too much.
>>>> The network is unlikely to be the limiting factor.
>>>>
>>>> The way I see it, most Ceph clusters are in a sort of steady state
>>>> when operating normally, i.e. a few hundred VM RBD images ticking
>>>> over; most actual OSD disk ops are writes, as nearly all hot
>>>> objects that are being read are in the page cache of the storage
>>>> nodes. Easy peasy.
>>>>
>>>> Until something happens that breaks this routine, like a deep
>>>> scrub, all those VMs rebooting at the same time, or a backfill
>>>> caused by a failed OSD. Now all of a sudden client ops compete with
>>>> the backfill ops, page caches are no longer hot, the spinners are
>>>> seeking left and right. Pandemonium.
>>>>
>>>> I doubt very much that even with an SSD-backed cluster you would
>>>> get away with less than 2 hours for 4TB.
>>>>
>>>> To give you some real-life numbers: I am currently building a new
>>>> cluster, but for the time being have only one storage node to play
>>>> with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs
>>>> and 8 actual OSD HDDs (3TB, 7200RPM), with 90GB of (test) data on
>>>> it.
>>>>
>>>> So I took out one OSD (reweight 0 first, then the usual removal
>>>> steps) because the actual disk was wonky. Replaced the disk and
>>>> re-added the OSD. Both operations took about the same time: 4
>>>> minutes for evacuating the OSD (having 7 write targets clearly
>>>> helped) for a measly 12GB, or about 50MB/s, and 5 minutes, or about
>>>> 35MB/s, for refilling the OSD. And that is on one node (thus no
>>>> network latency) with the default parameters (so a max_backfill of
>>>> 10) which was otherwise totally idle.
>>>>
>>>> In other words, in this pretty ideal case it would have taken 22
>>>> hours to re-distribute 4TB.
>>>
>>> That makes sense to me :-)
>>>
>>> When I wrote 1h, I thought about what happens when an OSD becomes
>>> unavailable with no planning in advance. In the scenario you
>>> describe, the risk of data loss does not increase, since the objects
>>> are evicted gradually from the disk being decommissioned and the
>>> number of replicas stays the same at all times. There is no sudden
>>> drop in the number of replicas, which is what I had in mind.
>>>
>> That may be, but I'm rather certain that there is no difference in
>> speed and priority of a rebalancing caused by an OSD set to weight 0
>> or one being set out.
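Coming back to the 1h versus 22 hours figures above, here is the
back-of-the-envelope arithmetic spelled out, as a rough sketch (Python).
The 10Gb/s and 50MB/s numbers are simply the figures quoted in this
thread, treated as assumptions rather than a model of how Ceph actually
schedules backfill:

---
# Back-of-the-envelope recovery time estimates, using figures from this
# thread (assumptions, not measurements of any particular cluster).
TB = 10 ** 12  # bytes

def hours_to_move(data_bytes, bytes_per_second):
    # Time to copy data_bytes at a sustained rate, expressed in hours.
    return data_bytes / float(bytes_per_second) / 3600.0

# Network-only view: 4TB over a 10Gb/s link (~1.25GB/s).
print("network-limited  : %.1f hours" % hours_to_move(4 * TB, 10e9 / 8))

# Extrapolating Christian's ~50MB/s single-OSD evacuation rate to 4TB.
print("at 50MB/s per OSD: %.1f hours" % hours_to_move(4 * TB, 50e6))
---

The gap between the two results (roughly 1 hour versus 22 hours) is the
point being made in this thread: the sustainable per-OSD backfill rate,
not the link speed, dominates.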
>>
>>> If the lost OSD was part of 100 PGs, the other disks (let's say 50
>>> of them) will start transferring a new replica of the objects they
>>> have to the new OSD in their PG. The replacement will not be a
>>> single OSD, although nothing prevents the same OSD from being used
>>> in more than one PG as a replacement for the lost one. If the
>>> cluster network is connected at 10Gb/s and is 50% busy at all times,
>>> that leaves 5Gb/s. Since the new duplicates do not originate from a
>>> single OSD but from at least dozens of them, and since they target
>>> more than one OSD, I assume we can expect an actual throughput of
>>> 5Gb/s. I should have written 2h instead of 1h to account for the
>>> fact that the cluster network is never idle.
>>>
>>> Am I being too optimistic?
>>
>> Vastly.
>>
>>> Do you see another blocking factor that would significantly slow
>>> down recovery?
>>>
>> As Craig and I keep telling you, the network is not the limiting
>> factor. Concurrent disk IO is, as I pointed out in the other thread.
>
> Completely agree.
>
> On a production cluster with OSDs backed by spindles, even with OSD
> journals on SSDs, it is insufficient to calculate single-disk
> replacement backfill time based solely on network throughput. IOPS
> will likely be the limiting factor when backfilling a single failed
> spinner in a production cluster.
>
> Last week I replaced a 3TB 7200rpm drive that was ~75% full in a
> 72-OSD cluster: 24 hosts, an rbd pool with 3 replicas, OSD journals on
> SSDs (ratio of 3:1), and dual 1GbE bonded NICs.
>
> Using throughput math alone, the backfill could theoretically have
> completed in a bit over 2.5 hours, but it actually took 15 hours. I've
> done this a few times with similar results.
>
> Why? Spindle contention on the replacement drive. Graph the '%util'
> metric from something like 'iostat -xt 2' during a single-disk
> backfill to get a very clear view that spindle contention is the true
> limiting factor. It'll be pegged at or near 100% if spindle contention
> is the issue.

Hi Mike,

Did you by any chance also measure how long it took for the 3 replicas
to be restored on all the PGs in which the failed disk was
participating? I assume the following sequence happened:

A) The 3TB drive failed and contained ~2TB
B) The cluster recovered by creating new replicas
C) The new 3TB drive was installed
D) Backfilling completed

I'm interested in the time between A and B, i.e. when one copy is
potentially lost forever, because this is when the probability of a
permanent data loss increases. Although it is important to reduce the
time between C and D to a minimum, it has no impact on the durability
of the data.
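For that A->B window, the arithmetic I have in mind is the same one
Craig uses further down in this thread. A rough sketch (Python), taking
the 8% AFR and the ~1/10^8 figure for a one-hour recovery as given:

---
# Per-disk, per-hour failure probability from an 8% annualized failure
# rate (the assumption used in this thread).
afr = 0.08
p_hour = afr / (365 * 24)          # roughly 1/100,000
print("per-disk, per-hour failure probability: ~1/%d" % round(1 / p_hour))

# Craig's scaling: the ~1/10^8 risk assumed a one-hour recovery; with
# three replicas the recovery window enters to the power (replicas - 1).
baseline_risk = 1e-8               # quoted for a 1-hour recovery
recovery_hours = 48
risk = baseline_risk * recovery_hours ** (3 - 1)
print("risk with a 48h recovery: ~1/%d" % round(1 / risk))
---

The second figure reproduces Craig's ~1/43,000 estimate for a 48-hour
rebuild, which is why the A->B recovery time matters so much more for
durability than the C->D backfill time.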
Cheers

> - Mike
>
>>
>> Another example if you please:
>> My shitty test cluster: 4 nodes, one OSD each, journal on disk, no
>> SSDs. 1 GbE links for client and cluster respectively.
>> ---
>> #ceph -s
>>     cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
>>      health HEALTH_OK
>>      monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
>>      osdmap e1206: 4 osds: 4 up, 4 in
>>       pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
>>             141 GB used, 2323 GB / 2464 GB avail
>>                  256 active+clean
>> ---
>> Replication size is 2; it can do about 60MB/s writes with rados bench
>> from a client.
>>
>> Setting one OSD out (the data distribution is nearly uniform), it
>> took 12 minutes to recover on a completely idle (no clients
>> connected) cluster. The disk utilization was 70-90%, the cluster
>> network hovered around 20%, never exceeding 35% on the 3 "surviving"
>> nodes. CPU was never an issue.
>> Given the ceph log numbers and the data size, I make this a recovery
>> speed of about 40MB/s, or 13MB/s per OSD.
>> Better than I expected, but a far cry from what the OSDs could do
>> individually if they were not flooded with concurrent read and write
>> requests by the backfilling operation.
>>
>> Now, more disks will help, but I very much doubt that this will scale
>> linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong
>> please).
>>
>> And this was an IDLE cluster.
>>
>> Doing this on a cluster with just about 10 client IOPS per OSD would
>> be far worse. Never mind that people don't like their client IO to
>> stall for more than a few seconds.
>>
>> Something that might improve this both in terms of speed and impact
>> to the clients would be something akin to the MD (Linux software
>> RAID) recovery logic.
>> As in, only one backfill operation per OSD (read or write, not both!)
>> at the same time.
>>
>> Regards,
>>
>> Christian
>>
>>> Cheers
>>>
>>>> More in another reply.
>>>>
>>>>> Cheers
>>>>>
>>>>> On 26/08/2014 19:37, Craig Lewis wrote:
>>>>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full,
>>>>>> osd max backfills = 1). I believe that increases my risk of
>>>>>> failure by 48^2. Since your numbers are failure rate per hour per
>>>>>> disk, I need to consider the risk for the whole time for each
>>>>>> disk. So more formally, rebuild time to the power of
>>>>>> (replicas - 1).
>>>>>>
>>>>>> So I'm at 2304/100,000,000, or approximately 1/43,000. That's a
>>>>>> much higher risk than 1/10^8.
>>>>>>
>>>>>> A risk of 1/43,000 means that I'm more likely to lose data due to
>>>>>> human error than disk failure. Still, I can put a small bit of
>>>>>> effort in to optimize recovery speed, and lower this number.
>>>>>> Managing human error is much harder.
>>>>>>
>>>>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org
>>>>>> <mailto:loic at dachary.org>> wrote:
>>>>>>
>>>>>>     Using percentages instead of numbers led me to calculation
>>>>>>     errors. Here it is again using 1/100 instead of % for
>>>>>>     clarity ;-)
>>>>>>
>>>>>>     Assuming that:
>>>>>>
>>>>>>     * The pool is configured for three replicas (size = 3, which
>>>>>>       is the default)
>>>>>>     * It takes one hour for Ceph to recover from the loss of a
>>>>>>       single OSD
>>>>>>     * Any other disk has a 1/100,000 chance to fail within the
>>>>>>       hour following the failure of the first disk (assuming the
>>>>>>       AFR https://en.wikipedia.org/wiki/Annualized_failure_rate
>>>>>>       of every disk is 8%, divided by the number of hours in a
>>>>>>       year: 0.08 / 8760 ~= 1/100,000)
>>>>>>     * A given disk does not participate in more than 100 PGs
>>>>>>

--
Loïc Dachary, Artisan Logiciel Libre