On 28/08/2014 06:23, Christian Balzer wrote:
> On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
>>
>> On 27/08/2014 04:34, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
>>>
>>>> Hi Craig,
>>>>
>>>> I assume the reason for the 48 hours recovery time is to keep the
>>>> cost of the cluster low? I wrote "1h recovery time" because it is
>>>> roughly the time it would take to move 4TB over a 10Gb/s link.
>>>> Could you upgrade your hardware to reduce the recovery time to less
>>>> than two hours? Or are there factors other than cost that prevent
>>>> this?
>>>>
>>>
>>> I doubt Craig is operating on a shoestring budget.
>>> And even if his network were to be just GbE, that would still make it
>>> only 10 hours according to your wishful thinking formula.
>>>
>>> He probably has set max_backfills to 1 because that is the level of
>>> I/O his OSDs can handle w/o degrading cluster performance too much.
>>> The network is unlikely to be the limiting factor.
>>>
>>> The way I see it, most Ceph clusters are in a sort of steady state
>>> when operating normally, i.e. a few hundred VM RBD images ticking
>>> over, most actual OSD disk ops are writes, as nearly all hot objects
>>> that are being read are in the page cache of the storage nodes.
>>> Easy peasy.
>>>
>>> Until something happens that breaks this routine, like a deep scrub,
>>> all those VMs rebooting at the same time, or a backfill caused by a
>>> failed OSD. Now all of a sudden client ops compete with the backfill
>>> ops, page caches are no longer hot, the spinners are seeking left and
>>> right. Pandemonium.
>>>
>>> I doubt very much that even with an SSD backed cluster you would get
>>> away with less than 2 hours for 4TB.
>>>
>>> To give you some real life numbers: I am currently building a new
>>> cluster but for the time being have only one storage node to play
>>> with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs
>>> and 8 actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
>>>
>>> So I took out one OSD (reweight 0 first, then the usual removal
>>> steps) because the actual disk was wonky. Replaced the disk and
>>> re-added the OSD. Both operations took about the same time: 4 minutes
>>> for evacuating the OSD (having 7 write targets clearly helped) for a
>>> measly 12GB, or about 50MB/s, and 5 minutes, or about 35MB/s, for
>>> refilling the OSD. And that is on one node (thus no network latency)
>>> with the default parameters (so max_backfills of 10), which was
>>> otherwise totally idle.
>>>
>>> In other words, in this pretty ideal case it would have taken 22
>>> hours to re-distribute 4TB.
>>
>> That makes sense to me :-)
>>
>> When I wrote 1h, I thought about what happens when an OSD becomes
>> unavailable with no planning in advance. In the scenario you describe,
>> the risk of data loss does not increase, since the objects are evicted
>> gradually from the disk being decommissioned and the number of
>> replicas stays the same at all times. There is no sudden drop in the
>> number of replicas, which is what I had in mind.
>>
> That may be, but I'm rather certain that there is no difference in
> speed and priority between a rebalancing caused by an OSD set to
> weight 0 and one caused by an OSD being set out.
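For reference, the two recovery-time figures under discussion work out
roughly as follows. This is a back-of-the-envelope sketch (Python): the
10Gb/s link is the network-bound assumption from my first mail, and the
~50MB/s rate is the single-node backfill speed reported above; neither is
a general constant.

---
# Back-of-the-envelope recovery time estimates (sketch, not a benchmark).
# Assumptions: 4 TB to re-replicate, a 10 Gb/s cluster link for the
# network-bound case, and the ~50 MB/s single-node backfill rate
# observed above for the disk-bound case.

DATA_TB = 4
data_mb = DATA_TB * 1000 * 1000        # 4 TB expressed in MB (decimal units)

# Network-bound estimate: 10 Gb/s ~= 1250 MB/s of raw throughput.
network_mb_s = 10 * 1000 / 8
hours_network_bound = data_mb / network_mb_s / 3600

# Disk-bound estimate: ~50 MB/s aggregate backfill rate (observed above).
disk_mb_s = 50
hours_disk_bound = data_mb / disk_mb_s / 3600

print(f"network-bound: ~{hours_network_bound:.1f} h")   # ~0.9 h
print(f"disk-bound:    ~{hours_disk_bound:.1f} h")      # ~22.2 h
---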
>> If the lost OSD was part of 100 PGs, the other disks (let's say 50 of
>> them) will start transferring a new replica of the objects they hold
>> to the new OSD in their PG. The replacement will not be a single OSD,
>> although nothing prevents the same OSD from being used in more than
>> one PG as a replacement for the lost one. If the cluster network is
>> connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s.
>> Since the new replicas do not originate from a single OSD but from at
>> least dozens of them, and since they target more than one OSD, I
>> assume we can expect an actual throughput of 5Gb/s. I should have
>> written 2h instead of 1h to account for the fact that the cluster
>> network is never idle.
>>
>> Am I being too optimistic?
> Vastly.
>
>> Do you see another blocking factor that would significantly slow down
>> recovery?
>>
> As Craig and I keep telling you, the network is not the limiting
> factor. Concurrent disk IO is, as I pointed out in the other thread.
>
> Another example if you please:
> My shitty test cluster: 4 nodes, one OSD each, journal on disk, no
> SSDs. 1 GbE links for client and cluster respectively.
> ---
> # ceph -s
>     cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
>      health HEALTH_OK
>      monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
>      osdmap e1206: 4 osds: 4 up, 4 in
>       pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
>             141 GB used, 2323 GB / 2464 GB avail
>                  256 active+clean
> ---
> Replication size is 2; it can do about 60MB/s writes with rados bench
> from a client.
>
> Setting one OSD out (the data distribution is nearly uniform), it took
> 12 minutes to recover on a completely idle (no clients connected)
> cluster. The disk utilization was 70-90%, and the cluster network
> hovered around 20%, never exceeding 35%, on the 3 "surviving" nodes.
> CPU was never an issue.
> Given the ceph log numbers and the data size, I make this a recovery
> speed of about 40MB/s, or 13MB/s per OSD.
> Better than I expected, but a far cry from what the OSDs could do
> individually if they were not flooded with concurrent read and write
> requests by the backfilling operation.

Hi Christian,

My apologies for not noticing you were running the test cluster with the
journal collocated with the data on a spinner. In this case I would
indeed expect I/O to be the blocking factor, because randomized
operations can reduce the disk throughput by an order of magnitude. If
you put the journal on an SSD, which is what is generally recommended,
you should be able to observe a significant improvement. Such a setup
also better reflects the architecture of a large cluster, and
extrapolations from it will be more accurate.
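To put a rough number on the order-of-magnitude claim, here is a small
sketch. The ~120MB/s sequential figure is an assumed typical value for a
3TB 7200RPM disk, not a measurement; the 13MB/s figure is the per-OSD
recovery rate you reported above.

---
# Rough illustration of how far the observed backfill rate is from what
# a spinner can do sequentially (sketch; assumed numbers marked below).

sequential_mb_s = 120          # ASSUMED typical sequential rate, 7200RPM SATA disk
observed_recovery_mb_s = 13    # per-OSD recovery rate reported above

# With the journal collocated on the same spinner, every backfilled
# object is written twice (journal, then data), interleaved with the
# reads needed to serve recovery, so the head seeks instead of streaming.
slowdown = sequential_mb_s / observed_recovery_mb_s
print(f"observed per-OSD backfill is ~{slowdown:.0f}x below sequential speed")
# -> roughly an order of magnitude, consistent with the argument above
---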
Cheers

> Now, more disks will help, but I very much doubt that this will scale
> linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong,
> please).
>
> And this was an IDLE cluster.
>
> Doing this on a cluster with just about 10 client IOPS per OSD would
> be far worse. Never mind that people don't like their client IO to
> stall for more than a few seconds.
>
> Something that might improve this, both in terms of speed and impact
> on the clients, would be something akin to the MD (Linux software
> RAID) recovery logic.
> As in, only one backfill operation per OSD (read or write, not both!)
> at the same time.
>
> Regards,
>
> Christian
>
>> Cheers
>>
>>> More in another reply.
>>>
>>>> Cheers
>>>>
>>>> On 26/08/2014 19:37, Craig Lewis wrote:
>>>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full,
>>>>> osd max backfills = 1). I believe that increases my risk of
>>>>> failure by 48^2. Since your numbers are failure rate per hour per
>>>>> disk, I need to consider the risk over the whole rebuild time for
>>>>> each disk. More formally, the risk scales with the rebuild time to
>>>>> the power of (replicas - 1).
>>>>>
>>>>> So I'm at 2304/100,000,000, or approximately 1/43,000. That's a
>>>>> much higher risk than 1 / 10^8.
>>>>>
>>>>> A risk of 1/43,000 means that I'm more likely to lose data due to
>>>>> human error than disk failure. Still, I can put a small bit of
>>>>> effort in to optimize recovery speed and lower this number.
>>>>> Managing human error is much harder.
>>>>>
>>>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org>
>>>>> wrote:
>>>>>
>>>>> Using percentages instead of numbers led me to calculation errors.
>>>>> Here it is again using 1/100 instead of % for clarity ;-)
>>>>>
>>>>> Assuming that:
>>>>>
>>>>> * The pool is configured for three replicas (size = 3, which is
>>>>>   the default)
>>>>> * It takes one hour for Ceph to recover from the loss of a single
>>>>>   OSD
>>>>> * Any other disk has a 1/100,000 chance to fail within the hour
>>>>>   following the failure of the first disk (assuming the AFR
>>>>>   https://en.wikipedia.org/wiki/Annualized_failure_rate of every
>>>>>   disk is 8%, divided by the number of hours in a year:
>>>>>   0.08 / 8760 ~= 1/100,000)
>>>>> * A given disk does not participate in more than 100 PGs
>>>>>
>>>>
>>>
>>
>

-- 
Loïc Dachary, Artisan Logiciel Libre
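For reference, Craig's arithmetic above, spelled out as a quick sanity
check. This is a sketch using only the numbers quoted in this thread; the
1/100,000,000 baseline is the figure his "1 / 10^8" refers to, and the 8%
AFR is the assumption from the quoted mail.

---
# Craig's scaling argument, spelled out (sketch; numbers taken from the
# thread above, not derived independently).

afr = 0.08                       # assumed annualized failure rate per disk
hours_per_year = 24 * 365
p_fail_per_hour = afr / hours_per_year   # ~1/100,000, as assumed above

baseline_risk = 1 / 100_000_000  # figure quoted above for a 1-hour recovery
rebuild_hours = 48               # Craig's observed rebuild time
replicas = 3

# With size=3, losing data needs (replicas - 1) further failures during
# the rebuild window, so the risk scales with rebuild_hours ** (replicas - 1).
scaling = rebuild_hours ** (replicas - 1)    # 48^2 = 2304
risk = baseline_risk * scaling               # 2304 / 100,000,000

print(f"per-hour disk failure probability ~= 1/{1 / p_fail_per_hour:,.0f}")
print(f"scaling factor: {scaling}")
print(f"risk ~= 1/{1 / risk:,.0f}")          # ~1/43,000
---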