On 27/08/2014 04:34, Christian Balzer wrote:
> 
> Hello,
> 
> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
> 
>> Hi Craig,
>> 
>> I assume the reason for the 48 hours recovery time is to keep the cost
>> of the cluster low ? I wrote "1h recovery time" because it is roughly
>> the time it would take to move 4TB over a 10Gb/s link. Could you
>> upgrade your hardware to reduce the recovery time to less than two
>> hours ? Or are there factors other than cost that prevent this ?
>> 
> 
> I doubt Craig is operating on a shoestring budget.
> And even if his network were to be just GbE, that would still make it
> only 10 hours according to your wishful thinking formula.
> 
> He probably has set the max_backfills to 1 because that is the level of
> I/O his OSDs can handle w/o degrading cluster performance too much.
> The network is unlikely to be the limiting factor.
> 
> The way I see it, most Ceph clusters are in a sort of steady state when
> operating normally, i.e. a few hundred VM RBD images ticking over, most
> actual OSD disk ops are writes, as nearly all hot objects that are being
> read are in the page cache of the storage nodes.
> Easy peasy.
> 
> Until something happens that breaks this routine, like a deep scrub, all
> those VMs rebooting at the same time, or a backfill caused by a failed
> OSD. Now all of a sudden client ops compete with the backfill ops, page
> caches are no longer hot, the spinners are seeking left and right.
> Pandemonium.
> 
> I doubt very much that even with an SSD-backed cluster you would get
> away with less than 2 hours for 4TB.
> 
> To give you some real life numbers, I am currently building a new
> cluster but for the time being have only one storage node to play with.
> It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8
> actual OSD HDDs (3TB, 7200RPM), with 90GB of (test) data on it.
> 
> So I took out one OSD (reweight 0 first, then the usual removal steps)
> because the actual disk was wonky. Replaced the disk and re-added the
> OSD. Both operations took about the same time: 4 minutes for evacuating
> the OSD (having 7 write targets clearly helped) for a measly 12GB, or
> about 50MB/s, and 5 minutes, or about 35MB/s, for refilling the OSD.
> And that is on one node (thus no network latency) with the default
> parameters (so a max_backfills of 10) which was otherwise totally idle.
> 
> In other words, in this pretty ideal case it would have taken 22 hours
> to re-distribute 4TB.

That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe the
risk of data loss does not increase, since the objects are evicted
gradually from the disk being decommissioned and the number of replicas
stays the same at all times. There is no sudden drop in the number of
replicas, which is what I had in mind.

If the lost OSD was part of 100 PGs, the other disks (let's say 50 of
them) will start transferring a new replica of the objects they hold to
the new OSD in their PG. The replacement will not be a single OSD,
although nothing prevents the same OSD from being used in more than one
PG as a replacement for the lost one. If the cluster network is connected
at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
copies do not originate from a single OSD but from dozens of them, and
since they target more than one OSD, I assume we can expect an actual
throughput of about 5Gb/s.
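
A quick back-of-the-envelope sketch of that arithmetic (Python; the 4TB
payload, the 10Gb/s link and the 50% background utilisation are the
figures above, nothing else):

    # Rough time to re-create the replicas that were on a lost OSD.
    # Figures from the discussion above: 4TB to move over a 10Gb/s
    # cluster network, with an assumed fraction of the link already
    # busy with client traffic.

    def recovery_hours(data_tb, link_gbps, busy_fraction):
        data_bits = data_tb * 1e12 * 8                  # decimal TB -> bits
        usable_bps = link_gbps * 1e9 * (1.0 - busy_fraction)
        return data_bits / usable_bps / 3600.0          # seconds -> hours

    print(recovery_hours(4, 10, 0.0))   # idle 10Gb/s link : ~0.9h, the "1h" figure
    print(recovery_hours(4, 10, 0.5))   # 50% busy (5Gb/s) : ~1.8h, hence "2h"
    print(recovery_hours(4, 1, 0.0))    # idle GbE         : ~8.9h, the "10 hours"

Of course this assumes the spindles on the surviving OSDs can actually
keep the remaining 5Gb/s busy, which is exactly the point you raise above.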
I should have written 2h instead of 1h to account for the fact that the
cluster network is never idle.

Am I being too optimistic ? Do you see another blocking factor that would
significantly slow down recovery ?

Cheers

> More in another reply.
> 
>> Cheers
>> 
>> On 26/08/2014 19:37, Craig Lewis wrote:
>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
>>> max backfills = 1). I believe that increases my risk of failure by
>>> 48^2. Since your numbers are failure rate per hour per disk, I need
>>> to consider the risk for the whole time for each disk. So more
>>> formally, rebuild time to the power of (replicas - 1).
>>>
>>> So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much
>>> higher risk than 1 / 10^8.
>>>
>>> A risk of 1/43,000 means that I'm more likely to lose data due to
>>> human error than disk failure. Still, I can put a small bit of effort
>>> in to optimize recovery speed, and lower this number. Managing human
>>> error is much harder.
>>>
>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org
>>> <mailto:loic at dachary.org>> wrote:
>>>
>>>     Using percentages instead of numbers led me to calculation
>>>     errors. Here it is again using 1/100 instead of % for clarity ;-)
>>>
>>>     Assuming that:
>>>
>>>     * The pool is configured for three replicas (size = 3 which is
>>>       the default)
>>>     * It takes one hour for Ceph to recover from the loss of a single
>>>       OSD
>>>     * Any other disk has a 1/100,000 chance to fail within the hour
>>>       following the failure of the first disk (assuming the AFR
>>>       https://en.wikipedia.org/wiki/Annualized_failure_rate of every
>>>       disk is 8%, divided by the number of hours in a year:
>>>       0.08 / 8760 ~= 1/100,000)
>>>     * A given disk does not participate in more than 100 PGs
>>>
>> 

-- 
Loïc Dachary, Artisan Logiciel Libre
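
For reference, a minimal sketch of the back-of-the-envelope numbers quoted
above (the 8% AFR, size = 3, the 1/10^8 baseline for a 1h recovery window,
and Craig's 48 hour rebuild time are all taken from the quoted mails; this
only reproduces that arithmetic, it is not a full reliability model):

    # Reproduce the failure numbers quoted in the thread above.
    afr = 0.08                             # assumed annualized failure rate per disk
    hourly_failure = afr / 8760            # ~1/100,000 per disk per hour

    replicas = 3
    baseline_risk = 1e-8                   # the quoted "1 / 10^8" for a 1h recovery
    rebuild_hours = 48                     # Craig's observed rebuild time

    # Craig's scaling: rebuild time to the power of (replicas - 1).
    risk = baseline_risk * rebuild_hours ** (replicas - 1)

    print(1 / hourly_failure)              # ~109,500, i.e. roughly 1/100,000
    print(risk, 1 / risk)                  # 2.304e-05, i.e. roughly 1/43,000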