On Wed, Nov 6, 2013 at 4:15 PM, Darren Birkett <darren.birkett@xxxxxxxxx> wrote:
> Hi,
>
> I understand from various reading and research that there are a number of
> things to consider when deciding how many disks one wants to put into a
> single chassis:
>
> 1. Higher density means a higher failure domain (more data to re-replicate if
> you lose a node)
> 2. More disks means more CPU/memory horsepower to handle the number of OSDs
> 3. Network becomes a bottleneck with too many OSDs per node
> 4. ...
>
> We are looking at building high density nodes for small scale 'starter'
> deployments for our customers (maybe 4 or 5 nodes). High density in this
> case could mean a 2u chassis with 2x external 45 disk JBOD containers
> attached. That's 90 3TB disks/OSDs to be managed by a single node. That's
> about 243TB of potential usable space, and so (assuming up to 75% fillage)
> maybe 182TB of potential data 'loss' in the event of a node failure. On an
> uncongested, unused, 10Gbps network, my back-of-a-beer-mat calculations say
> that would take about 45 hours to get the cluster back into an undegraded
> state - that is, the requisite number of copies of all objects.

With that many disks, you should also consider that the controller cache
will buy you very little, even with 1GB controller(s); only a tiered cache
is a real option at that scale. Recovery will also take much longer than
the raw numbers suggest, even if you leave headroom for client I/O in your
calculations, because raw disks have very limited IOPS capacity: recovery
will either take far longer than a first-glance estimate or it will affect
regular operations (rough numbers are sketched at the end of this mail).
For S3/Swift that may be acceptable, but for VM images it is not.

> Assuming that you can shove in a pair of hex core hyperthreaded processors,
> you're probably OK with number 2. If you're already considering 10GbE
> networking for the storage network, there's probably not much you can do
> about 3 unless you want to spend a lot more money (and the reason we're
> going so dense is to keep this as a cheap option). So the main thing would
> seem to be a real fear of 'losing' so much data in the event of a node
> failure. Who wants to wait 45 hours (probably much longer assuming the
> cluster remains live and has production traffic traversing that network) for
> the cluster to self-heal?
>
> But surely this fear is based on an assumption that in that time, you've not
> identified and replaced the failed chassis. That you would sit for 2-3 days
> and just leave the cluster to catch up, and not actually address the broken
> node. Given good data centre processes and a good stock of spare parts, isn't
> it more likely that you'd have replaced that node and got things back up and
> running in a matter of hours? In all likelihood, a node crash/failure is not
> likely to have taken out all, or maybe any, of the disks, and a new chassis
> can just have the JBODs plugged back in and away you go?
>
> I'm sure I'm missing some other pieces, but if you're comfortable with your
> hardware replacement processes, doesn't number 1 become a non-fear really? I
> understand that in some ways it goes against the concept of ceph being self
> healing, and that in an ideal world you'd have lots of lower density nodes
> to limit your failure domain, but when being driven by cost isn't this an OK
> way to look at things? What other glaringly obvious considerations am I
> missing with this approach?
>
> Darren
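For reference, here is roughly how the back-of-a-beer-mat numbers above work
out, as a small Python sketch. The 90 x 3TB, ~243TB usable and 75% fill
figures are taken from the mail; the 0.9 raw-to-usable factor and the
assumption that re-replication is limited only by a single uncontended
10Gbps link are my own simplifications, and real recovery will be slower
once client traffic and the per-disk IOPS limits mentioned above come into
play.

# Back-of-the-envelope recovery estimate for a full-node failure.
# Assumptions (not measurements): 90 x 3TB OSDs in one node, ~75% full,
# re-replication limited only by a single uncontended 10Gbps link.

raw_tb = 90 * 3                        # 270 TB raw per node
usable_tb = raw_tb * 0.9               # ~243 TB usable (the figure quoted above)
data_tb = usable_tb * 0.75             # ~182 TB of actual data at 75% fill

link_gbps = 10.0                       # dedicated 10GbE storage network
link_tb_per_hour = link_gbps / 8 / 1000 * 3600   # ~4.5 TB/h at line rate

hours = data_tb / link_tb_per_hour
print("~%.0f hours to re-replicate %.0f TB" % (hours, data_tb))
# ~40 hours at pure line rate; allowing for protocol and replication
# overhead gets you close to the ~45 hour beer-mat figure, and contention
# with client I/O or per-disk IOPS limits pushes it out much further.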