Hi Marcus-

The two basic configurations:

 - OSD per disk (or smaller set of disks)
   - pro: efficiently utilizes disks
   - pro: isolates software faults
   - con: recovery copies objects over the network from other nodes
 - OSD per node + raid
   - pro: most disk failures do not translate to osd failures
   - pro: recovery is intra-node; does not incur any network traffic
   - con: you lose a few disks to parity
   - con: some fault isolation is lost (raid controller or daemon can fail)
   - con: some performance isolation is lost

The recovery tradeoff is a bit of a simplification, actually.  RAID
recovery walks the entire disk platter.  Ceph recovery copies just
objects, and it's N:N instead of N:1.  If a raid disk fails, the
controller will walk the entire platter on every disk in the set to
rebuild.  OTOH, if a Ceph osd fails, maybe ~100 nodes will copy 1/100th
of the lost data to ~100 other nodes (fail-in-place).  Or if/when the
disk is replaced, those ~100 nodes will copy back to the original
location.

The other thing to keep in mind is the math.  When it comes down to it,
for an installation of any size, 2x replication is pretty dangerous
because any double fault may result in some data loss.  (We should
probably just make 3x the default.)  But RAID can mitigate that
somewhat.  Instead of replicating across individual disks (which are not
terribly reliable), you can replicate across RAID5/6 pools of disks,
which are very reliable, which means you can probably get away with 2x
at scale _if_ you have some confidence in your ability to nurse an
ailing raid set to health for the purposes of recovery.

On Wed, 20 Jul 2011, Marcus Sorensen wrote:
> I don't really see us pooling individual drives via btrfs into raid1
> or raid10 for redundancy, between that and the hit on object
> replication we'd lose too much capacity.

I'd wait until the raid5/6 modes in btrfs are merged and stable
(probably another year or two).

> Also, I'd assume losing the OSD journal is a recoverable event?
> Just thinking about whether the SSD should be RAID-1, or going the
> other direction if ramdisk would be acceptable.

Losing a journal is fully recoverable if you're using btrfs.  The OSD's
data just warps a bit further back in time to the last consistency
point.  (There is a small risk currently that the pg logs will be too
aggressively trimmed to efficiently rejoin the cluster at that point,
but it's an easy fix.)

> Have I gone on long enough yet? :-) In short, the usage scenario
> we're toying with might be summed up as follows: We'd like to optimize
> for cluster stability and capacity, data loss is undesirable of
> course, but acceptable so long as the cluster itself can continue
> functioning and not bring everything to its knees (clients and all) if
> objects are lost. At this point we'll just have to try a few things,
> not knowing how failure scenarios might be expected to play out, but I
> thought I'd send out this request for comments.

You probably want 3x replication (or higher reliability) on metadata,
and 2x on file data.  The system does not like losing random pieces of
the namespace metadata; that's hard to cope with.  Losing random bits of
file data, on the other hand, does not impact the system's internal
integrity... just the user.

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
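
A back-of-envelope sketch of the N:N versus N:1 recovery comparison
made in the message above.  The disk size, fill ratio, raid set size and
peer count are made-up illustration values, and the function names are
hypothetical; this is not Ceph or raid controller code.  It counts only
how much data each device has to move, and ignores parity computation,
throttling and network overhead.

```python
def raid_rebuild_io(disk_tb, set_size):
    """N:1 rebuild: every surviving member is read end to end,
    and a single replacement disk absorbs all of the writes.

    Returns (TB read per surviving disk, TB written to the new disk,
    total TB read across the set).
    """
    survivors = set_size - 1
    read_per_survivor = disk_tb          # whole platter, used or not
    written_to_replacement = disk_tb     # one disk takes every write
    return read_per_survivor, written_to_replacement, read_per_survivor * survivors


def ceph_recovery_io(used_tb, peers):
    """N:N recovery: only stored objects are copied, and the copy is
    spread over `peers` sources and roughly as many destinations.

    Returns TB moved per participating node.
    """
    return used_tb / peers


if __name__ == "__main__":
    # Assumed values, for illustration only.
    disk_tb = 2.0        # raw capacity of the failed disk
    fill = 0.6           # fraction of the disk holding objects
    set_size = 8         # disks in the raid set
    peers = 100          # nodes sharing the Ceph recovery

    per_disk, to_new, total = raid_rebuild_io(disk_tb, set_size)
    print("raid: read %.2f TB on each of %d disks (%.2f TB total), "
          "write %.2f TB to one disk" % (per_disk, set_size - 1, total, to_new))

    per_peer = ceph_recovery_io(disk_tb * fill, peers)
    print("ceph: each of ~%d nodes copies about %.3f TB" % (peers, per_peer))
```

The point of the comparison is where the bottleneck sits: in the raid
case one replacement disk limits the rebuild, while in the Ceph case the
per-node share shrinks as the number of participating nodes grows.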