On 20/01/2019 05.50, Brian Topping wrote:
> My main constraint is I had four disks on a single machine to start with
> and any one of the disks should be able to fail without affecting the
> ability for the machine to boot, the bad disk replaced without requiring
> obscure admin skills, and the final recovery to the promised land of
> “HEALTH_OK”. A single machine Ceph deployment is not much better than
> just using local storage, except the ability to later scale out. That’s
> the use case I’m addressing here.

I assume partitioning the drives, using mdadm to add them to one or more RAID arrays, and then dealing with the Ceph side doesn't qualify as "obscure admin skills", right? :-) (I also use single-host Ceph deployments; I like Ceph's properties over traditional RAID or things like ZFS.)

> https://theithollow.com/2012/03/21/understanding-raid-penalty/ provided
> a good background that I did not previously have on the RAID write
> penalty. I combined this with what I learned in
> https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328.
> By the end of these two articles, I felt like I knew all the tradeoffs,
> but the final decision really came down to the penalty table in the
> first article and a “RAID penalty” of 2 for RAID 10, which was the same
> as the penalty for RAID 1, but with 50% better storage efficiency.

FWIW, I disagree with that article on the RAID write penalty. It's an oversimplification and the math doesn't really add up. I don't like the way they define the concept of "write penalty" relative to the sum of the disks' performance; it should be relative to a single disk. Here's my take on it.

First of all, you need to consider three different performance metrics for writes:

- Sequential writes (seq)
- Random writes < stripe size (small)
- Random writes >> stripe size, or stripe-aligned (large)

(* "stripe size" here means the full stripe across all disks for RAID5/6, but a single disk's chunk for RAID0.)

And here is the performance, where n is the number of disks, relative to a single disk of the same type:

            seq     small   large
  RAID 0     n        n       1
  RAID 1     1        1       1
  RAID 5    n-1      0.5      1
  RAID 6    n-2      0.5      1
  RAID 10   n/2      n/2      1

RAID0 gives a throughput improvement proportional to the number of disks, and the same improvement in small-write IOPS *on average* (assuming your I/Os hit all the disks equally, not e.g. repeatedly hammering one stripe chunk). There is also some loss of performance because whenever an I/O hits multiple disks, the *slowest* disk becomes the bottleneck: if the worst-case latency of a single disk is 10ms, its average latency is 5ms, but the average latency of the slowest of two disks is about 6.7ms, of three disks 7.5ms, etc., approaching 10ms as you add disks.

RAID1 is just like using a single disk, really. All the disks do the same thing in parallel. That's it.

RAID5 has the same sequential improvement as RAID0, except with one fewer disk, because parity takes one disk's worth of capacity. However, small writes become read-modify-write operations (the array has to read the old data and old parity to compute the new parity), so you get half the IOPS. If your write is stripe-aligned this penalty goes away, and misaligned writes spanning several stripes amortize the penalty (it only hits the beginning and the end), so performance approaches 1 as your write size increases, and eventually exceeds it as the sequential effect starts to dominate.

RAID6 is like RAID5 but with two parity disks. You still need a (parallel) read and a (parallel) write for every small write.
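To put some rough numbers on that read-modify-write cost, here's a quick back-of-the-envelope sketch (just my own illustration, under the same simplifying assumptions as above: no caching, a single sub-stripe-sized write; the 4-disk array is just an example). It counts the physical disk I/Os behind one small logical write, which is where the 0.5 for RAID5/6 in the table comes from:

def small_write_ios(level, n):
    """Physical I/Os caused by ONE small (sub-stripe) logical write."""
    if level == "raid0":
        return {"reads": 0, "writes": 1}  # hits one chunk on one disk
    if level == "raid1":
        return {"reads": 0, "writes": n}  # same data mirrored to every disk, in parallel
    if level == "raid5":
        # read old data + old parity, then write new data + new parity
        return {"reads": 2, "writes": 2}
    if level == "raid6":
        # read old data + P + Q, then write new data + P + Q
        return {"reads": 3, "writes": 3}
    raise ValueError(level)

for level in ("raid0", "raid1", "raid5", "raid6"):
    ios = small_write_ios(level, n=4)
    # The reads land on different disks in parallel, and so do the writes,
    # so RAID5/6 pay ~2 serialized disk accesses per small write instead of 1.
    phases = (1 if ios["reads"] else 0) + 1
    print(f"{level}: {ios} -> ~{phases} serialized access(es) per small write")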
RAID10 is just a RAID0 of RAID1s, so you ignore half the disks (the mirrors) and the rest behave like RAID0.

The large/aligned I/O performance is identical to a single disk across all RAID levels, because when your I/Os are larger than one stripe, *all* disks in the RAID have to handle the I/O (in parallel).

This all assumes no controller or CPU bottlenecking. Realistically, with software RAID and a non-terrible HBA, that is a pretty reasonable assumption; there will be some extra overhead, but not much. Also, some of the impact on RAID5/6 will be reduced by caching (the hardware cache with hardware RAID, or the software stripe cache with md-raid).

This is all still a simplification in some ways, but I think it's closer to reality than that article.

(Somewhat off-topic for this list, but after seeing that article I felt I had to take my own stab at the math.)

Personally, I've had a few set-ups like yours, and this is what I did:

- On a production cluster with several OSD hosts with 4 disks each (and no dedicated boot drive), I used a 4-disk RAID1 for /boot and a 2-disk RAID1 with 2 spares for /. This arguably provides a bit more fail-safe reliability, in that the RAID will auto-recover onto the spares when something goes wrong (instead of having to wait for a human to fix things). You could use a 4-disk RAID1 instead, but there is a minor penalty (not covered in my explanation above) for replicating every write across 4 disks, and I felt it wasn't necessary. The root partition is small, so recovery is fast.

- On a personal single-host cluster with 15 drives in an external disk enclosure, which was *supposed* to have internal boot drives but due to a SNAFU initially didn't, I booted off of two USB drives (RAID1 /boot partition), then carved a bit of osd.0 and osd.1 (both full-LVM disks set up with ceph-volume, no partitions) out into a separate LV on each and made an md-RAID1 out of those. This was a temporary setup and it's fixed now; if it were permanent I probably would've RAIDed across more of the OSDs.

- On another personal single-host cluster (that machine also does a bunch of other stuff), root/boot is on a pair of SSDs in md-raid1, but I also have each of the 8 OSD disks set up as an LVM VG with two LVs: the OSD proper and a component of an md-raid6 array. The reason is to keep a small portion of the disks as a traditional RAID in case I end up with use cases where Ceph doesn't meet my performance needs (I am migrating this host from a setup using a big RAID6 to Ceph).

If you're running a single-host cluster, then having the mon survive is *very important*. Disk redundancy in Ceph is often 3x replication or erasure coding with m=2, so it's somewhat silly to have the mon on a 2x RAID1 (or RAID10) if you have more than 2 disks available for it. If you're RAIDing your root filesystem across the OSD disks, I think it makes more sense to set up e.g. a three-disk RAID1 with a spare. That way you can lose any two disks and still not lose your entire cluster.

I don't think there is much merit to RAID10 for a Ceph host's rootfs, unless you're doing other things on the machine and want the extra storage efficiency or performance for those. Might as well keep things simpler (and more flexible) with some variant of RAID1 on some of the disks and the rest as hot spares.

Keep in mind that RAID10 has lower reliability than even a 2-disk RAID1. With a 2-disk RAID1, you can lose half of your disks (one). With a 4-disk RAID10, you can lose any one disk, but after that only one more disk can safely fail, and it has to be one of two specific disks out of the remaining three (not the first disk's mirror partner), so a second failure has a 1/3 chance of losing all your data. I.e. the more disks you have, the more likely it is that some of them fail.
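If you want to double-check that 1/3 figure, here's a quick brute-force enumeration (a throwaway sketch of my own, nothing clever):

from itertools import permutations

# 4-disk RAID10: two mirror pairs, disks 0+1 and 2+3.
pairs = [{0, 1}, {2, 3}]

def survives(failed):
    """The array survives as long as no mirror pair has lost both members."""
    return all(not pair <= set(failed) for pair in pairs)

cases = list(permutations(range(4), 2))  # (first failure, second failure)
fatal = sum(1 for first, second in cases if not survives([first, second]))
print(f"second failure is fatal in {fatal}/{len(cases)} cases")
# -> 4/12, i.e. 1/3. A 3-disk RAID1 (plus a spare) on the same four disks
# survives *any* two failures, which is why I prefer it for the rootfs/mon.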
> The next piece I was unsure of but didn’t want to spam the list with
> stuff I could just try was how many partitions an OSD would use. Hector
> mentioned that he was using LVM for Bluestore volumes. I privately
> wondered the value in creating LVM VGs when groups did not span disks.
> But this is exactly what the `ceph-deploy osd create` command as
> documented does in creating Bluestore OSDs. Knowing how to wire LVM is
> not rocket science, but if possible, I wanted to avoid as many manual
> steps as possible. This was a biggie.

Ah, I guess ceph-deploy just makes an LVM VG for you via ceph-volume. So you ended up with two partitions per disk for the /boot and / RAIDs, plus a partition for the OSD as an LVM PV? Yeah, that works; it's less flexible than also using LVM for your non-Ceph partitions (except the ESP, if you need one), but probably easier to set up than the md-raid-on-lvm that I use (actually, what I have on one machine is lvm-on-dmcrypt-on-md-raid-on-lvm, which as you can imagine required some tweaking of startup scripts to make it work with LVM on both ends!).

Ultimately a lot of this is dictated by whatever tools you feel comfortable using :-)

-- 
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com