Re: Boot volume on OSD device

On 20/01/2019 05.50, Brian Topping wrote:
> My main constraint is I had four disks on a single machine to start with
> and any one of the disks should be able to fail without affecting the
> ability for the machine to boot, the bad disk replaced without requiring
> obscure admin skills, and the final recovery to the promised land of
> “HEALTH_OK”. A single machine Ceph deployment is not much better than
> just using local storage, except the ability to later scale out. That’s
> the use case I’m addressing here.

I assume partitioning the drive and using mdadm to add it to one or
more RAID arrays and then dealing with the Ceph side doesn't qualify as
"obscure admin skills", right? :-)

(I also use single-host Ceph deployments; I like its properties over
traditional RAID or things like ZFS).

> https://theithollow.com/2012/03/21/understanding-raid-penalty/ provided
> a good background that I did not previously have on the RAID write
> penalty. I combined this with what I learned
> in https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328.
> By the end of these two articles, I felt like I knew all the tradeoffs,
> but the final decision really came down to the penalty table in the
> first article and a “RAID penalty” of 2 for RAID 10, which was the same
> as the penalty for RAID 1, but with 50% better storage efficiency.

FWIW, I disagree with that article on RAID write penalty. It's an
oversimplification and the math doesn't really add up. I don't like the
way they define the concept of "write penalty" relative to the sum of
disk performance. It should be relative to a single disk.

Here's my take on it. First of all, you need to consider three different
performance metrics for writes:

- Sequential writes (seq)
- Random writes < stripe size (small)
- Random writes >> stripe size or aligned (large)

* stripe size is the size across all disks for RAID5/6, but a single
disk for RAID0

And here is the performance, where n is the number of disks, relative to
a single disk of the same type:

	seq	small	large
RAID 0	n	n	1
RAID 1	1	1	1
RAID 5	n-1	0.5	1
RAID 6	n-2	0.5	1
RAID 10	n/2	n/2	1
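
If it helps, here's the same table as a throwaway Python sketch; it's
nothing official, just my numbers from above wrapped in a made-up
helper:

# My write-performance table from above, relative to a single disk of
# the same type. Purely illustrative; n is the number of disks.
def relative_write_perf(level, n):
    table = {
        "raid0":  (n,     n,     1),
        "raid1":  (1,     1,     1),
        "raid5":  (n - 1, 0.5,   1),
        "raid6":  (n - 2, 0.5,   1),
        "raid10": (n / 2, n / 2, 1),
    }
    seq, small, large = table[level]
    return {"seq": seq, "small": small, "large": large}

print(relative_write_perf("raid10", 4))
# {'seq': 2.0, 'small': 2.0, 'large': 1}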

RAID0 gives a throughput improvement proportional to the number of
disks, and the same small IOPS improvement *on average* (assuming your
I/Os hit all the disks equally, not like repeatedly hammering one stripe
chunk). There is also some loss of performance because whenever an I/O
hits multiple disks, the *slowest* disk becomes the bottleneck: if the
worst-case latency of a single disk is 10ms, its average latency is 5ms
(assuming latency is spread uniformly between 0 and 10ms), but the
average latency of the slowest of two disks is about 6.7ms, of three
disks 7.5ms, and so on, approaching 10ms as you add disks.
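
If you want to check those latency numbers: assuming per-I/O latency is
spread uniformly between 0 and 10ms (my simplification), the expected
latency of the slowest of n disks works out to 10 * n / (n + 1), e.g.
in Python:

# Expected latency of the slowest of n disks, assuming each disk's
# per-I/O latency is uniform on [0, 10] ms (a simplifying assumption).
for n in (1, 2, 3, 10):
    print(f"{n} disk(s): {10 * n / (n + 1):.1f} ms")
# 1 disk(s): 5.0 ms
# 2 disk(s): 6.7 ms
# 3 disk(s): 7.5 ms
# 10 disk(s): 9.1 ms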

RAID1 is just like using a single disk, really. All the disks do the
same thing in parallel. That's it.

RAID5 has the same sequential improvement as RAID0, except with one
fewer disk, because parity takes one disk. However, small writes become
read-modify-write operations (it has to read the old data and parity to
update the parity), so you get half the IOPS of a single disk. If your
write is stripe-aligned, this penalty goes away; misaligned writes
larger than several stripes amortize the penalty (it only hits the
beginning and end), so performance approaches 1 as your write size
increases, and exceeds it once the sequential effect starts to
dominate.
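
To spell out where the 0.5 comes from, here's the per-disk I/O
accounting for one small RAID5 write (just my reasoning above written
out, nothing more):

# One small (sub-stripe) RAID5 write turns into:
#   data disk:    read old data   + write new data   = 2 ops
#   parity disk:  read old parity + write new parity = 2 ops
# The disks work in parallel, but each involved disk does 2 ops per
# logical write, versus 1 op per write on a plain single disk.
single_disk_ops_per_write = 1
raid5_ops_per_disk_per_write = 2
print(single_disk_ops_per_write / raid5_ops_per_disk_per_write)  # 0.5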

RAID6 is like RAID5 but with two parity disks. You still need a
(parallel) read and a (parallel) write for every small write.

RAID10 is just a RAID0 of RAID1s, so you ignore half the disks (the
mirrors) and the rest behave like RAID0.

The large/aligned I/O performance is identical to a single disk across
all RAID levels, because when your I/Os are larger than one stripe, then
*all* disks across the RAID have to handle the I/O (in parallel).

This is all assuming no controller or CPU bottlenecking. Realistically,
with software RAID and a non-terrible HBA, this is a pretty reasonable
assumption. There will be some extra overhead, but not much. Also, some
of the impact of RAID5/6 will be reduced by caching (hardware cache with
hardware RAID, or software stripe cache with md-raid).

This is all still a simplification in some ways, but I think it's closer
to reality than that article.

(Somewhat off-topic for this list, but after seeing that article I felt
I had to take my own stab at the math here.)

Personally, I've had a few set-ups like yours, and this is what I did:

- On a production cluster with several OSD nodes with 4 disks each (and
no dedicated boot drive), I used a 4-disk RAID1 for /boot and a 2-disk
RAID1 with 2 spares for /. This arguably provides a bit more
fail-safety, in that the RAID will auto-recover onto the spares when
something goes wrong (instead of having to wait for a human to fix
things). You could use a 4-disk RAID1 instead, but there is some minor
penalty (not detailed in my explanation above) for replicating all
writes across 4 disks, and I felt it wasn't necessary. The root
partition is small, so recovery is fast.

- On a personal single-host cluster with 15 drives on an external disk
enclosure (which was *supposed* to have internal boot drives, but a
SNAFU meant it initially didn't), I booted off of two USB drives (RAID1
/boot partition), carved a small LV out of each of osd.0 and osd.1
(both full-disk LVM OSDs created with ceph-volume, no partitions), and
made an md-RAID1 out of those. This was a temporary setup; it's fixed
now. If it were permanent I probably would've RAIDed across more of the
OSDs.

- On another personal single-host cluster (that machine also does a
bunch of other stuff), root/boot is on a pair of SSDs in md-raid1, but I
have each of the 8 OSD disks set up as an LVM VG with two LVs: the OSD
proper and a component of an md-raid6 array. The reason is to keep a
small portion of each disk as traditional RAID in case I run into use
cases where Ceph doesn't meet my performance needs (I am migrating this
host from a big RAID6 setup to Ceph).

If you're running a single-host cluster, then having the mon survive is
*very important*. Disk redundancy in Ceph is often 3x replication or
erasure coding with m=2, so it's somewhat silly to have the mon on a
2-disk RAID1 (or RAID10) if you have more than 2 disks available for
it. If you're RAIDing your root filesystem across the OSD disks, I
think it makes more sense to set up e.g. a three-disk RAID1 with a
spare. That way you can lose any two disks and not lose your entire
cluster.
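
As a sanity check of the "lose any two disks" claim, a quick
brute-force enumeration (the disk labels are hypothetical: three mirror
members plus one spare):

from itertools import combinations

# Hypothetical layout: 3-way RAID1 members a, b, c plus hot spare d.
# The array (and thus the mon/rootfs) survives as long as at least one
# mirror member is still alive.
mirror = {"a", "b", "c"}
disks = {"a", "b", "c", "d"}

for failed in combinations(disks, 2):
    ok = bool(mirror - set(failed))
    print(failed, "->", "ok" if ok else "lost")
# Every one of the 6 two-disk combinations prints "ok".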

I think there isn't much merit to RAID10 for a Ceph host's rootfs,
unless you're doing other things on the machine and want the extra
storage efficiency or performance for other reasons. Might as well keep
things simpler (and more flexible) with some variant of RAID1 on some of
the disks and the rest as hot spares. Keep in mind that RAID10 has
lower reliability than even a 2-disk RAID1. With a 2-disk RAID1, you
can lose half of your disks (i.e. one). With a 4-disk RAID10, you can
lose any one disk, but after that only two of the three remaining disks
are safe to lose; the third (the failed disk's mirror partner) takes
the array with it, so a second random failure has a 1/3 chance of
losing all your data. And the more disks you have, the more likely it
is that one of them fails.
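
And the matching check for the 1/3 figure on a 4-disk RAID10 (again
with hypothetical labels, two mirror pairs):

from itertools import combinations

# Hypothetical 4-disk RAID10: mirror pairs (a1, a2) and (b1, b2).
# The array dies once both members of either pair have failed.
pairs = [{"a1", "a2"}, {"b1", "b2"}]
disks = {"a1", "a2", "b1", "b2"}

fatal = sum(1 for failed in combinations(disks, 2)
            if any(pair <= set(failed) for pair in pairs))
print(f"{fatal} of 6 two-disk losses are fatal")  # 2 of 6, i.e. 1/3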

> The next piece I was unsure of but didn’t want to spam the list with
> stuff I could just try was how many partitions an OSD would use. Hector
> mentioned that he was using LVM for Bluestore volumes. I privately
> wondered the value in creating LVM VGs when groups did not span disks.
> But this is exactly what the `ceph-deploy osd create` command as
> documented does in creating Bluestore OSDs. Knowing how to wire LVM is
> not rocket science, but if possible, I wanted to avoid as many manual
> steps as possible. This was a biggie.

Ah, I guess ceph-deploy just makes an LVM VG for you with ceph-volume.
So you ended up with two partitions for the /boot and / RAIDs plus a
partition for the OSD as an LVM PV, per disk? Yeah, that works; it's
less flexible than using LVM for your non-Ceph uses too (except the
ESP, if you need one), but probably easier to set up than
md-raid-on-lvm, which is what I use. (Actually, what I have on one
machine is lvm-on-dmcrypt-on-md-raid-on-lvm, which as you can imagine
required some tweaking of startup scripts to make it work with LVM on
both ends!)

Ultimately a lot of this is dictated by whatever tools you feel
comfortable using :-)

-- 
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub