Re: Boot volume on OSD device

On Jan 18, 2019, at 10:58 AM, Hector Martin <hector@xxxxxxxxxxxxxx> wrote:

> Just to add a related experience: you still need 1.0 metadata (that's
> the 1.x variant at the end of the partition, like 0.90) for an
> mdadm-backed EFI system partition if you boot using UEFI. This generally
> works well, except on some Dell servers where the firmware inexplicably
> *writes* to the ESP, messing up the RAID mirroring.

I love this list. You guys are great. I have to admit I was kind of intimidated at first; I felt a little unworthy in the face of such cutting-edge tech. Thanks to everyone who's helped with my posts.

Hector, one of the things I was thinking through last night, and finally pulled the trigger on today, was the overhead of the various subsystems. LVM itself does not add much overhead, but tiny initial mistakes compound into a lot of wasted CPU over the lifetime of a deployment. So I wanted to review everything, and I thought I would share my notes here.

My main constraint is that I started with four disks in a single machine, and any one of those disks should be able to fail without affecting the machine's ability to boot, with the bad disk replaceable without obscure admin skills and a final recovery to the promised land of “HEALTH_OK”. A single-machine Ceph deployment is not much better than plain local storage, except for the ability to scale out later. That's the use case I'm addressing here.

The first thing I explored was how to strike a good balance between safety for the mon logs, disk usage, and performance for the boot partitions. As I learned, an OSD can fit in a single partition with no spillover, so I had three partitions to work with. `inotifywait -mr /var/lib/ceph/` gave me a good handle on what was being written and with what frequency, and I could see that the log traffic was almost entirely writes.
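
In case anyone wants to repeat the measurement, the watch I ran was essentially the command above; something like the following sketch also timestamps each event so the frequency falls out of the log (the extra flags and the output path are my additions, nothing Ceph-specific):

    # watch the Ceph state directory recursively and timestamp every event
    inotifywait -m -r /var/lib/ceph/ \
        -e modify -e create -e delete \
        --timefmt '%H:%M:%S' --format '%T %e %w%f' \
        | tee /tmp/ceph-write-trace.log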

https://theithollow.com/2012/03/21/understanding-raid-penalty/ gave me background I did not previously have on the RAID write penalty. I combined this with what I learned in https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328. By the end of those two articles I felt like I knew all the tradeoffs, but the final decision really came down to the penalty table in the first one: a write penalty of 2 for RAID 10, the same as for RAID 1, but with 50% usable capacity on four disks instead of 25%.
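
To make the penalty table concrete, this is roughly the arithmetic I was doing (the per-disk IOPS figure is just illustrative for spinning disks, not a measurement):

    Effective write IOPS ≈ (disks × per-disk write IOPS) / write penalty

    RAID 1  (4 disks):  (4 × 150) / 2 = 300 IOPS,  25% usable capacity
    RAID 10 (4 disks):  (4 × 150) / 2 = 300 IOPS,  50% usable capacity
    RAID 5  (4 disks):  (4 × 150) / 4 = 150 IOPS,  75% usable capacity
    RAID 6  (4 disks):  (4 × 150) / 6 = 100 IOPS,  50% usable capacity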

For the boot partition there are fewer choices. Anything other than RAID 1 will not keep every copy of /boot both up to date and ready to seamlessly restart the machine after a disk failure. Combined with RAID 10 for the root partition, that leaves a configuration that can reliably boot through any single drive failure (maybe two; I don't know offhand how mdadm copes with the “less than perfect storm” of losing one member from each mirror pair, as opposed to losing both members of the same pair…)
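
For anyone following along, the array creation boils down to something like this (a sketch from memory rather than my exact history; the md0/md1 names are placeholders, and partition 1 is /boot and partition 2 is root in my layout):

    # /boot: RAID 1 across all four disks, so the machine can boot from any of them
    mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1

    # root: RAID 10 across all four disks, write penalty of 2 and half the raw capacity
    mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[abcd]2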

With this setup each disk uses exactly two partitions so far, and mdadm is using the latest MD metadata (1.2), because GRUB2 knows how to deal with it. As well, `sfdisk -l /dev/sd[abcd]` shows the first partition on every disk flagged as bootable. Milestone 1 success!
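
The checks behind that milestone were roughly these (a sketch; the device names assume sda through sdd):

    cat /proc/mdstat            # both arrays present, all members up [UUUU]
    mdadm --detail /dev/md0     # reports the metadata version and array state
    sfdisk -l /dev/sda          # partition 1 carries the boot flag; same for sdb/sdc/sdd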

The next piece I was unsure of, but didn't want to spam the list with something I could just try, was how many partitions an OSD would use. Hector mentioned that he was using LVM for his Bluestore volumes. I privately wondered about the value of creating LVM VGs when the groups don't span disks, but that is exactly what the documented `ceph-deploy osd create` command does when creating Bluestore OSDs. Wiring up LVM by hand is not rocket science, but I wanted to avoid as many manual steps as possible. This was a biggie.
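
For the record, the documented form is what I ran once per disk, pointed at the partition I had left free for the OSD (the hostname and partition number here are just from my layout; adjust to taste):

    # ceph-volume wraps each partition in its own single-PV volume group
    # and creates the Bluestore data LV inside it
    ceph-deploy osd create --data /dev/sda3 node1
    ceph-deploy osd create --data /dev/sdb3 node1
    ceph-deploy osd create --data /dev/sdc3 node1
    ceph-deploy osd create --data /dev/sdd3 node1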

And after adding the OSD partitions one after the other, “HEALTH_OK”. w00t!!! Final Milestone Success!!

I know there’s no perfect starter configuration for every hardware environment, but I thought I would share exactly what I ended up with here for future seekers. This has been a fun adventure. 

Next up: Convert my existing two pre-production nodes to use this layout. Fortunately there's nothing on the second node except Ceph, and I can take that one down pretty easily. It will be good practice to gracefully shut down the four OSDs on that node without losing any data, reformat the node with this pattern, bring the cluster back to health, then migrate the mon (and the workloads) to it while I do the same for the first node. With that, I'll be able to remove these satanic SATADOMs and get back to some real work!!
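
My working notes for that graceful drain look something like this (a sketch, assuming the four OSDs on that node are ids 4 through 7; I'll be watching `ceph -s` between steps):

    # mark the node's OSDs out; on my two-node setup the PGs will run
    # degraded on the other node until the rebuilt node rejoins
    for id in 4 5 6 7; do
        ceph osd out $id
    done

    # once ceph -s settles, stop the daemons and remove the OSDs
    for id in 4 5 6 7; do
        systemctl stop ceph-osd@$id
        ceph osd purge $id --yes-i-really-mean-it
    done

    # reformat the node with the boot/root/OSD layout above,
    # then re-create the OSDs with ceph-deploy as before
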
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
