Story time. Over the past year or so, our datacenter had been undergoing the first of a series of renovations designed to add more power and cooling capacity. As part of these renovations, changes to the emergency power off (EPO) system necessitated that the system be tested. If you're unfamiliar, the EPO system is tied into the fire system and presented as an angry caged red button next to each exit, designed to *immediately* cut all power and backup power to the datacenter. The idea being that if there's a fire, or someone's being electrocuted, or some other life-threatening electrical shenanigans occur, power can be completely cut in one swift action. As this system hadn't been tested in about 10 years and a whole bunch of changes had been made due to the renovations, the powers that be scheduled downtime for all services one Saturday, at which time we would test the EPO and cut all power to the room.

On the appointed day, I shut down each of the 21 nodes and the 3 monitors in our cluster. A couple of hours later, after testing and some associated work had been completed, I powered the monitors back up and began turning on the nodes holding the spinning OSDs and their associated SSD journals. After pushing the power buttons, I sat down at the console and noticed something odd: only about 15% of the OSDs in the cluster had come back online. Checking the logs, I saw that the OSDs which had failed to start were complaining about not being able to find their associated journal partitions.

Fortunately, two things were true at this point. First and most importantly, I had split off 8 nodes which had not yet been added to the cluster and set up a second, separate cluster in another site, to which I had exported/imported the critical images (and diffs) from the primary cluster over the past few weeks. Second, I happened to have restarted a node a month or so prior which had presented the same symptoms, so I knew why this had happened.

When I first provisioned the cluster, I added the journals using the /dev/sd[a-z]+ identifier. On the first four nodes, which I had provisioned manually, this was fine. On subsequent nodes, I had used FAI Linux, Saltstack, and a Python script I wrote to automatically provision the OS and configuration and to add the OSDs and journals as they were inserted into the nodes. After a reboot on these nodes, the devices were reordered, and the OSDs subsequently couldn't find their journals. I had written a script to trickle remove/re-insert OSDs one by one with journals referenced by /dev/disk/by-id (which is a persistent identifier), but hadn't yet run it on the production cluster.
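For reference, checking how the throwaway /dev/sd* names currently line up with the persistent /dev/disk/by-id names only takes a few lines. Here's a minimal sketch (not my actual script) that prints the mapping, so you can save the output and diff it across a reboot:

    #!/usr/bin/env python3
    # Rough sketch: show which persistent /dev/disk/by-id names currently
    # resolve to which /dev/sd* devices (and partitions). Run it before and
    # after a reboot and diff the output; the by-id names stay put, the
    # /dev/sd* names may not.
    import os

    BY_ID = "/dev/disk/by-id"

    for name in sorted(os.listdir(BY_ID)):
        target = os.path.realpath(os.path.join(BY_ID, name))
        if target.startswith("/dev/sd"):
            print("%-12s <- %s" % (target, name))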
After some thought, I came up with a potential (if somewhat unpleasant) solution which would let me get the production cluster back into a running state quickly, without having to blow away the whole thing, re-provision, and restore the backups. I theorized that if I shut down a node, removed all the hot-swap disks (the OSDs and journals), booted the node, and then added the journals in the same order as I had when the node was first provisioned, the OS should give them the same /dev/sd[a-z]+ identifiers they had had pre-EPO. A quick test determined I was correct, and that I could restore the cluster to working order by applying the same operation to each node. Luckily, I had (mostly) added drives to each node in the same order, and where I hadn't, at least one journal ended up in the correct position, which let me work out the correct order for the other two. For example, if journal 2 was OK but 1 and 3 weren't when I had added them in order 1, 2, 3, I knew the correct order was 3, 2, 1.

After pulling and re-inserting 336 disks, I had a working cluster once again, except for one node where one journal had originally been /dev/sda, which was now half of the OS software RAID mirror. Breaking that mirror, toggling the /sys/block/sdX/device/delete flag on that disk, rescanning the bus, re-adding it to the RAID set when it came back as /dev/sds, and symlinking /dev/sda to the appropriate SSD fixed that last node (the sysfs part of that dance is sketched below). Needless to say, I started pulling that node, and subsequently the other nodes, out of the cluster and re-adding them with /dev/disk/by-id journals to prevent this from happening again.

So, a couple of lessons here. First, when adding OSDs with SSD journals, remember to use a device UUID (or another persistent identifier like /dev/disk/by-id), not /dev/sd[a-z]+, so you don't end up needing to spend three hours manually touching each disk in your cluster and even longer slowly shifting a couple hundred terabytes around while you fix the root cause. Second, establish standards early and stick to them. As much of a headache as pulling all the disks and re-inserting them was, it would have been much worse if they hadn't originally been inserted in the same order on (almost) all the nodes. Finally, backups are important. Having that safety net helped me focus on the solution rather than the problem, since I knew that if none of my ideas worked, I'd still be able to get the most critical data back.
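For the curious, the "toggle the delete flag and rescan the bus" part of that last fix is just a couple of sysfs writes. Here's a rough sketch (run as root; the mdadm re-add and the /dev/sda symlink aren't shown, and obviously double-check the device name before you feed it in):

    #!/usr/bin/env python3
    # Rough sketch: drop a disk off the SCSI bus via sysfs, then rescan
    # every SCSI host so the kernel re-probes it (it usually comes back
    # under the next free /dev/sd* name).
    import glob
    import sys

    def delete_disk(dev):
        # dev is a kernel name like "sdb"
        with open("/sys/block/%s/device/delete" % dev, "w") as f:
            f.write("1\n")

    def rescan_scsi_hosts():
        # "- - -" is the wildcard for channel/target/lun
        for path in glob.glob("/sys/class/scsi_host/host*/scan"):
            with open(path, "w") as f:
                f.write("- - -\n")

    if __name__ == "__main__":
        delete_disk(sys.argv[1])   # e.g. "sdb" -- be sure you mean it
        rescan_scsi_hosts()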
Hopefully this saves someone from making the same mistakes!

-Steve

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma310@xxxxxxxxxx