Story time. Over the past year or so, our datacenter had been undergoing the first of a series of renovations designed to add more power and cooling capacity. As part of these renovations, changes to the emergency power off (EPO) system necessitated that the system be tested. If you're unfamiliar, the EPO system is tied into the fire system and presented as an angry caged red button next to each exit, designed to *immediately* cut all power and backup power to the datacenter. The idea being that if there's a fire, or someone's being electrocuted, or some other life-threatening electrical shenanigans occur, power can be completely cut in one swift action. As this system hadn't been tested in about 10 years and a whole bunch of changes had been made due to the renovations, the powers that be scheduled downtime for all services one Saturday, at which time we would test the EPO and cut all power to the room.

On the appointed day, I shut down each of the 21 nodes and the 3 monitors in our cluster. A couple of hours later, after testing and some associated work had been completed, I powered the monitors back up and began turning on the nodes holding the spinning OSDs and their associated SSD journals. After pushing the power buttons, I sat down at the console and noticed something odd: only about 15% of the OSDs in the cluster had come back online. Checking the logs, I saw that the OSDs which had failed to start were complaining about not being able to find their associated journal partitions.

Fortunately, two things were true at this point. First and most importantly, I had split off 8 nodes which had not yet been added to the cluster and set up a second, separate cluster in another site, to which I had exported/imported the critical images (and diffs) from the primary cluster over the past few weeks. Second, I happened to have restarted a node a month or so prior which had presented the same symptoms, so I knew why this had happened.

When I first provisioned the cluster, I added the journals using the /dev/sd[a-z]+ identifier. On the first four nodes, which I had provisioned manually, this was fine. On subsequent nodes, I had used FAI Linux, Saltstack, and a Python script I wrote to automatically provision the OS and configuration and to add the OSDs and journals as they were inserted into the nodes. After a reboot on these nodes, the devices were reordered, and the OSDs subsequently couldn't find their journals. I had written a script to trickle remove/re-insert OSDs one by one with journals referenced by /dev/disk/by-id (which is a persistent identifier), but hadn't yet run it on the production cluster.
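For reference, checking how the throwaway /dev/sd* names currently line up with the persistent /dev/disk/by-id names only takes a few lines. Here's a minimal sketch (not my actual script) that prints the mapping, so you can save the output and diff it across a reboot:

    #!/usr/bin/env python3
    # Rough sketch: show which persistent /dev/disk/by-id names currently
    # resolve to which /dev/sd* devices (and partitions). Run it before and
    # after a reboot and diff the output; the by-id names stay put, the
    # /dev/sd* names may not.
    import os

    BY_ID = "/dev/disk/by-id"

    for name in sorted(os.listdir(BY_ID)):
        target = os.path.realpath(os.path.join(BY_ID, name))
        if target.startswith("/dev/sd"):
            print("%-12s <- %s" % (target, name))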
After some thought, I came up with a potential (if somewhat unpleasant) solution which would let me get the production cluster back into a running state quickly, without having to blow away the whole thing, re-provision, and restore the backups. I theorized that if I shut down a node, removed all the hot-swap disks (the OSDs and journals), booted the node, and then added the journals in the same order as I had when the node was first provisioned, the OS should give them the same /dev/sd[a-z]+ identifiers they had had pre-EPO. A quick test determined I was correct, and that I could restore the cluster to working order by applying the same operation to each node. Luckily, I had (mostly) added drives to each node in the same order, and where I hadn't, at least one journal ended up in the correct position, which let me work out the correct order for the other two. For example, if journal 2 was OK but 1 and 3 weren't when I had added them in order 1, 2, 3, I knew the correct order was 3, 2, 1.

After pulling and re-inserting 336 disks, I had a working cluster once again, except for one node where one journal had originally been /dev/sda, which was now half of the OS software RAID mirror. Breaking that mirror, toggling the /sys/block/sdX/device/delete flag on that disk, rescanning the bus, re-adding it to the RAID set when it came back as /dev/sds, and symlinking /dev/sda to the appropriate SSD fixed that last node (the sysfs part of that dance is sketched below). Needless to say, I started pulling that node, and subsequently the other nodes, out of the cluster and re-adding them with /dev/disk/by-id journals to prevent this from happening again.

So, a couple of lessons here. First, when adding OSDs with SSD journals, remember to use a device UUID (or another persistent identifier like /dev/disk/by-id), not /dev/sd[a-z]+, so you don't end up needing to spend three hours manually touching each disk in your cluster and even longer slowly shifting a couple hundred terabytes around while you fix the root cause. Second, establish standards early and stick to them. As much of a headache as pulling all the disks and re-inserting them was, it would have been much worse if they hadn't originally been inserted in the same order on (almost) all the nodes. Finally, backups are important. Having that safety net helped me focus on the solution rather than the problem, since I knew that if none of my ideas worked, I'd still be able to get the most critical data back.
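For the curious, the "toggle the delete flag and rescan the bus" part of that last fix is just a couple of sysfs writes. Here's a rough sketch (run as root; the mdadm re-add and the /dev/sda symlink aren't shown, and obviously double-check the device name before you feed it in):

    #!/usr/bin/env python3
    # Rough sketch: drop a disk off the SCSI bus via sysfs, then rescan
    # every SCSI host so the kernel re-probes it (it usually comes back
    # under the next free /dev/sd* name).
    import glob
    import sys

    def delete_disk(dev):
        # dev is a kernel name like "sdb"
        with open("/sys/block/%s/device/delete" % dev, "w") as f:
            f.write("1\n")

    def rescan_scsi_hosts():
        # "- - -" is the wildcard for channel/target/lun
        for path in glob.glob("/sys/class/scsi_host/host*/scan"):
            with open(path, "w") as f:
                f.write("- - -\n")

    if __name__ == "__main__":
        delete_disk(sys.argv[1])   # e.g. "sdb" -- be sure you mean it
        rescan_scsi_hosts()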
Hopefully this saves someone from making the same mistakes!

-Steve

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma310@xxxxxxxxxx