VM Data corruption shortly after Luminous Upgrade

Weird but very bad problem with my test cluster, 2-3 weeks after upgrading to Luminous.

All 7 running VMs are corrupted and unbootable: 6 Windows and 1 CentOS 7. The Windows error is “unmountable boot volume”; CentOS 7 will only boot to emergency mode.

3 VMs that were off during the event work as expected: 2 Windows and 1 Ubuntu.

 

History:

7-node cluster: 5 OSD, 3 MON (1 is a combined MON-OSD), plus 2 KVM nodes.

 

The system originally ran Jewel on old tower servers. It was migrated to all rackmount servers, then upgraded to Kraken, which added the MGR daemons.

 

On the 13th or 14th of October I upgraded to Luminous. The upgrade went smoothly: ceph versions showed all nodes running 12.2.1, HEALTH_OK. I even checked out the Ceph dashboard.

 

Then around the 20th I created a master image for cloning, spun off a clone, mucked around with it, flattened it so it was standalone, and shut both it and the master off.
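
For reference, the clone workflow was roughly the following; pool and image names here are placeholders, not the real ones:

$ rbd snap create rbd/master@base        # snapshot the master image
$ rbd snap protect rbd/master@base       # protect the snapshot so it can be cloned
$ rbd clone rbd/master@base rbd/clone1   # spin off the clone
$ rbd flatten rbd/clone1                 # flatten so the clone no longer depends on the parent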

 

Problem:

On November 1st I started the clone and got the following error.

“failed to start domain internal error: qemu unexpectedly closed the monitor vice virtio-balloon”

 

To resolve (restart MONs one at a time):

Restarted the 1st MON. Tried to start the clone. Same error.

Restarted the 2nd MON. All 7 running VMs shut off!

Restarted the 3rd MON. The clone now runs. Tried to start any of the 7 VMs that had been running: “Unmountable Boot Volume”.
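
By “restart” I mean nothing fancier than the stock systemd unit on each MON node, waiting for quorum to re-form before touching the next one; roughly:

$ systemctl restart ceph-mon@$(hostname -s)   # on the MON node being restarted
$ ceph -s                                     # confirm the MON rejoined quorum before moving on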

 

Pulled the logs on all nodes and am going through them.

So far I have found this:

 

terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer
terminate called recursively
2017-11-01 19:41:48.814+0000: shutting down, reason=crashed

 

Possible monmap corruption?
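
If it helps, I can dump the current monmap and eyeball it; roughly (the output path is arbitrary):

$ ceph mon getmap -o /tmp/monmap   # fetch the current monmap from the cluster
$ monmaptool --print /tmp/monmap   # decode it: epoch, fsid, and mon addresses should look sane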

Any insight would be greatly appreciated.

 

 

Hints?

After the Luminous upgrade, ceph osd tree showed nothing in the class column. After restarting the MONs, the MON-OSD node had “hdd” on each of its OSDs.

After restarting the entire cluster, all OSD servers had “hdd” in the class column. Not sure why this would not have happened right after the upgrade.
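
I assume the classes could also have been set by hand rather than waiting for restarts, something like (OSD ids are examples):

$ ceph osd crush set-device-class hdd osd.0 osd.1 osd.2   # tag OSDs with the hdd device class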

 

Also, after the restart the mgr servers failed to start: “key for mgr.HOST exists but cap mds does not match”.

Solved per https://www.seekhole.io/?p=12

$ ceph auth caps mgr.HOST mon 'allow profile mgr' mds 'allow *' osd 'allow *'

Again, not sure why this would not have manifested itself at the upgrade when all servers were restarted.
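
The resulting caps can be double-checked afterwards (HOST is a placeholder, as above):

$ ceph auth get mgr.HOST   # should now show mon 'allow profile mgr', mds 'allow *', osd 'allow *'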

 

-Jim

