Dear all,
I want to share my experience of upgrading my experimental single-host
Ceph cluster from v13.2.0 to v13.2.1.
First, I fetched the new packages and installed them using 'apt
dist-upgrade', which went smoothly as usual.
Then I noticed from 'lsof' that the Ceph daemons had not been restarted
after the upgrade ('ceph osd versions' still showed 13.2.0).
Following the instructions for the Luminous->Mimic upgrade, I decided to
restart the ceph-{mon,mgr,osd}.target units.
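For reference, the restarts looked roughly like this (just a sketch; these
are the stock systemd target units shipped with the Ceph packages):

    # restart the monitor, manager and all OSD daemons on this host
    systemctl restart ceph-mon.target
    systemctl restart ceph-mgr.target
    systemctl restart ceph-osd.target
    # then confirm the OSDs report the new version
    ceph osd versions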
And sure enough, on restarting ceph-osd.target, the iSCSI sessions were
broken on the tcmu-runner side ('Timing out cmd', 'Handler connection
lost'), and the Windows (2008 R2) clients lost their iSCSI devices.
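(The messages above came from tcmu-runner's log; on my host it runs as a
stock systemd service, so something along these lines shows them, assuming
the default unit name:)

    # watch tcmu-runner's log on the Ceph host while the OSDs restart
    journalctl -u tcmu-runner -f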
But that was only the beginning of the surprises that followed.
Looking into Windows Disk Management, I noticed that the iSCSI disks had
been re-detected with a size about 0.12 GB larger, i.e. 2794.52 GB instead
of 2794.40 GB, and of course the system lost sight of their GPT labels. I
quickly checked 'rbd info' on the Ceph side and did not notice any
increase in the RBD images. They were still exactly 715398 4 MB objects,
as I had intended initially.
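For completeness, the check on the Ceph side was along these lines (the
pool/image name 'rbd/win-disk1' is just a placeholder, not the real one):

    # object count and object size are unchanged
    rbd info rbd/win-disk1
    # exact provisioned size in bytes, if finer granularity is needed
    rbd info rbd/win-disk1 --format json --pretty-format | grep '"size"'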
Restarting the iSCSI Initiator service on Windows did not help. Rebooting
Windows entirely did not help. Restarting tcmu-runner on the Ceph side did
not help. What resolved the problem, to my great surprise, was
_removing/re-adding the MPIO feature and re-adding iSCSI multipath support_.
After that, Windows detected the iSCSI disks with the proper size again
and restored visibility of the GPT partitions, dynamic disk metadata and
all the rest.
OK, I avoided data loss this time, but I have some remaining questions:
1. Can Ceph minor version upgrades be made less disruptive and
traumatic? For example, some kind of staged/rolling restart of the OSD
daemons within a single upgraded host, without losing librbd sessions?
(I sketch what I mean below, after question 3.)
2. Is Windows (2008 R2) MPIO support really that screwed up and crippled?
Were there any improvements in Win2012/2016? I have physical servers
with Windows 2008 R2, and I would like to mirror their volumes to Ceph
iSCSI targets and then convert them into QEMU/KVM virtual machines where
the same data will be accessed with librbd. During my initial
experiments, I found that reinstalling MPIO and re-enabling iSCSI
multipath fixed most problems with Windows iSCSI access, but I would
like a faster way of resetting the iSCSI+MPIO state when something goes
wrong on the Windows side, as in this case.
3. Does anybody have an idea where these 0.12 GB (probably 120 or 128 MB)
came from?
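To clarify what I mean by "staged/rolling" in question 1, I am thinking of
something like the usual noout + per-OSD restart dance (a sketch only; the
OSD ids are placeholders and the "settle" check is a crude grep over the
'ceph pg stat' output):

    # prevent OSDs from being marked out while they bounce one by one
    ceph osd set noout
    for id in 0 1 2 3; do
        systemctl restart ceph-osd@$id
        # crude settle check: wait until all PGs report active+clean
        # again before touching the next OSD
        until ceph pg stat | grep -Eq '([0-9]+) pgs: \1 active\+clean'; do
            sleep 10
        done
    done
    ceph osd unset noout

What I do not know is whether this keeps the librbd/tcmu-runner sessions
alive any better than restarting ceph-osd.target as a whole.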
Thank you in advance for your responses.