Re: tcmu-runner crashing on 16.2.5

Xiubo,

Thank you for all the help so far. I was finally able to figure out what the trigger for the issue was and how to make sure it doesn’t happen - at least not in a steady state. There is still the possibility of running into the bug in a failover scenario of some kind, but at least for now I think I’m stable.

I now have two iSCSI gateways running, and I’m no longer seeing the locks flap back and forth between the two after making a change on the ESXi cluster that I’ll describe below.

I have 50 ESXi hosts communicating with the Ceph cluster. For some reason, some of the hosts did not see the full list of paths to all the iSCSI gateways. In my case, each host should have seen a total of 44 paths across all the LUNs, but some were only seeing 32 or 37 (or some other number). If one of the paths a host wasn’t seeing happened to be the primary path for a LUN, that host used another path instead. This appears to be what was causing the images to flap back and forth between the two gateways. Once I went through each host and manually rescanned the adapter to discover all the available paths after adding the second iSCSI gateway, everything stabilized. If even one host in the environment doesn’t see all the paths, the flapping occurs.
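
A minimal sketch of the check described above, assuming the per-host path counts have already been collected by some other means. The host names, the sample counts, and the find_hosts_missing_paths helper are hypothetical; only the expected total of 44 comes from this environment.

```python
EXPECTED_PATHS = 44  # total paths each host should see in this environment


def find_hosts_missing_paths(path_counts, expected=EXPECTED_PATHS):
    """Return {host: number_of_missing_paths} for hosts below the expected count."""
    return {
        host: expected - count
        for host, count in sorted(path_counts.items())
        if count < expected
    }


if __name__ == "__main__":
    # Hypothetical sample data; real counts would come from each ESXi host.
    sample = {"esxi-01": 44, "esxi-02": 32, "esxi-03": 37}
    for host, missing in find_hosts_missing_paths(sample).items():
        print(f"{host}: missing {missing} paths, rescan the iSCSI adapter")
```

Any host this reports is a candidate for an adapter rescan, since a single host with an incomplete path list was enough to trigger the flapping.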

Am I right to assume that the iSCSI gateways automatically determine which LUN they will advertise being primary for? Is there a command that lets me view which gateway is primary for which LUN? I’m guessing when another gateway gets added, the calculation of who is primary for each LUN gets re-calculated and advertised out to the clients?

-Paul




I did a quick test where I re-enabled a second iSCSI gateway to take a closer look at the paths on the ESXi hosts and I definitely see that when the second path becomes available, different hosts are pointing to different gateways for the Active I/O Path.

I was reading up on how ALUA works and, as far as I can tell, isn’t Ceph supposed to indicate to the ESXi hosts which iSCSI gateway “owns” a given LUN at any point, so that the hosts know which path to make active?

Yeah, the ceph-iscsi/tcmu-runner services will do that. They will report this to the clients.


Could there be something wrong where more than one iSCSI gateway is advertising that it owns the LUN to the ESXi hosts?


This has been tested and works well with Linux in production, and the logic has not changed for several years.

I am not sure how ESXi handles this internally, but it should comply with the iSCSI protocol. On Linux, multipath successfully detects which path is active and chooses it.
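
To illustrate the path selection being discussed, here is a minimal sketch of how an initiator might choose a path from the ALUA states reported for a LUN. The state names are the standard ALUA terms; the gateway names, the pick_active_path helper, and the tie-breaking rule are assumptions for illustration only, not the actual logic of ESXi or Linux multipath.

```python
# ALUA states that accept I/O, in order of preference.
USABLE_STATES = ["active/optimized", "active/non-optimized"]


def pick_active_path(reported_states):
    """Pick a gateway for I/O from {gateway: alua_state}, or None if none is usable.

    Only the paths a host can actually see should be passed in.
    """
    for state in USABLE_STATES:
        candidates = sorted(gw for gw, s in reported_states.items() if s == state)
        if candidates:
            return candidates[0]  # assumed tie-break: lowest gateway name
    return None


if __name__ == "__main__":
    # A host that sees both gateways uses the optimized (owning) one.
    print(pick_active_path({"gw1": "active/optimized", "gw2": "active/non-optimized"}))
    # A host that never discovered the path to gw1 falls back to gw2.
    print(pick_active_path({"gw2": "active/non-optimized"}))
```

Under these assumptions, two hosts with different views of the available paths end up sending I/O to different gateways for the same LUN.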

-Paul





