Hi,
I'm running a Ceph cluster with 4x iSCSI exporter nodes and oVirt on the client side. In the tcmu-runner logs I see the following happening every few seconds:
###
2019-10-22 10:11:11.231 1710 [WARN] tcmu_rbd_lock:762 rbd/image.lun0: Acquired exclusive lock.
2019-10-22 10:11:11.395 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:12.346 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: Async lock drop. Old state 1
2019-10-22 10:11:12.353 1710 [INFO] alua_implicit_transition:566 rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:13.325 1710 [INFO] alua_implicit_transition:566 rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:13.852 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:13.854 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:13.863 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:14.202 1710 [INFO] alua_implicit_transition:566 rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:14.285 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:15.217 1710 [WARN] tcmu_rbd_lock:762 rbd/image.lun0: Acquired exclusive lock.
2019-10-22 10:11:15.873 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2019-10-22 10:11:16.696 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: Async lock drop. Old state 1
2019-10-22 10:11:16.696 1710 [INFO] alua_implicit_transition:566 rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:16.696 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: Async lock drop. Old state 2
2019-10-22 10:11:16.992 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
###
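For reference, which gateway currently owns the exclusive lock (and which clients have the image open) can be checked from any Ceph node with the rbd tool, e.g. like this (assuming the pool is really called "rbd", as the rbd/image.lunX prefix in the log suggests):
###
# show the current exclusive-lock holder of one of the affected images
rbd lock ls rbd/image.lun0

# show the image status, including the watchers (clients that have it open)
rbd status rbd/image.lun0
###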
This happens on all four of my iSCSI exporter nodes. The blacklist shows the following (the number of blacklisted entries does not really shrink):
### ceph osd blacklist ls
listed 10579 entries ###
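To see which addresses those entries belong to (i.e. whether it is mostly the gateway IPs that keep getting blacklisted), something like this should work, assuming the usual "<ip>:<port>/<nonce> <expiry>" output format of `ceph osd blacklist ls`:
###
# count blacklist entries per client IP, most frequent first
ceph osd blacklist ls 2>/dev/null \
  | grep -oE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' \
  | sort | uniq -c | sort -rn
###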
On the client side I configured multipath like this:
###
device {
        vendor                 "LIO-ORG"
        hardware_handler       "1 alua"
        path_grouping_policy   "failover"
        path_selector          "queue-length 0"
        failback               60
        path_checker           tur
        prio                   alua
        prio_args              exclusive_pref_bit
        fast_io_fail_tmo       25
        no_path_retry          queue
}
###
And multipath -ll shows me all four paths as "active ready" without errors.
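The ALUA state that the initiator actually sees on each path can also be queried directly with sg_rtpg from sg3_utils, e.g. (the device name is just an example, pick one of the iSCSI path devices listed by multipath -ll):
###
# decode the target port group / ALUA states reported by one path device
sg_rtpg --decode /dev/sdc
###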
To me this looks like tcmu-runner cannot acquire the exclusive lock and it is flapping between nodes. In addition, in the Ceph GUI / dashboard I can see that the "active / optimized" state of the LUNs is flapping between nodes ...
I have installed the following versions (CentOS 7.7, Ceph 13.2.6):
### rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel"
python-cephfs-13.2.6-0.el7.x86_64
ceph-selinux-13.2.6-0.el7.x86_64
kernel-3.10.0-957.5.1.el7.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
kernel-tools-libs-3.10.0-1062.1.2.el7.x86_64
libcephfs2-13.2.6-0.el7.x86_64
libtcmu-1.4.0-106.gd17d24e.el7.x86_64
ceph-common-13.2.6-0.el7.x86_64
ceph-osd-13.2.6-0.el7.x86_64
tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
kernel-3.10.0-1062.1.2.el7.x86_64
ceph-iscsi-3.3-1.el7.noarch
kernel-headers-3.10.0-1062.1.2.el7.x86_64
kernel-3.10.0-862.14.4.el7.x86_64
ceph-base-13.2.6-0.el7.x86_64
kernel-tools-3.10.0-1062.1.2.el7.x86_64
###
Greets, Kilian