It's just the design of the iSCSI protocol. Sure, you can lower the
timeouts (see "fast_io_fail_tmo" [1]), but you will just end up with
more false-positive failovers.

[1] http://docs.ceph.com/docs/master/rbd/iscsi-initiator-linux/

On Thu, Mar 21, 2019 at 10:46 AM li jerry <div8cn@xxxxxxxxxxx> wrote:
>
> Hi Maged,
>
> Thank you for your reply.
>
> To exclude the osd_heartbeat_interval and osd_heartbeat_grace factors, I
> cleared the current LIO configuration, redeployed two CentOS 7 servers
> (not holding any Ceph role), and deployed rbd-target-api, rbd-target-gw
> and tcmu-runner on them.
>
> Then I ran the following test:
> 1. The CentOS 7 client mounts the iSCSI LUN.
> 2. Write data to the iSCSI LUN with dd.
> 3. Shut down the target node that is currently active (forced power off).
>
> [18:33:48] active target node powered off
> [18:33:57] CentOS 7 client notices the iSCSI target is interrupted
> [18:34:23] CentOS 7 client fails over to the other target node
>
> The whole process lasted 35 seconds, and Ceph stayed healthy throughout
> the test.
>
> This failover time is too long for production. Is there anything else I
> can optimize?
>
> Below is the CentOS 7 client log [messages]
> ============================================================
>
> Mar 21 18:33:57 CEPH-client01test kernel: connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4409486146, last ping 4409491148, now 4409496160
> Mar 21 18:33:57 CEPH-client01test kernel: connection4:0: detected conn error (1022)
> Mar 21 18:33:57 CEPH-client01test iscsid: Kernel reported iSCSI connection 4:0 error (1022 - Invalid or unknown error code) state (3)
> Mar 21 18:34:22 CEPH-client01test kernel: session4: session recovery timed out after 25 secs
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: rejecting I/O to offline device
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] killing request
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: rejecting I/O to offline device
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 fd 00 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2358528
> [dozens of repeated "sd 5:0:0:0: [sda] killing request" and
> "sd 5:0:0:0: rejecting I/O to offline device" lines snipped]
> Mar 21 18:34:22 CEPH-client01test kernel: device-mapper: multipath: Failing path 8:0.
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 fe 00 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2358784
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 fe 80 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2358912
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 f3 00 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2355968
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 f7 80 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2357120
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 f2 80 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2355840
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 fd 80 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2358656
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 f5 00 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2356480
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 23 f7 00 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test kernel: blk_update_request: I/O error, dev sda, sector 2356992
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Mar 21 18:34:22 CEPH-client01test kernel: sd 5:0:0:0: [sda] CDB: Write(10) 2a 00 00 24 03 00 00 00 80 00
> Mar 21 18:34:22 CEPH-client01test multipathd: sda: mark as failed
> Mar 21 18:34:22 CEPH-client01test multipathd: mpathb: remaining active paths: 1
> Mar 21 18:34:22 CEPH-client01test kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports ToluSNA
> Mar 21 18:34:23 CEPH-client01test kernel: sd 6:0:0:0: Asymmetric access state changed
> Mar 21 18:34:23 CEPH-client01test kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports ToluSNA
> Mar 21 18:34:23 CEPH-client01test kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports ToluSNA
> Mar 21 18:34:27 CEPH-client01test multipathd: mpathb: sdb - tur checker reports path is up
> Mar 21 18:34:27 CEPH-client01test multipathd: 8:16: reinstated
> Mar 21 18:34:33 CEPH-client01test iscsid: connect to 172.17.1.23:3260 failed (No route to host)
> Mar 21 18:34:41 CEPH-client01test iscsid: connect to 172.17.1.23:3260 failed (No route to host)
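For reference, the 25 s stall visible above ("session recovery timed out
after 25 secs") is the initiator-side recovery window rather than anything
inside Ceph, and the 5 s detection delay is the iSCSI NOOP ("ping") timeout.
A rough sketch of where those knobs live on a stock CentOS 7 initiator is
below; the values and the "LIO-ORG" vendor string are illustrative only
(check "multipath -ll" on the client for the real strings), and as noted
above, shrinking them mostly trades failover latency for false-positive
failovers:

    # /etc/iscsi/iscsid.conf -- session/connection timeouts (applies to new logins)
    node.conn[0].timeo.noop_out_interval = 5      # how often the NOOP "ping" is sent
    node.conn[0].timeo.noop_out_timeout = 5       # the "ping timeout of 5 secs" above
    node.session.timeo.replacement_timeout = 25   # the 25 s "session recovery" window

    # /etc/multipath.conf -- device section for the gateway-exported LUNs,
    # along the lines of the example in the initiator doc linked above
    devices {
        device {
            vendor                 "LIO-ORG"
            hardware_handler       "1 alua"
            path_grouping_policy   "failover"
            path_checker           tur
            prio                   alua
            fast_io_fail_tmo       25
            no_path_retry          queue
        }
    }

iscsid.conf changes only take effect for sessions created after a re-login,
and multipathd needs a reconfigure/restart to pick up multipath.conf changes.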
>
> -----Original Message-----
> From: Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
> Sent: 20 March 2019 15:36
> To: li jerry <div8cn@xxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: Re: CEPH ISCSI LIO multipath change delay
>
>
> On 20/03/2019 07:43, li jerry wrote:
> > Hi, all,
> >
> > I've deployed a Mimic (13.2.5) cluster on 3 CentOS 7.6 servers, then
> > configured an iscsi-target and created a LUN, referring to
> > http://docs.ceph.com/docs/mimic/rbd/iscsi-target-cli/.
> >
> > I have another server running CentOS 7.4, on which I configured and
> > mounted the LUN I'd just created, referring to
> > http://docs.ceph.com/docs/mimic/rbd/iscsi-initiator-linux/.
> >
> > I'm running an HA test:
> >
> > 1. Perform a WRITE test with the dd command.
> >
> > 2. Stop the 'Active' iscsi-target node; dd I/O hangs for over 25
> > seconds until the iscsi-target switches to the other node.
> >
> > 3. dd I/O goes back to normal.
> >
> > My question is, why does the iscsi-target switch take so long? Is
> > there any setting I've misconfigured?
> >
> > It usually only takes a few seconds to switch over on enterprise
> > storage products.
> >
>
> If you mean you shut down the entire host: from your description it is
> also running OSDs, so you also took out some OSDs serving I/O.
>
> If a primary OSD is not responding, client I/O (in this case your iSCSI
> target) will block until Ceph marks the OSD down and issues a new epoch
> map, mapping the PG to another OSD. This process is controlled by
> osd_heartbeat_interval (5) and osd_heartbeat_grace (20), 25 sec in total,
> which is what you observe. I do not recommend lowering them, or your
> cluster will be over-sensitive and OSDs could flap under load.
>
> Maged
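For completeness, the two options Maged mentions are ordinary OSD config
settings. A minimal sketch of where they live is included purely for
reference; as he says, lowering them cluster-wide is not recommended:

    # ceph.conf on the OSD nodes -- the values Maged quotes
    # (heartbeat interval + grace ~= the 25 s it takes to mark a dead OSD down)
    [osd]
    osd heartbeat interval = 5
    osd heartbeat grace = 20

    # inspect what a running OSD is actually using (run on the node hosting it):
    # ceph daemon osd.0 config show | grep heartbeat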
--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com