On Thu, Mar 8, 2018 at 2:11 PM, Ashish Samant <ashish.samant@xxxxxxxxxx> wrote:
>
> On 03/08/2018 10:44 AM, Mike Christie wrote:
>>
>> On 03/08/2018 10:59 AM, Lazuardi Nasution wrote:
>>>
>>> Hi Mike,
>>>
>>> Since I have moved from LIO to TGT, I can do full ALUA (active/active)
>>> across multiple gateways. Of course I have to disable any write-back
>>> cache at any level (RBD cache and TGT cache). It seems to be safe to
>>> disable the exclusive lock since each RBD image is accessed by only a
>>> single client and, as far as I know, ALUA mostly uses round-robin
>>> across the I/O paths.
>>
>> It might be possible if you have configured your timers correctly, but I
>> do not think anyone has figured it all out yet.
>>
>> Here is a simple but long example of the problem. Sorry for the length,
>> but I want to make sure people know the risks.
>>
>> You have 2 iSCSI target nodes and 1 iSCSI initiator connected to both,
>> doing active/active multipathing over them.
>>
>> To make it really easy to hit, the iSCSI initiator should be connected
>> to the target with a different NIC port or network than what is being
>> used for Ceph traffic.
>>
>> 1. Prep the data. Just clear the first sector of your iSCSI disk. On the
>> initiator system do:
>>
>> dd if=/dev/zero of=/dev/sdb count=1 oflag=direct
>>
>> 2. Kill the network/port for one of the iSCSI targets' Ceph traffic. For
>> example, on target node 1 pull its cable for Ceph traffic, assuming you
>> set it up so that iSCSI and Ceph use different physical ports. iSCSI
>> traffic should be unaffected for this test.
>>
>> 3. Write some new data over the sector we just wrote in #1. This will
>> get sent from the initiator to the target OK, but get stuck in the
>> rbd/ceph layer since that network is down:
>>
>> dd if=somefile of=/dev/sdb count=1 oflag=direct iflag=direct
>>
>> 4. The initiator's error handler (EH) timers will fire, the command will
>> be failed, and it will be retried on the other path. After the dd in #3
>> completes, run:
>>
>> dd if=someotherfile of=/dev/sdb count=1 oflag=direct iflag=direct
>>
>> This should execute quickly since it goes through the good iSCSI and
>> Ceph path right away.
>>
>> 5. Now plug the cable back in and wait maybe 30 seconds for the network
>> to come back up and the stuck command to run.
>>
>> 6. Now do:
>>
>> dd if=/dev/sdb of=somenewfile count=1 iflag=direct oflag=direct
>>
>> The data read back is going to be the data sent in step 3 and not the
>> newer data from step 4.
>>
>> To get around this issue you could try to set the krbd
>> osd_request_timeout to a value shorter than the initiator-side failover
>> timeout (for multipath-tools/open-iscsi on Linux this would be
>> fast_io_fail_tmo/the replacement timeout) plus the various TMF/EH
>> timeouts, but you also have to account for the transport-related timers
>> that might short circuit/bypass the TMF-based EH.
>>
>> One problem with trying to rely on configuring that is handling all the
>> corner cases. For example, say you have:
>>
>> - The transport (nop) timer or SCSI/TMF command timer set so the
>> fast_io_fail/replacement timer starts at N seconds and then fires at M.
>> - A really bad connection, so it takes N - 1 seconds to get the SCSI
>> command from the initiator to the target.
>> - At the N second mark the iSCSI connection is dropped and the
>> fast_io_fail/replacement timer is started.
>>
>> For the easy case, the SCSI command is sent directly to krbd, so if
>> osd_request_timeout is less than M seconds the command will be failed in
>> time and we would not hit the problem above.
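To make the timer relationship above concrete, here is a rough, untested
sketch; the image name, IQN and the specific numbers are only placeholders
(not recommendations), and the krbd osd_request_timeout map option needs a
kernel recent enough to support it:

    # Fail any OSD request that has been stuck for more than 25s:
    rbd map rbd/test-img -o osd_request_timeout=25

    # Keep the initiator-side failover window longer than that, e.g. 40s
    # (placeholder IQN):
    iscsiadm -m node -T iqn.2003-01.org.example:target0 -o update \
        -n node.session.timeo.replacement_timeout -v 40

    # If dm-multipath manages the LUN, fast_io_fail_tmo in multipath.conf
    # generally plays the replacement-timeout role and should be sized the
    # same way.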
>>
>> If something happens in the target stack, like the SCSI command getting
>> stuck/queued, then your osd_request_timeout value might be too short. For
>> example, if you were using tgt/lio right now and this was a
>> COMPARE_AND_WRITE, the READ part might take osd_request_timeout - 1
>> seconds, and then the WRITE part might take osd_request_timeout - 1
>> seconds, so you need to have your fast_io_fail timeout long enough for
>> that type of case. For tgt a WRITE_SAME command might become N WRITEs to
>> krbd, so you need to make sure your queue depths are set so you do not
>> end up with something similar to the COMPARE_AND_WRITE case, where M
>> WRITEs get executed and take osd_request_timeout - 1 seconds, then M
>> more, and so on, while at some point the iSCSI connection is lost and
>> the failover timer has already started. Some Ceph requests might also
>> turn into multiple requests.
>>
>> Maybe an overly paranoid case, but one I still worry about because I do
>> not want to mess up anyone's data, is a disk on the iSCSI target node
>> going flaky. In the target we do a kmalloc(GFP_KERNEL) to execute a SCSI
>> command, and that can block trying to write data out to the flaky disk.
>> If the disk recovers and we eventually recover, did you account for the
>> recovery timers in that code path when configuring the failover and krbd
>> timers?
>>
>> One other case we have been debating is: if krbd/librbd is able to put
>> the Ceph request on the wire but then the iSCSI connection goes down,
>> will the Ceph request always get sent to the OSD before the
>> initiator-side failover timeouts have fired and the initiator starts
>> using a different target node?
>
> If krbd/librbd is able to put the Ceph request on the wire, then that
> could cause data corruption in the active/passive case too, right?

In general, yes. However, that's why the LIO/librbd approach uses the RBD
exclusive-lock feature in combination w/ Ceph client blacklisting to ensure
that cannot occur. Upon path failover, the old RBD client is blacklisted
from the Ceph cluster to ensure it can never complete its (possible)
in-flight writes.

> Thanks,
> Ashish
>
>>
>>> Best regards,
>>>
>>> On Mar 8, 2018 11:54 PM, "Mike Christie" <mchristi@xxxxxxxxxx> wrote:
>>>
>>>     On 03/07/2018 09:24 AM, shadow_lin wrote:
>>>     > Hi Christie,
>>>     > Is it safe to use active/passive multipath with krbd with
>>>     > exclusive lock for lio/tgt/scst/tcmu?
>>>
>>>     No. We tried to use lio and krbd initially, but there is an issue
>>>     where IO might get stuck in the target/block layer and get executed
>>>     after new IO. So for lio, tgt and tcmu it is not safe as is right
>>>     now. We could add some code to tcmu's file_example handler, which
>>>     can be used with krbd, so that it works like the rbd one.
>>>
>>>     I do not know enough about SCST right now.
>>>
>>>     > Is it safe to use active/active multipath if using the SUSE
>>>     > kernel with target_core_rbd?
>>>     > Thanks.
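Coming back to the exclusive-lock plus blacklisting behaviour described in
Jason's reply above, a rough way to see it from the Ceph side; pool/image
names here are placeholders:

    # Confirm the image has exclusive-lock enabled, and enable it if not:
    rbd info rbd/test-img | grep features
    rbd feature enable rbd/test-img exclusive-lock

    # After a path failover the fenced (old) client should show up here:
    ceph osd blacklist ls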
>>>     >
>>>     > 2018-03-07
>>>     > ------------------------------------------------------------------------
>>>     > shadowlin
>>>     > ------------------------------------------------------------------------
>>>     >
>>>     >     From: Mike Christie <mchristi@xxxxxxxxxx>
>>>     >     Sent: 2018-03-07 03:51
>>>     >     Subject: Re: iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
>>>     >     To: "Lazuardi Nasution" <mrxlazuardin@xxxxxxxxx>, "Ceph Users" <ceph-users@xxxxxxxxxxxxxx>
>>>     >     Cc:
>>>     >
>>>     >     On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
>>>     >     > Hi,
>>>     >     >
>>>     >     > I want to do load-balanced multipathing (multiple iSCSI
>>>     >     > gateway/exporter nodes) of iSCSI backed by RBD images. Should
>>>     >     > I disable the exclusive lock feature? What if I don't disable
>>>     >     > that feature? I'm using TGT (the manual way) since I got so
>>>     >     > many CPU stuck error messages when I was using LIO.
>>>     >
>>>     >     You are using LIO/TGT with krbd, right?
>>>     >
>>>     >     You cannot, or shouldn't, do active/active multipathing. If you
>>>     >     have the lock enabled then it bounces between paths for each IO
>>>     >     and will be slow. If you do not have it enabled then you can
>>>     >     end up with stale IO overwriting current data.
>>>

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
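For reference, an untested sketch of the kind of active/passive (failover)
dm-multipath stanza being discussed in this thread; the vendor/product
strings and the timer value are assumptions that have to be checked against
what your gateways actually report (e.g. with multipath -ll):

    # /etc/multipath.conf
    devices {
        device {
            vendor                "LIO-ORG"   # assumed; tgt reports a different string
            product               ".*"
            path_grouping_policy  failover    # active/passive: one path carries IO at a time
            path_checker          tur
            hardware_handler      "1 alua"
            prio                  alua
            failback              manual
            no_path_retry         queue
            fast_io_fail_tmo      40          # keep this above the krbd osd_request_timeout
        }
    }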