On Thu, Mar 8, 2018 at 2:11 PM, Ashish Samant <ashish.samant@xxxxxxxxxx> wrote:
>
> On 03/08/2018 10:44 AM, Mike Christie wrote:
>>
>> On 03/08/2018 10:59 AM, Lazuardi Nasution wrote:
>>>
>>> Hi Mike,
>>>
>>> Since I have moved from LIO to TGT, I can do full ALUA (active/active)
>>> across multiple gateways. Of course I have to disable any write-back
>>> cache at any level (RBD cache and TGT cache). It seems to be safe to
>>> disable the exclusive lock since each RBD image is accessed by only a
>>> single client and, as far as I know, ALUA mostly uses round-robin
>>> across the I/O paths.
>>
>> It might be possible if you have configured your timers correctly, but I
>> do not think anyone has figured it all out yet.
>>
>> Here is a simple but long example of the problem. Sorry for the length,
>> but I want to make sure people know the risks.
>>
>> You have 2 iSCSI target nodes and 1 iSCSI initiator connected to both,
>> doing active/active multipathing over them.
>>
>> To make it really easy to hit, the iSCSI initiator should be connected
>> to the target with a different NIC port or network than what is being
>> used for Ceph traffic.
>>
>> 1. Prep the data. Just clear the first sector of your iSCSI disk. On the
>> initiator system do:
>>
>> dd if=/dev/zero of=/dev/sdb count=1 oflag=direct
>>
>> 2. Kill the network/port for one of the iSCSI targets' Ceph traffic. For
>> example, on target node 1 pull its cable for Ceph traffic, assuming you
>> set it up so that iSCSI and Ceph use different physical ports. iSCSI
>> traffic should be unaffected for this test.
>>
>> 3. Write some new data over the sector we just wrote in #1. This will
>> get sent from the initiator to the target OK, but get stuck in the
>> rbd/ceph layer since that network is down:
>>
>> dd if=somefile of=/dev/sdb count=1 oflag=direct iflag=direct
>>
>> 4. The initiator's error handler (EH) timers will fire, the command will
>> be failed, and it will be retried on the other path. After the dd in #3
>> completes, run:
>>
>> dd if=someotherfile of=/dev/sdb count=1 oflag=direct iflag=direct
>>
>> This should execute quickly since it goes through the good iSCSI and
>> Ceph path right away.
>>
>> 5. Now plug the cable back in and wait maybe 30 seconds for the network
>> to come back up and the stuck command to run.
>>
>> 6. Now do:
>>
>> dd if=/dev/sdb of=somenewfile count=1 iflag=direct oflag=direct
>>
>> The data read back is going to be the data sent in step 3 and not the
>> newer data from step 4.
>>
>> To get around this issue you could try to set the krbd
>> osd_request_timeout to a value shorter than the initiator-side failover
>> timeout (for multipath-tools/open-iscsi on Linux this would be
>> fast_io_fail_tmo/the replacement timeout) plus the various TMF/EH
>> timeouts, but you also have to account for the transport-related timers
>> that might short circuit/bypass the TMF-based EH.
>>
>> One problem with trying to rely on configuring that is handling all the
>> corner cases. For example, say you have:
>>
>> - The transport (nop) timer or SCSI/TMF command timer set so the
>> fast_io_fail/replacement timer starts at N seconds and then fires at M.
>> - A really bad connection, so it takes N - 1 seconds to get the SCSI
>> command from the initiator to the target.
>> - At the N second mark the iSCSI connection is dropped and the
>> fast_io_fail/replacement timer is started.
>>
>> For the easy case, the SCSI command is sent directly to krbd, so if
>> osd_request_timeout is less than M seconds the command will be failed in
>> time and we would not hit the problem above.
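To make the timer relationship above concrete, here is a rough, untested
sketch; the image name, IQN and the specific numbers are only placeholders
(not recommendations), and the krbd osd_request_timeout map option needs a
kernel recent enough to support it:

    # Fail any OSD request that has been stuck for more than 25s:
    rbd map rbd/test-img -o osd_request_timeout=25

    # Keep the initiator-side failover window longer than that, e.g. 40s
    # (placeholder IQN):
    iscsiadm -m node -T iqn.2003-01.org.example:target0 -o update \
        -n node.session.timeo.replacement_timeout -v 40

    # If dm-multipath manages the LUN, fast_io_fail_tmo in multipath.conf
    # generally plays the replacement-timeout role and should be sized the
    # same way.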
>>
>> If something happens in the target stack, like the SCSI command getting
>> stuck/queued, then your osd_request_timeout value might be too short. For
>> example, if you were using tgt/lio right now and this was a
>> COMPARE_AND_WRITE, the READ part might take osd_request_timeout - 1
>> seconds, and then the WRITE part might take osd_request_timeout - 1
>> seconds, so you need to have your fast_io_fail timeout long enough for
>> that type of case. For tgt a WRITE_SAME command might become N WRITEs to
>> krbd, so you need to make sure your queue depths are set so you do not
>> end up with something similar to the COMPARE_AND_WRITE case, where M
>> WRITEs get executed and take osd_request_timeout - 1 seconds, then M
>> more, and so on, while at some point the iSCSI connection is lost and
>> the failover timer has already started. Some Ceph requests might also
>> turn into multiple requests.
>>
>> Maybe an overly paranoid case, but one I still worry about because I do
>> not want to mess up anyone's data, is a disk on the iSCSI target node
>> going flaky. In the target we do a kmalloc(GFP_KERNEL) to execute a SCSI
>> command, and that can block trying to write data out to the flaky disk.
>> If the disk recovers and we eventually recover, did you account for the
>> recovery timers in that code path when configuring the failover and krbd
>> timers?
>>
>> One other case we have been debating is: if krbd/librbd is able to put
>> the Ceph request on the wire but then the iSCSI connection goes down,
>> will the Ceph request always get sent to the OSD before the
>> initiator-side failover timeouts have fired and the initiator starts
>> using a different target node?
>
> If krbd/librbd is able to put the Ceph request on the wire, then that
> could cause data corruption in the active/passive case too, right?

In general, yes. However, that's why the LIO/librbd approach uses the RBD
exclusive-lock feature in combination w/ Ceph client blacklisting to ensure
that cannot occur. Upon path failover, the old RBD client is blacklisted
from the Ceph cluster to ensure it can never complete its (possible)
in-flight writes.

> Thanks,
> Ashish
>
>>
>>> Best regards,
>>>
>>> On Mar 8, 2018 11:54 PM, "Mike Christie" <mchristi@xxxxxxxxxx> wrote:
>>>
>>>     On 03/07/2018 09:24 AM, shadow_lin wrote:
>>>     > Hi Christie,
>>>     > Is it safe to use active/passive multipath with krbd with
>>>     > exclusive lock for lio/tgt/scst/tcmu?
>>>
>>>     No. We tried to use lio and krbd initially, but there is an issue
>>>     where IO might get stuck in the target/block layer and get executed
>>>     after new IO. So for lio, tgt and tcmu it is not safe as is right
>>>     now. We could add some code to tcmu's file_example handler, which
>>>     can be used with krbd, so that it works like the rbd one.
>>>
>>>     I do not know enough about SCST right now.
>>>
>>>     > Is it safe to use active/active multipath if using the SUSE
>>>     > kernel with target_core_rbd?
>>>     > Thanks.
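Coming back to the exclusive-lock plus blacklisting behaviour described in
Jason's reply above, a rough way to see it from the Ceph side; pool/image
names here are placeholders:

    # Confirm the image has exclusive-lock enabled, and enable it if not:
    rbd info rbd/test-img | grep features
    rbd feature enable rbd/test-img exclusive-lock

    # After a path failover the fenced (old) client should show up here:
    ceph osd blacklist ls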
>>>     >
>>>     > 2018-03-07
>>>     > ------------------------------------------------------------------------
>>>     > shadowlin
>>>     > ------------------------------------------------------------------------
>>>     >
>>>     >     From: Mike Christie <mchristi@xxxxxxxxxx>
>>>     >     Sent: 2018-03-07 03:51
>>>     >     Subject: Re: iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
>>>     >     To: "Lazuardi Nasution" <mrxlazuardin@xxxxxxxxx>, "Ceph Users" <ceph-users@xxxxxxxxxxxxxx>
>>>     >     Cc:
>>>     >
>>>     >     On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
>>>     >     > Hi,
>>>     >     >
>>>     >     > I want to do load-balanced multipathing (multiple iSCSI
>>>     >     > gateway/exporter nodes) of iSCSI backed by RBD images. Should
>>>     >     > I disable the exclusive lock feature? What if I don't disable
>>>     >     > that feature? I'm using TGT (the manual way) since I got so
>>>     >     > many CPU stuck error messages when I was using LIO.
>>>     >
>>>     >     You are using LIO/TGT with krbd, right?
>>>     >
>>>     >     You cannot, or shouldn't, do active/active multipathing. If you
>>>     >     have the lock enabled then it bounces between paths for each IO
>>>     >     and will be slow. If you do not have it enabled then you can
>>>     >     end up with stale IO overwriting current data.
>>>

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
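For reference, an untested sketch of the kind of active/passive (failover)
dm-multipath stanza being discussed in this thread; the vendor/product
strings and the timer value are assumptions that have to be checked against
what your gateways actually report (e.g. with multipath -ll):

    # /etc/multipath.conf
    devices {
        device {
            vendor                "LIO-ORG"   # assumed; tgt reports a different string
            product               ".*"
            path_grouping_policy  failover    # active/passive: one path carries IO at a time
            path_checker          tur
            hardware_handler      "1 alua"
            prio                  alua
            failback              manual
            no_path_retry         queue
            fast_io_fail_tmo      40          # keep this above the krbd osd_request_timeout
        }
    }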