Re: iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

Hi Mike,

> For the easy case, the SCSI command is sent directly to krbd and so if
> osd_request_timeout is less than M seconds then the command will be
> failed in time and we would not hit the problem above.
> If something happens in the target stack like the SCSI command gets
> stuck/queued then your osd_request_timeout value might be too short.

1) Currently the osd_request_timeout timer (req->r_start_stamp) is started
in osd_client.c, which is late in the stack, and as you mentioned things
could get stuck earlier. Would it be better to start this timer earlier,
for example in iscsi_target.c iscsit_handle_scsi_cmd() at the start of
processing, and propagate that value down to osd_client?
Even more accurate would be to use SO_TIMESTAMPING and timestamp the
socket buffers as they are received, to compute the age of the current
stream position. We could also use TCP timestamps (RFC 7323) sent from the
initiator, which are enabled by default on Linux/Windows/ESXi, but that is
more work. What are your thoughts?

2) I understand that before switching paths the initiator will send a
TMF ABORT. Can we pass this down to the same abort_request() function
in osd_client that is used on osd_request_timeout expiry?

Cheers /Maged

On 2018-03-08 20:44, Mike Christie wrote:

On 03/08/2018 10:59 AM, Lazuardi Nasution wrote:
Hi Mike,

Since I have moved from LIO to TGT, I can do full ALUA (active/active)
across multiple gateways. Of course I have to disable any write-back cache
at any level (RBD cache and TGT cache). It seems safe to disable the
exclusive lock since each RBD image is accessed by only a single client,
and as far as I know ALUA mostly uses round-robin across the I/O paths.

It might be possible if you have configured your timers correctly but I
do not think anyone has figured it all out yet.

Here is a simple but long example of the problem. Sorry for the length,
but I want to make sure people know the risks.

You have 2 iscsi target nodes and 1 iscsi initiator connected to both
doing active/active over them.

To make it really easy to hit, the iscsi initiator should be connected
to the target with a different nic port or network than what is being
used for ceph traffic.

1. Prep the data. Just clear the first sector of your iscsi disk. On the
initiator system do:

dd if=/dev/zero of=/dev/sdb count=1 oflag=direct

2. Kill the network/port used for ceph traffic on one of the iscsi
targets. For example, on target node 1 pull its cable for ceph traffic,
assuming you set it up so iscsi and ceph use different physical ports.
iSCSI traffic should be unaffected for this test.

3. Write some new data over the sector we just wrote in #1. This will
get sent from the initiator to the target ok, but get stuck in the
rbd/ceph layer since that network is down:

dd if=somefile of=/dev/sdb count=1 oflag=direct iflag=direct

4. The initiator's EH timers will fire, the path will be failed, and the
command will get failed over and retried on the other path. After the dd
in #3 completes, run:

dd if=someotherfile of=/dev/sdb count=1 oflag=direct iflag=direct

This should execute quickly since it goes through the good iscsi and
ceph path right away.

5. Now plug the cable back in and wait for maybe 30 seconds for the
network to come back up and the stuck command to run.

6. Now do

dd if=/dev/sdb of=somenewfile count=1 iflag=direct oflag=direct

The data read back is going to be the stale data sent in step 3 and not
the newer data written in step 4.
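
If you want to double check which write actually landed, one quick way
(assuming GNU cmp and the file names used above) is to compare just the
first sector:

cmp -n 512 somenewfile somefile && echo "readback matches the stale step 3 data"
cmp -n 512 somenewfile someotherfile && echo "readback matches the newer step 4 data"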

To get around this issue you could try to set the krbd
osd_request_timeout to a value shorter than the initiator-side failover
timeout (for multipath-tools/open-iscsi on Linux this would be
fast_io_fail_tmo/replacement timeout) plus the various TMF/EH timeouts,
while also accounting for the transport-related timers that might
short-circuit/bypass the TMF-based EH.
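
For illustration only (the image name and the numbers below are just
placeholders, not recommendations), the knobs involved look something
like this:

# krbd map option, in seconds:
rbd map rbd/myimage -o osd_request_timeout=25

# open-iscsi initiator side (/etc/iscsi/iscsid.conf):
node.session.timeo.replacement_timeout = 30

# multipath-tools (/etc/multipath.conf, defaults or device section):
fast_io_fail_tmo 30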

One problem with trying to rely on configuring that is handling all the
corner cases. So you have:

- Transport (nop) timer or SCSI/TMF command timer set so the
fast_io_fail/replacement timer starts at N seconds and then fires at M.
- It is a really bad connection, so it takes N - 1 seconds to get the
SCSI command from the initiator to the target.
- At the N second mark the iscsi connection is dropped and the
fast_io_fail/replacement timer is started.
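
Written out as a rough timeline (N and M as in the list above, purely
for illustration):

t = 0      the initiator sends the SCSI command; the nop/command timer is running
t = N - 1  the command finally reaches the target over the bad connection
t = N      the iscsi connection is dropped and the fast_io_fail/replacement timer starts
t = M      the fast_io_fail/replacement timer fires and the initiator fails over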

For the easy case, the SCSI command is sent directly to krbd and so if
osd_request_timeout is less than M seconds then the command will be
failed in time and we would not hit the problem above.

If something happens in the target stack, like the SCSI command getting
stuck/queued, then your osd_request_timeout value might be too short. For
example, if you were using tgt/lio right now and this was a
COMPARE_AND_WRITE, the READ part might take osd_request_timeout - 1
seconds and then the WRITE part might take osd_request_timeout - 1
seconds, so you need your fast_io_fail to be long enough for that type
of case. For tgt a WRITE_SAME command might become N WRITEs to krbd, so
you need to make sure your queue depths are set so you do not end up
with something similar to the CAW case, but where M WRITEs get executed
and take osd_request_timeout - 1 seconds, then M more, etc., while at
some point the iscsi connection is lost and the failover timer has
started. Some ceph requests might also map to multiple OSD requests.
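
To put made-up numbers on that: with osd_request_timeout set to 25
seconds, a single COMPARE_AND_WRITE could take roughly 24 seconds for
the READ part plus another 24 seconds for the WRITE part, so the
fast_io_fail/replacement timeout would need to be comfortably longer
than about 48 seconds plus the TMF/EH time just for that one command.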

Maybe an overly paranoid case, but one I still worry about because I do
not want to mess up anyone's data, is a disk on the iscsi target node
going flaky. In the target we do a kmalloc(GFP_KERNEL) to execute a SCSI
command, and that allocation can block while trying to write data out to
the flaky disk. If the disk eventually recovers and we can continue, did
you account for the recovery timers in that code path when configuring
the failover and krbd timers?

One other case we have been debating is: if krbd/librbd is able to put
the ceph request on the wire but then the iscsi connection goes down,
will the ceph request always reach the OSD before the initiator-side
failover timeouts have fired and the initiator has started using a
different target node?



Best regards,

On Mar 8, 2018 11:54 PM, "Mike Christie" <mchristi@xxxxxxxxxx> wrote:

    On 03/07/2018 09:24 AM, shadow_lin wrote:
    > Hi Christie,
    > Is it safe to use active/passive multipath with krbd with exclusive
    > lock for lio/tgt/scst/tcmu?

    No. We tried to use lio and krbd initially, but there is an issue where
    IO might get stuck in the target/block layer and get executed after newer
    IO. So for lio, tgt and tcmu it is not safe as is right now. We could
    add some code to tcmu's file_example handler, which can be used with
    krbd, so it works like the rbd one.

    I do not know enough about SCST right now.


    > Is it safe to use active/active multipath if using the SUSE kernel
    > with target_core_rbd?
    > Thanks.
    >
    > 2018-03-07
    >
    ------------------------------------------------------------------------
    > shadowlin
    >
    >
    ------------------------------------------------------------------------
    >
    >     *From:* Mike Christie <mchristi@xxxxxxxxxx>
    >     *Sent:* 2018-03-07 03:51
    >     *Subject:* Re: iSCSI Multipath (Load Balancing) vs RBD
    >     Exclusive Lock
    >     *To:* "Lazuardi Nasution" <mrxlazuardin@xxxxxxxxx>,
    >     "Ceph Users" <ceph-users@xxxxxxxxxxxxxx>
    >     *Cc:*
    >
    >     On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
    >     > Hi,
    >     >
    >     > I want to do load balanced multipathing (multiple iSCSI
    >     > gateway/exporter nodes) of iSCSI backed by RBD images. Should I
    >     > disable the exclusive lock feature? What if I don't disable that
    >     > feature? I'm using TGT (manual way) since I got so many CPU stuck
    >     > error messages when I was using LIO.
    >     >
    >
    >     You are using LIO/TGT with krbd right?
    >
    >     You cannot or shouldn't do active/active multipathing. If you
    >     have the lock enabled then it bounces between paths for each IO
    >     and will be slow. If you do not have it enabled then you can end
    >     up with stale IO overwriting current data.
    >
    >
    >
    >


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

