Re: commit ab118da4c10a70b8437f5c90ab77adae1835963e causes ib_srpt to fail connections served by target LIO

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Mar 16, 2020 at 09:21:40AM +0200, Leon Romanovsky wrote:
> On Sun, Mar 15, 2020 at 05:56:17PM -0400, Laurence Oberman wrote:
> > On Sun, 2020-03-15 at 17:01 -0400, Laurence Oberman wrote:
> > > On Sun, 2020-03-15 at 22:40 +0200, Max Gurtovoy wrote:
> > > > On 3/15/2020 8:36 PM, Laurence Oberman wrote:
> > > > > On Sun, 2020-03-15 at 20:20 +0200, Max Gurtovoy wrote:
> > > > > > On 3/15/2020 7:59 PM, Laurence Oberman wrote:
> > > > > > > On Sun, 2020-03-15 at 18:47 +0200, Max Gurtovoy wrote:
> > > > > > > > On 3/14/2020 11:30 PM, Laurence Oberman wrote:
> > > > > > > > > Hello Bart, Leon and Max
> > > > > > > >
> > > > > > > > Hi Laurence,
> > > > > > > >
> > > > > > > > thanks for the great analysis and the fast response !!
> > > > > > > >
> > > > > > > > > Max had reached out to me to test a new set of patches
> > > > > > > > > for
> > > > > > > > > SRQ.
> > > > > > > > > I had not tested upstream ib_srpt on an LIO target for
> > > > > > > > > quite a
> > > > > > > > > while,
> > > > > > > > > only ib_srp client tests had been run of late.
> > > > > > > > > During a baseline test before applying Max's patches it
> > > > > > > > > was
> > > > > > > > > apparent
> > > > > > > > > that something had broken ib_srpt connections within LIO
> > > > > > > > > target
> > > > > > > > > since
> > > > > > > > > 5.5.
> > > > > > > > >
> > > > > > > > > Note thet ib_srp client connectivity with the commit
> > > > > > > > > functions
> > > > > > > > > fine,
> > > > > > > > > it's just the target that breaks with this commit.
> > > > > > > > >
> > > > > > > > > After a long bisect this is the commit that seems to
> > > > > > > > > break
> > > > > > > > > it.
> > > > > > > > > While it's not directly code in ib_srpt, its code in mlx5
> > > > > > > > > vport
> > > > > > > > > ethernet connectivity that then breaks ib_srpt
> > > > > > > > > connectivity
> > > > > > > > > over
> > > > > > > > > mlx5
> > > > > > > > > IB RDMA with LIO.
> > > > > > > >
> > > > > > > > I was able to connect in loopback and also from remote
> > > > > > > > initiator
> > > > > > > > with
> > > > > > > > this commit.
> > > > > > > >
> > > > > > > > So I'm not sure that this commit is broken.
> > > > > > > >
> > > > > > > > I used Bart's scripts to configure the target and to
> > > > > > > > connect
> > > > > > > > to
> > > > > > > > it
> > > > > > > > in
> > > > > > > > loopback (after some modifications for the updated
> > > > > > > > kernel/sysfs/configfs
> > > > > > > > interface).
> > > > > > > >
> > > > > > > > I did see an issue to connect from remote initiator, but
> > > > > > > > after
> > > > > > > > reloading
> > > > > > > > openibd in the initiator side I was able to connect.
> > > > > > > >
> > > > > > > > So I suspect you had the same issue - that also should be
> > > > > > > > debugged.
> > > > > > > >
> > > > > > > > > I will let Leon and others decide but reverting the below
> > > > > > > > > commit
> > > > > > > > > allows
> > > > > > > > > SRP connectivity to an LIO target to work again.
> > > > > > > >
> > > > > > > > I added prints to "mlx5_core_modify_hca_vport_context"
> > > > > > > > function
> > > > > > > > and
> > > > > > > > found that we don't call it in "pure" mlx5 mode with PFs.
> > > > > > > >
> > > > > > > > Maybe you can try it too...
> > > > > > > >
> > > > > > > > I was able to check my patches on my system and I'll send
> > > > > > > > them
> > > > > > > > soon.
> > > > > > > >
> > > > > > > > Thanks again Laurence and Bart.
> > > > > > > >
> > > > > > > > > Max, I will test your new patches once we have a decision
> > > > > > > > > on
> > > > > > > > > this.
> > > > > > > > >
> > > > > > > > > Client
> > > > > > > > > Linux ibclient.lab.eng.bos.redhat.com 5.6.0-rc5+ #1 SMP
> > > > > > > > > Thu
> > > > > > > > > Mar
> > > > > > > > > 12
> > > > > > > > > 16:58:19 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > > > >
> > > > > > > > > Server with reverted commit
> > > > > > > > > Linux fedstorage.bos.redhat.com 5.6.0-rc5+ #1 SMP Sat Mar
> > > > > > > > > 14
> > > > > > > > > 16:39:35
> > > > > > > > > EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > > > >
> > > > > > > > > commit ab118da4c10a70b8437f5c90ab77adae1835963e
> > > > > > > > > Author: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > > > Date:   Wed Nov 13 12:03:47 2019 +0200
> > > > > > > > >
> > > > > > > > >        net/mlx5: Don't write read-only fields in
> > > > > > > > > MODIFY_HCA_VPORT_CONTEXT
> > > > > > > > > command
> > > > > > > > >
> > > > > > > > >        The MODIFY_HCA_VPORT_CONTEXT uses field_selector
> > > > > > > > > to
> > > > > > > > > mask
> > > > > > > > > fields
> > > > > > > > > needed
> > > > > > > > >        to be written, other fields are required to be
> > > > > > > > > zero
> > > > > > > > > according
> > > > > > > > > to
> > > > > > > > > the
> > > > > > > > >        HW specification. The supported fields are
> > > > > > > > > controlled by
> > > > > > > > > bitfield
> > > > > > > > >        and limited to vport state, node and port GUIDs.
> > > > > > > > >
> > > > > > > > >        Signed-off-by: Leon Romanovsky <
> > > > > > > > > leonro@xxxxxxxxxxxx>
> > > > > > > > >        Signed-off-by: Saeed Mahameed <saeedm@xxxxxxxxxxxx
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > diff --git
> > > > > > > > > a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> > > > > > > > > b/drivers/net/ethernet/mellanox/mlx5
> > > > > > > > > index 30f7848..1faac31f 100644
> > > > > > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> > > > > > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> > > > > > > > > @@ -1064,26 +1064,13 @@ int
> > > > > > > > > mlx5_core_modify_hca_vport_context(struct
> > > > > > > > > mlx5_core_dev *dev,
> > > > > > > > >
> > > > > > > > >            ctx =
> > > > > > > > > MLX5_ADDR_OF(modify_hca_vport_context_in,
> > > > > > > > > in,
> > > > > > > > > hca_vport_context);
> > > > > > > > >            MLX5_SET(hca_vport_context, ctx, field_select,
> > > > > > > > > req-
> > > > > > > > > > field_select);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, sm_virt_aware,
> > > > > > > > > req-
> > > > > > > > > > sm_virt_aware);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, has_smi, req-
> > > > > > > > > > has_smi);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, has_raw, req-
> > > > > > > > > > has_raw);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > vport_state_policy,
> > > > > > > > > req-
> > > > > > > > > > policy);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > port_physical_state,
> > > > > > > > > req-
> > > > > > > > > > phys_state);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, vport_state,
> > > > > > > > > req-
> > > > > > > > > > vport_state);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET64(hca_vport_context, ctx, port_guid,
> > > > > > > > > req-
> > > > > > > > > > port_guid);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET64(hca_vport_context, ctx, node_guid,
> > > > > > > > > req-
> > > > > > > > > > node_guid);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, cap_mask1, req-
> > > > > > > > > > cap_mask1);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > cap_mask1_field_select,
> > > > > > > > > req-
> > > > > > > > > > cap_mask1_perm);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, cap_mask2, req-
> > > > > > > > > > cap_mask2);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > cap_mask2_field_select,
> > > > > > > > > req-
> > > > > > > > > > cap_mask2_perm);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, lid, req->lid);
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, init_type_reply,
> > > > > > > > > req-
> > > > > > > > > > init_type_reply);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, lmc, req->lmc);
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, subnet_timeout,
> > > > > > > > > req-
> > > > > > > > > > subnet_timeout);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, sm_lid, req-
> > > > > > > > > > sm_lid);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, sm_sl, req-
> > > > > > > > > > sm_sl);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > qkey_violation_counter,
> > > > > > > > > req-
> > > > > > > > > > qkey_violation_counter);
> > > > > > > > >
> > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > pkey_violation_counter,
> > > > > > > > > req-
> > > > > > > > > > pkey_violation_counter);
> > > > > > > > >
> > > > > > > > > +       if (req->field_select &
> > > > > > > > > MLX5_HCA_VPORT_SEL_STATE_POLICY)
> > > > > > > > > +               MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > vport_state_policy,
> > > > > > > > > +                        req->policy);
> > > > > > > > > +       if (req->field_select &
> > > > > > > > > MLX5_HCA_VPORT_SEL_PORT_GUID)
> > > > > > > > > +               MLX5_SET64(hca_vport_context, ctx,
> > > > > > > > > port_guid,
> > > > > > > > > req-
> > > > > > > > > > port_guid);
> > > > > > > > >
> > > > > > > > > +       if (req->field_select &
> > > > > > > > > MLX5_HCA_VPORT_SEL_NODE_GUID)
> > > > > > > > > +               MLX5_SET64(hca_vport_context, ctx,
> > > > > > > > > node_guid,
> > > > > > > > > req-
> > > > > > > > > > node_guid);
> > > > > > > > >
> > > > > > > > >            err = mlx5_cmd_exec(dev, in, in_sz, out,
> > > > > > > > > sizeof(out));
> > > > > > > > >     ex:
> > > > > > > > >            kfree(in);
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > > Hi Max
> > > > > > > Re:
> > > > > > >
> > > > > > > "
> > > > > > > So I'm not sure that this commit is broken.
> > > > > > > ..
> > > > > > > ..
> > > > > > > I added prints to "mlx5_core_modify_hca_vport_context"
> > > > > > > function
> > > > > > > and
> > > > > > > found that we don't call it in "pure" mlx5 mode with PFs.
> > > > > > >
> > > > > > > Maybe you can try it too...
> > > > > > > "
> > > > > > >
> > > > > > > The thing is without this commit we connect immediately, no
> > > > > > > delay
> > > > > > > no
> > > > > > > issue and I am changing nothing else other than reverting
> > > > > > > here.
> > > > > > >
> > > > > > > So this clearly has a bearing directly on the functionality.
> > > > > > >
> > > > > > > I will look at adding more debug, but with this commit in
> > > > > > > there
> > > > > > > is
> > > > > > > nor
> > > > > > > evidence even of an attempt to connect and fail.
> > > > > > >
> > > > > > > Its silently faling.
> > > > > >
> > > > > > please send me all the configuration steps you run after
> > > > > > booting
> > > > > > the
> > > > > > target + steps in the initiator (can be also in attached file).
> > > > > >
> > > > > > I'll try to follow this.
> > > > > >
> > > > > > Btw, did you try loopback initiator ?
> > > > > >
> > > > > > -Max.
> > > > > >
> > > > > >
> > > > > > > Regards
> > > > > > > Laurence
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > > Hi Max
> > > > >
> > > > > Did not try loopback because here we have actual physical
> > > > > connectity as
> > > > > that is what our customers use.
> > > > >
> > > > > Connected back to back with MLX5 cx4 HCA dual ports at EDR 100
> > > > > Thi sis my standard configuration used for all upstream and Red
> > > > > Hat
> > > > > kernel testing.
> > > > >
> > > > > Reboot server and client and then first prepare server
> > > > >
> > > > > Server
> > > > > ----------
> > > > >
> > > > > the prepare.sh script run is after boot on the server
> > > > >
> > > > >
> > > > > #!/bin/bash
> > > > > ./load_modules.sh
> > > > > ./create_ramdisk.sh
> > > > > targetcli restoreconfig
> > > > > # Set the srp_sq_size
> > > > > for i in
> > > > > /sys/kernel/config/target/srpt/0xfe800000000000007cfe900300726e4e
> > > > > /sys/kernel/config/target/srpt/0xfe800000000000007cfe900300726e4f
> > > > > do
> > > > > 	echo 16384 > $i/tpgt_1/attrib/srp_sq_size
> > > > > done
> > > > >
> > > > > [root@fedstorage ~]# cat load_modules.sh
> > > > > #!/bin/bash
> > > > > modprobe mlx5_ib
> > > > > modprobe ib_srpt
> > > > > modprobe ib_srp
> > > > > modprobe ib_umad
> > > > >
> > > > > [root@fedstorage ~]# cat ./create_ramdisk.sh
> > > > > #!/bin/bash
> > > > > mount -t tmpfs -o size=130g tmpfs /mnt
> > > > > cd /mnt
> > > > > for i in `seq 1 30`; do dd if=/dev/zero of=block-$i bs=1024k
> > > > > count=4000
> > > > > ; done
> > > > >
> > > > >
> > > > >
> > > > > Client
> > > > > --------
> > > > >
> > > > > Once server is ready
> > > > >
> > > > > Run ./start_opensm.sh on client (I sont use the SM on a switch as
> > > > > we
> > > > > are back to back)
> > > > >
> > > > > [root@ibclient ~]# cat ./start_opensm.sh
> > > > > #!/bin/bash
> > > > > rmmod ib_srpt
> > > > > opensm -F opensm.1.conf &
> > > > > opensm -F opensm.2.conf &
> > > > >
> > > > > I will semail the conf only to you as well as the targecli config
> > > > > as th
> > > > > eout is long.
> > > > >
> > > > >
> > > > > Then run start_srp.sh
> > > > >
> > > > > [root@ibclient ~]# cat ./start_srp.sh
> > > > > run_srp_daemon  -V -f /etc/ddn/srp_daemon.conf -R 30 -T 10 -t
> > > > > 7000
> > > > > -ance -i mlx5_0 -p 1 1>/root/srp1.log 2>&1 &
> > > > > run_srp_daemon  -V -f /etc/ddn/srp_daemon.conf -R 30 -T 10 -t
> > > > > 7000
> > > > > -ance -i mlx5_1 -p 1 1>/root/srp2.log 2>&1 &
> > > > >
> > > > > [root@ibclient ~]# cat /etc/ddn/srp_daemon.conf
> > > > > a      queue_size=128,max_cmd_per_lun=32,max_sect=32768
> > > > >
> > > > >
> > > >
> > > > I see that you're link is IB.
> > > >
> > > > I was working on RoCE link layer with rdma_cm.
> > > >
> > > > I'll try to find some free IB setup tomorrow in my lab..
> > > >
> > > > Can you try login using rdma_cm ? need to load ib_ipoib for that in
> > > > IB
> > > > network.
> > > >
> > > > I'm trying to understand whether it's related to the link layer.
> > > >
> > > > p.s. I think the target configuration file didn't arrive.
> > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > > Max,
> > >
> > > Yes, I am, working primarily with SCSI over RDMA Protocol with
> > > Infiniband HCA's in IB mode.
> > > I am not using ROCE.
> > >
> > > Also lets not forget this is a target only issue, the latest 5.6
> > > kernel
> > > runs untouched with no issues on the initiator when the target runs
> > > either 5.4 or 5.6 with the revert.
> > > It would run fine with the target on 5.5 as well if I reverted the
> > > commit on 5.5 too.
> > >
> > > I am not able at this time to switch these adapters to Ethernet mode
> > > and ROCE
> > >
> > > The weird thing is the failure is completely silent so something in
> > > the
> > > Link layer with IB has to failing early.
> > > Thje other strange observation is that the IB interfaces come up with
> > > no issue.
> > > I will try add some debug after reboot into the failing kernel.
> > >
> > > Regards
> > > Laurence
> > >
> > >
> > >
> >
> > Max,
> > Rupesh in cc here will be testing latest upstream on a client/server
> > configuration with ROCE and report back here on if he sees a similar
> > issue with the LIO target with that commit.
> >
> > I will continue working on the IB srpt issue by adding some debug.
> >
> > If you think about anything related to the commit let me know.
>
> Laurence,
>
> As I said above, the most chances are that I removed some RW
> initialization that wasn't protected by field_select and wasn't
> marked in our PRM as RW field.
>
> The question is which one.

I think that I know what is missing.
Can you please try this patch?

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 1faac31f74d0..23f879da9104 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -1071,6 +1071,9 @@ int mlx5_core_modify_hca_vport_context(struct mlx5_core_dev *dev,
                MLX5_SET64(hca_vport_context, ctx, port_guid, req->port_guid);
        if (req->field_select & MLX5_HCA_VPORT_SEL_NODE_GUID)
                MLX5_SET64(hca_vport_context, ctx, node_guid, req->node_guid);
+       MLX5_SET(hca_vport_context, ctx, cap_mask1, req->cap_mask1);
+       MLX5_SET(hca_vport_context, ctx, cap_mask1_field_select,
+                req->cap_mask1_perm);
        err = mlx5_cmd_exec(dev, in, in_sz, out, sizeof(out));
 ex:
        kfree(in);


>
> Thanks
>
> >
> > Regards
> > Laurence
> >



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux