Re: commit ab118da4c10a70b8437f5c90ab77adae1835963e causes ib_srpt to fail connections served by target LIO

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2020-03-16 at 09:30 +0200, Leon Romanovsky wrote:
> On Mon, Mar 16, 2020 at 09:21:40AM +0200, Leon Romanovsky wrote:
> > On Sun, Mar 15, 2020 at 05:56:17PM -0400, Laurence Oberman wrote:
> > > On Sun, 2020-03-15 at 17:01 -0400, Laurence Oberman wrote:
> > > > On Sun, 2020-03-15 at 22:40 +0200, Max Gurtovoy wrote:
> > > > > On 3/15/2020 8:36 PM, Laurence Oberman wrote:
> > > > > > On Sun, 2020-03-15 at 20:20 +0200, Max Gurtovoy wrote:
> > > > > > > On 3/15/2020 7:59 PM, Laurence Oberman wrote:
> > > > > > > > On Sun, 2020-03-15 at 18:47 +0200, Max Gurtovoy wrote:
> > > > > > > > > On 3/14/2020 11:30 PM, Laurence Oberman wrote:
> > > > > > > > > > Hello Bart, Leon and Max
> > > > > > > > > 
> > > > > > > > > Hi Laurence,
> > > > > > > > > 
> > > > > > > > > thanks for the great analysis and the fast response
> > > > > > > > > !!
> > > > > > > > > 
> > > > > > > > > > Max had reached out to me to test a new set of
> > > > > > > > > > patches
> > > > > > > > > > for
> > > > > > > > > > SRQ.
> > > > > > > > > > I had not tested upstream ib_srpt on an LIO target
> > > > > > > > > > for
> > > > > > > > > > quite a
> > > > > > > > > > while,
> > > > > > > > > > only ib_srp client tests had been run of late.
> > > > > > > > > > During a baseline test before applying Max's
> > > > > > > > > > patches it
> > > > > > > > > > was
> > > > > > > > > > apparent
> > > > > > > > > > that something had broken ib_srpt connections
> > > > > > > > > > within LIO
> > > > > > > > > > target
> > > > > > > > > > since
> > > > > > > > > > 5.5.
> > > > > > > > > > 
> > > > > > > > > > Note thet ib_srp client connectivity with the
> > > > > > > > > > commit
> > > > > > > > > > functions
> > > > > > > > > > fine,
> > > > > > > > > > it's just the target that breaks with this commit.
> > > > > > > > > > 
> > > > > > > > > > After a long bisect this is the commit that seems
> > > > > > > > > > to
> > > > > > > > > > break
> > > > > > > > > > it.
> > > > > > > > > > While it's not directly code in ib_srpt, its code
> > > > > > > > > > in mlx5
> > > > > > > > > > vport
> > > > > > > > > > ethernet connectivity that then breaks ib_srpt
> > > > > > > > > > connectivity
> > > > > > > > > > over
> > > > > > > > > > mlx5
> > > > > > > > > > IB RDMA with LIO.
> > > > > > > > > 
> > > > > > > > > I was able to connect in loopback and also from
> > > > > > > > > remote
> > > > > > > > > initiator
> > > > > > > > > with
> > > > > > > > > this commit.
> > > > > > > > > 
> > > > > > > > > So I'm not sure that this commit is broken.
> > > > > > > > > 
> > > > > > > > > I used Bart's scripts to configure the target and to
> > > > > > > > > connect
> > > > > > > > > to
> > > > > > > > > it
> > > > > > > > > in
> > > > > > > > > loopback (after some modifications for the updated
> > > > > > > > > kernel/sysfs/configfs
> > > > > > > > > interface).
> > > > > > > > > 
> > > > > > > > > I did see an issue to connect from remote initiator,
> > > > > > > > > but
> > > > > > > > > after
> > > > > > > > > reloading
> > > > > > > > > openibd in the initiator side I was able to connect.
> > > > > > > > > 
> > > > > > > > > So I suspect you had the same issue - that also
> > > > > > > > > should be
> > > > > > > > > debugged.
> > > > > > > > > 
> > > > > > > > > > I will let Leon and others decide but reverting the
> > > > > > > > > > below
> > > > > > > > > > commit
> > > > > > > > > > allows
> > > > > > > > > > SRP connectivity to an LIO target to work again.
> > > > > > > > > 
> > > > > > > > > I added prints to
> > > > > > > > > "mlx5_core_modify_hca_vport_context"
> > > > > > > > > function
> > > > > > > > > and
> > > > > > > > > found that we don't call it in "pure" mlx5 mode with
> > > > > > > > > PFs.
> > > > > > > > > 
> > > > > > > > > Maybe you can try it too...
> > > > > > > > > 
> > > > > > > > > I was able to check my patches on my system and I'll
> > > > > > > > > send
> > > > > > > > > them
> > > > > > > > > soon.
> > > > > > > > > 
> > > > > > > > > Thanks again Laurence and Bart.
> > > > > > > > > 
> > > > > > > > > > Max, I will test your new patches once we have a
> > > > > > > > > > decision
> > > > > > > > > > on
> > > > > > > > > > this.
> > > > > > > > > > 
> > > > > > > > > > Client
> > > > > > > > > > Linux ibclient.lab.eng.bos.redhat.com 5.6.0-rc5+ #1
> > > > > > > > > > SMP
> > > > > > > > > > Thu
> > > > > > > > > > Mar
> > > > > > > > > > 12
> > > > > > > > > > 16:58:19 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > > > > > 
> > > > > > > > > > Server with reverted commit
> > > > > > > > > > Linux fedstorage.bos.redhat.com 5.6.0-rc5+ #1 SMP
> > > > > > > > > > Sat Mar
> > > > > > > > > > 14
> > > > > > > > > > 16:39:35
> > > > > > > > > > EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > > > > > 
> > > > > > > > > > commit ab118da4c10a70b8437f5c90ab77adae1835963e
> > > > > > > > > > Author: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > > > > Date:   Wed Nov 13 12:03:47 2019 +0200
> > > > > > > > > > 
> > > > > > > > > >        net/mlx5: Don't write read-only fields in
> > > > > > > > > > MODIFY_HCA_VPORT_CONTEXT
> > > > > > > > > > command
> > > > > > > > > > 
> > > > > > > > > >        The MODIFY_HCA_VPORT_CONTEXT uses
> > > > > > > > > > field_selector
> > > > > > > > > > to
> > > > > > > > > > mask
> > > > > > > > > > fields
> > > > > > > > > > needed
> > > > > > > > > >        to be written, other fields are required to
> > > > > > > > > > be
> > > > > > > > > > zero
> > > > > > > > > > according
> > > > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > >        HW specification. The supported fields are
> > > > > > > > > > controlled by
> > > > > > > > > > bitfield
> > > > > > > > > >        and limited to vport state, node and port
> > > > > > > > > > GUIDs.
> > > > > > > > > > 
> > > > > > > > > >        Signed-off-by: Leon Romanovsky <
> > > > > > > > > > leonro@xxxxxxxxxxxx>
> > > > > > > > > >        Signed-off-by: Saeed Mahameed <
> > > > > > > > > > saeedm@xxxxxxxxxxxx
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > diff --git
> > > > > > > > > > a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> > > > > > > > > > b/drivers/net/ethernet/mellanox/mlx5
> > > > > > > > > > index 30f7848..1faac31f 100644
> > > > > > > > > > ---
> > > > > > > > > > a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> > > > > > > > > > +++
> > > > > > > > > > b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> > > > > > > > > > @@ -1064,26 +1064,13 @@ int
> > > > > > > > > > mlx5_core_modify_hca_vport_context(struct
> > > > > > > > > > mlx5_core_dev *dev,
> > > > > > > > > > 
> > > > > > > > > >            ctx =
> > > > > > > > > > MLX5_ADDR_OF(modify_hca_vport_context_in,
> > > > > > > > > > in,
> > > > > > > > > > hca_vport_context);
> > > > > > > > > >            MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > field_select,
> > > > > > > > > > req-
> > > > > > > > > > > field_select);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > sm_virt_aware,
> > > > > > > > > > req-
> > > > > > > > > > > sm_virt_aware);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, has_smi,
> > > > > > > > > > req-
> > > > > > > > > > > has_smi);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, has_raw,
> > > > > > > > > > req-
> > > > > > > > > > > has_raw);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > vport_state_policy,
> > > > > > > > > > req-
> > > > > > > > > > > policy);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > port_physical_state,
> > > > > > > > > > req-
> > > > > > > > > > > phys_state);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > vport_state,
> > > > > > > > > > req-
> > > > > > > > > > > vport_state);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET64(hca_vport_context, ctx,
> > > > > > > > > > port_guid,
> > > > > > > > > > req-
> > > > > > > > > > > port_guid);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET64(hca_vport_context, ctx,
> > > > > > > > > > node_guid,
> > > > > > > > > > req-
> > > > > > > > > > > node_guid);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, cap_mask1,
> > > > > > > > > > req-
> > > > > > > > > > > cap_mask1);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > cap_mask1_field_select,
> > > > > > > > > > req-
> > > > > > > > > > > cap_mask1_perm);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, cap_mask2,
> > > > > > > > > > req-
> > > > > > > > > > > cap_mask2);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > cap_mask2_field_select,
> > > > > > > > > > req-
> > > > > > > > > > > cap_mask2_perm);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, lid, req-
> > > > > > > > > > >lid);
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > init_type_reply,
> > > > > > > > > > req-
> > > > > > > > > > > init_type_reply);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, lmc, req-
> > > > > > > > > > >lmc);
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > subnet_timeout,
> > > > > > > > > > req-
> > > > > > > > > > > subnet_timeout);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, sm_lid,
> > > > > > > > > > req-
> > > > > > > > > > > sm_lid);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx, sm_sl,
> > > > > > > > > > req-
> > > > > > > > > > > sm_sl);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > qkey_violation_counter,
> > > > > > > > > > req-
> > > > > > > > > > > qkey_violation_counter);
> > > > > > > > > > 
> > > > > > > > > > -       MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > pkey_violation_counter,
> > > > > > > > > > req-
> > > > > > > > > > > pkey_violation_counter);
> > > > > > > > > > 
> > > > > > > > > > +       if (req->field_select &
> > > > > > > > > > MLX5_HCA_VPORT_SEL_STATE_POLICY)
> > > > > > > > > > +               MLX5_SET(hca_vport_context, ctx,
> > > > > > > > > > vport_state_policy,
> > > > > > > > > > +                        req->policy);
> > > > > > > > > > +       if (req->field_select &
> > > > > > > > > > MLX5_HCA_VPORT_SEL_PORT_GUID)
> > > > > > > > > > +               MLX5_SET64(hca_vport_context, ctx,
> > > > > > > > > > port_guid,
> > > > > > > > > > req-
> > > > > > > > > > > port_guid);
> > > > > > > > > > 
> > > > > > > > > > +       if (req->field_select &
> > > > > > > > > > MLX5_HCA_VPORT_SEL_NODE_GUID)
> > > > > > > > > > +               MLX5_SET64(hca_vport_context, ctx,
> > > > > > > > > > node_guid,
> > > > > > > > > > req-
> > > > > > > > > > > node_guid);
> > > > > > > > > > 
> > > > > > > > > >            err = mlx5_cmd_exec(dev, in, in_sz, out,
> > > > > > > > > > sizeof(out));
> > > > > > > > > >     ex:
> > > > > > > > > >            kfree(in);
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > 
> > > > > > > > Hi Max
> > > > > > > > Re:
> > > > > > > > 
> > > > > > > > "
> > > > > > > > So I'm not sure that this commit is broken.
> > > > > > > > ..
> > > > > > > > ..
> > > > > > > > I added prints to "mlx5_core_modify_hca_vport_context"
> > > > > > > > function
> > > > > > > > and
> > > > > > > > found that we don't call it in "pure" mlx5 mode with
> > > > > > > > PFs.
> > > > > > > > 
> > > > > > > > Maybe you can try it too...
> > > > > > > > "
> > > > > > > > 
> > > > > > > > The thing is without this commit we connect
> > > > > > > > immediately, no
> > > > > > > > delay
> > > > > > > > no
> > > > > > > > issue and I am changing nothing else other than
> > > > > > > > reverting
> > > > > > > > here.
> > > > > > > > 
> > > > > > > > So this clearly has a bearing directly on the
> > > > > > > > functionality.
> > > > > > > > 
> > > > > > > > I will look at adding more debug, but with this commit
> > > > > > > > in
> > > > > > > > there
> > > > > > > > is
> > > > > > > > nor
> > > > > > > > evidence even of an attempt to connect and fail.
> > > > > > > > 
> > > > > > > > Its silently faling.
> > > > > > > 
> > > > > > > please send me all the configuration steps you run after
> > > > > > > booting
> > > > > > > the
> > > > > > > target + steps in the initiator (can be also in attached
> > > > > > > file).
> > > > > > > 
> > > > > > > I'll try to follow this.
> > > > > > > 
> > > > > > > Btw, did you try loopback initiator ?
> > > > > > > 
> > > > > > > -Max.
> > > > > > > 
> > > > > > > 
> > > > > > > > Regards
> > > > > > > > Laurence
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > 
> > > > > > Hi Max
> > > > > > 
> > > > > > Did not try loopback because here we have actual physical
> > > > > > connectity as
> > > > > > that is what our customers use.
> > > > > > 
> > > > > > Connected back to back with MLX5 cx4 HCA dual ports at EDR
> > > > > > 100
> > > > > > Thi sis my standard configuration used for all upstream and
> > > > > > Red
> > > > > > Hat
> > > > > > kernel testing.
> > > > > > 
> > > > > > Reboot server and client and then first prepare server
> > > > > > 
> > > > > > Server
> > > > > > ----------
> > > > > > 
> > > > > > the prepare.sh script run is after boot on the server
> > > > > > 
> > > > > > 
> > > > > > #!/bin/bash
> > > > > > ./load_modules.sh
> > > > > > ./create_ramdisk.sh
> > > > > > targetcli restoreconfig
> > > > > > # Set the srp_sq_size
> > > > > > for i in
> > > > > > /sys/kernel/config/target/srpt/0xfe800000000000007cfe900300
> > > > > > 726e4e
> > > > > > /sys/kernel/config/target/srpt/0xfe800000000000007cfe900300
> > > > > > 726e4f
> > > > > > do
> > > > > > 	echo 16384 > $i/tpgt_1/attrib/srp_sq_size
> > > > > > done
> > > > > > 
> > > > > > [root@fedstorage ~]# cat load_modules.sh
> > > > > > #!/bin/bash
> > > > > > modprobe mlx5_ib
> > > > > > modprobe ib_srpt
> > > > > > modprobe ib_srp
> > > > > > modprobe ib_umad
> > > > > > 
> > > > > > [root@fedstorage ~]# cat ./create_ramdisk.sh
> > > > > > #!/bin/bash
> > > > > > mount -t tmpfs -o size=130g tmpfs /mnt
> > > > > > cd /mnt
> > > > > > for i in `seq 1 30`; do dd if=/dev/zero of=block-$i
> > > > > > bs=1024k
> > > > > > count=4000
> > > > > > ; done
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Client
> > > > > > --------
> > > > > > 
> > > > > > Once server is ready
> > > > > > 
> > > > > > Run ./start_opensm.sh on client (I sont use the SM on a
> > > > > > switch as
> > > > > > we
> > > > > > are back to back)
> > > > > > 
> > > > > > [root@ibclient ~]# cat ./start_opensm.sh
> > > > > > #!/bin/bash
> > > > > > rmmod ib_srpt
> > > > > > opensm -F opensm.1.conf &
> > > > > > opensm -F opensm.2.conf &
> > > > > > 
> > > > > > I will semail the conf only to you as well as the targecli
> > > > > > config
> > > > > > as th
> > > > > > eout is long.
> > > > > > 
> > > > > > 
> > > > > > Then run start_srp.sh
> > > > > > 
> > > > > > [root@ibclient ~]# cat ./start_srp.sh
> > > > > > run_srp_daemon  -V -f /etc/ddn/srp_daemon.conf -R 30 -T 10
> > > > > > -t
> > > > > > 7000
> > > > > > -ance -i mlx5_0 -p 1 1>/root/srp1.log 2>&1 &
> > > > > > run_srp_daemon  -V -f /etc/ddn/srp_daemon.conf -R 30 -T 10
> > > > > > -t
> > > > > > 7000
> > > > > > -ance -i mlx5_1 -p 1 1>/root/srp2.log 2>&1 &
> > > > > > 
> > > > > > [root@ibclient ~]# cat /etc/ddn/srp_daemon.conf
> > > > > > a      queue_size=128,max_cmd_per_lun=32,max_sect=32768
> > > > > > 
> > > > > > 
> > > > > 
> > > > > I see that you're link is IB.
> > > > > 
> > > > > I was working on RoCE link layer with rdma_cm.
> > > > > 
> > > > > I'll try to find some free IB setup tomorrow in my lab..
> > > > > 
> > > > > Can you try login using rdma_cm ? need to load ib_ipoib for
> > > > > that in
> > > > > IB
> > > > > network.
> > > > > 
> > > > > I'm trying to understand whether it's related to the link
> > > > > layer.
> > > > > 
> > > > > p.s. I think the target configuration file didn't arrive.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > Max,
> > > > 
> > > > Yes, I am, working primarily with SCSI over RDMA Protocol with
> > > > Infiniband HCA's in IB mode.
> > > > I am not using ROCE.
> > > > 
> > > > Also lets not forget this is a target only issue, the latest
> > > > 5.6
> > > > kernel
> > > > runs untouched with no issues on the initiator when the target
> > > > runs
> > > > either 5.4 or 5.6 with the revert.
> > > > It would run fine with the target on 5.5 as well if I reverted
> > > > the
> > > > commit on 5.5 too.
> > > > 
> > > > I am not able at this time to switch these adapters to Ethernet
> > > > mode
> > > > and ROCE
> > > > 
> > > > The weird thing is the failure is completely silent so
> > > > something in
> > > > the
> > > > Link layer with IB has to failing early.
> > > > Thje other strange observation is that the IB interfaces come
> > > > up with
> > > > no issue.
> > > > I will try add some debug after reboot into the failing kernel.
> > > > 
> > > > Regards
> > > > Laurence
> > > > 
> > > > 
> > > > 
> > > 
> > > Max,
> > > Rupesh in cc here will be testing latest upstream on a
> > > client/server
> > > configuration with ROCE and report back here on if he sees a
> > > similar
> > > issue with the LIO target with that commit.
> > > 
> > > I will continue working on the IB srpt issue by adding some
> > > debug.
> > > 
> > > If you think about anything related to the commit let me know.
> > 
> > Laurence,
> > 
> > As I said above, the most chances are that I removed some RW
> > initialization that wasn't protected by field_select and wasn't
> > marked in our PRM as RW field.
> > 
> > The question is which one.
> 
> I think that I know what is missing.
> Can you please try this patch?
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> index 1faac31f74d0..23f879da9104 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> @@ -1071,6 +1071,9 @@ int mlx5_core_modify_hca_vport_context(struct
> mlx5_core_dev *dev,
>                 MLX5_SET64(hca_vport_context, ctx, port_guid, req-
> >port_guid);
>         if (req->field_select & MLX5_HCA_VPORT_SEL_NODE_GUID)
>                 MLX5_SET64(hca_vport_context, ctx, node_guid, req-
> >node_guid);
> +       MLX5_SET(hca_vport_context, ctx, cap_mask1, req->cap_mask1);
> +       MLX5_SET(hca_vport_context, ctx, cap_mask1_field_select,
> +                req->cap_mask1_perm);
>         err = mlx5_cmd_exec(dev, in, in_sz, out, sizeof(out));
>  ex:
>         kfree(in);
> 
> 
> > 
> > Thanks
> > 
> > > 
> > > Regards
> > > Laurence
> > > 
> 
> 

Trying that now and will report back




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux