Re: RDMA device renames and node description

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Feb 19, 2020 at 09:14:06AM -0500, Dennis Dalessandro wrote:
> On 2/19/2020 2:11 AM, Leon Romanovsky wrote:
> > On Tue, Feb 18, 2020 at 12:11:47PM -0500, Dennis Dalessandro wrote:
> > > On 2/18/2020 9:04 AM, Leon Romanovsky wrote:
> > > > On Fri, Feb 14, 2020 at 01:13:53PM -0500, Dennis Dalessandro wrote:
> > > > > Was there any discussion on the upgrade scenario for existing deployments as
> > > > > far as device-rename changing node descriptions?
> > > > >
> > > > > If someone is running an older version of rdma-core they are going to have a
> > > > > certain set of node descriptions for each node. This could be in logs, or
> > > > > configuration databases, who knows what. Now if they upgrade to a new
> > > > > version of rdma-core their node descriptions all automatically change out
> > > > > from under them by default.
> > > > >
> > > > > Of course the admin could disable the rename prior to upgrade and as Leon
> > > > > pointed out previously the upgrade won't remove the disablement file. The
> > > > > problem is they would have to know to do that ahead of time.
> > > >
> > > > Dennis,
> > > >
> > > > It was discussed and the conclusion was that most if not all users are
> > > > using one of two upgrade and strategy.
> > >
> > > Do you have a pointer to a thread I can read, I apparently missed it?
> >
> > First, we started to talk about it even before patches were sent.
> > See this summary from LPC 2017:
> >   * the sysadmin will be able to disable this for "backward support"
> > https://lore.kernel.org/linux-rdma/20170917125603.GA5788@mtr-leonro.local/
> > Second, during the submission too, just need to continue to google it :)
>
> So it was discussed at a meeting 2 years ago that not everyone was at, I
> certainly wasn't, and you summarized with:
>
> 		* predictable or persistent device names?
> 			* need to be able to rename a device
>
> That's not very helpful. Even Jason's presentation that is linked there does
> not address the down side to the node rename especially as far as the impact
> to node description is concerned.
>
> I have looked at the original submission and again I don't see any mention
> of the node description problem. Just an admission that the names are harder
> to read and not what everyone is used to but being consistent in scripts is
> much more important [1].
>
> I'd have to say the script angle is far less important than configuration
> files for thousands of nodes of a large deployment being obsoleted without
> an end user's knowledge beforehand.
>
> > >
> > > > First option is to rely on distro and every distro behaves differently
> > > > in such cases, some of them won't change anything till their last user
> > > > dies :) and others more dynamic with more up-to-date packages already
> > > > adopted our default.
> > >
> > > This is the issue I see. The problem is when the distro doesn't know any
> > > better and pulls in a new rdma-core and breaks things unintentionally. Up to
> > > date is good, but up to date that brings with it what is essentially an ABI
> > > breakage is not.
> >
> > ABI breakage is a strong word, luckily enough it is not defined at all.
> > We never considered dmesg prints, device names, device ordering as an
> > ABI. You can't rely on debug features too, they can disappear too.
>
> Agree, it is a strong word and we can call it what you want. The point is
> you should be able to rely on the node description not being changed out
> from under you unnecessarily though. We aren't talking about a debug feature
> here but a core feature to real world deployments.
>
> Could you envision a patch to a user space library that just changes a
> devices hostname to something that was HW specific because it makes
> scripting easier? I contend that in some cases the node description
> remaining constant is just as important.

I think that the opposite is true and afraid that change to old naming
scheme will return us to the state where broken software left unfixed.

So why don't you add such patch to Intel OFED package?

>
> > So the bottom line, the expectation that distro should fix all broken
> > software before enabling device renaming and their bugs are not excuse
> > to declare ABI breakage.
>
> Again, call it what you want, but you can't deny this change to force the
> rename by default has not broken things. For the record I'm not even talking
> about PSM2 here. There are other, more far reaching implications.

Sure, I'm not arguing about that, however like any other upstream
project, we want to keep an option to change defaults. We followed the
same path like netdev and systemd did in this regards.

>
> > >
> > > > Second option is to use numerous OFED stacks, which are expected to
> > > > provide full upgrade to all components which will work smoothly.
> > >
> > > Yeah I'm sure OFED will handle things for themselves.
> >
> > At the end, OFED stacks behave like "mini-distros", so if they manage to
> > handle it, distro should do the same.
>  The difference there is to the distro the RDMA sub system is but one small
> part. To OFED it is the sole focus. So I expect OFED stacks to be more agile
> at handling this sort of thing.

I disagree about first part of this paragraph. All major distributions
follow closely rdma-core and this ML.

>
> [1]https://patchwork.kernel.org/patch/10870445/
>
> -Denny



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux