Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

Ian Kent <raven@xxxxxxxxxx> · Tue, 23 Jun 2020 16:01:52 +0800

On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > Hello, Ian.
> > > > 
> > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The size of the segments (and thus the number of them)
> > > > > > > > is dictated by the underlying hardware.
> > > > > > 
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that, right?
> > > > > 
> > > > > > I'm still struggling a bit to grasp what you're getting at
> > > > > > but ...
> > > > 
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-object directory isn't the right interface for this sort
> > > > > of usage at all. Are these even real hardware pieces which can
> > > > > be plugged in and out? While being a discrete piece of
> > > > > hardware isn't a requirement to be a device model device, the
> > > > > whole thing is designed with such use cases in mind. It
> > > > > definitely isn't the right design for representing a six-digit
> > > > > number of logical entities.
> > > > 
> > > > > It should be obvious that representing each consecutive memory
> > > > > range with a separate directory entry is far from an optimal
> > > > > way of representing something like this. It's outright silly.
> > > 
> > > > I agree.  And again, Ian, you are just "kicking the problem down
> > > > the road" if we accept these patches.  Please fix this up
> > > > properly so that this interface is correctly fixed to not do
> > > > looney things like this.
> > 
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to accept a change to mitigate it because that could
> > > mean no further discussion on it and no further work toward solving
> > > it.
> > 
> > > But it seems to me a "proper" solution to this will cross a number
> > > of areas so this isn't just "my" problem and, as you point out,
> > > it's likely to become increasingly problematic over time.
> > 
> > So what are your ideas and recommendations on how to handle hotplug
> > memory at this granularity for this much RAM (and larger amounts)?
> 
> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Sorry, but I don't think it's funny at all.

It's not "my platform" either, I'm just the poor old soul that
took this on because, on the face of it, it's a file system
problem as claimed by others that looked at it and promptly
washed their hands of it.

I don't see how asking for your advice is out of order at all.

> 
> Anyway, as I have said before, my first guesses would be:
> 	- increase the granularity size of the "memory chunks", reducing
> 	  the number of devices you create.

Yes, I didn't get that from your initial comments but you've said
it a couple of times recently and I do get it now.

I'll try and find someone appropriate to consult about that and
see where it goes.
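
A back-of-envelope sketch of why the block size matters for the number of
per-block memory devices. The helper name and the RAM and block sizes used
with it are illustrative assumptions, not figures taken from the affected
machine:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of per-block memory devices needed to cover ram_bytes,
 * rounding up for a partial final block. Hypothetical helper for
 * illustration only; it is not a kernel function.
 */
static uint64_t memory_block_count(uint64_t ram_bytes, uint64_t block_bytes)
{
    assert(block_bytes > 0);
    return ram_bytes / block_bytes + (ram_bytes % block_bytes != 0);
}
```

Under those assumptions, memory_block_count(32ULL << 40, 128ULL << 20)
gives 262144 entries for 32 TiB in 128 MiB blocks, while 2 GiB blocks
(2ULL << 30) bring that down to 16384.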

> 	- delay creating the devices until way after booting, or do it
> 	  on a totally different path/thread/workqueue/whatever to
> 	  prevent delay at booting

When you first said this it sounded like an ugly workaround to me.
But perhaps it isn't (I'm not really convinced it is TBH), so it's
probably worth trying to follow up on too.

> 
> And then there's always:
> 	- don't create them at all, only do so if userspace asks
> 	  you to.

At first glance the impression I get from this is that it's an even
uglier workaround than delaying it but it might actually be the most
sensible way to handle this, as it's been called, silliness.

We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
the dentry op ->d_automount() to be called on access so, from a path
walk perspective, the dentries could just appear when needed.

The question I'd need to answer is do the kernfs nodes exist so
->d_automount() can discover if the node lookup is valid, and I think
the answer might be yes (but we would need to suppress udev
notifications for S_AUTOMOUNT nodes).
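
To make the idea concrete, here is a user-space toy model of "populate on
first access". It is only a sketch: struct toy_node, toy_find() and
toy_lookup() are made-up names, the needs_populate flag merely stands in
for S_AUTOMOUNT, and none of this is kernel or kernfs API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cheap backing node that always exists. */
struct toy_node {
    const char *name;
    bool needs_populate;  /* plays the role of S_AUTOMOUNT */
    bool populated;       /* set once the expensive part has been created */
};

/* Find a backing node by name; NULL means the lookup is not valid. */
static struct toy_node *toy_find(struct toy_node *nodes, size_t n,
                                 const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(nodes[i].name, name) == 0)
            return &nodes[i];
    return NULL;
}

/*
 * One path-walk step: validate the name against the backing nodes and
 * populate the entry on first access only, roughly where ->d_automount()
 * would fire. Returns true if the name resolved.
 */
static bool toy_lookup(struct toy_node *nodes, size_t n, const char *name,
                       size_t *populate_calls)
{
    assert(nodes != NULL && name != NULL && populate_calls != NULL);

    struct toy_node *node = toy_find(nodes, n, name);
    if (node == NULL)
        return false;

    if (node->needs_populate && !node->populated) {
        node->populated = true;
        (*populate_calls)++;
    }
    return true;
}
```

In this model, looking up the same name twice populates it once, names that
are never accessed are never populated, and a name with no backing node
fails the lookup without creating anything.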

The catch will be that this is "not" mounting per se, so anything
I do would probably be seen as an ugly hack that subverts the VFS
automount support.

If I could find a way to reconcile that I could probably do this.

Al, what say you on this?

> 
> You all have the userspace tools/users for this interface and know it
> best to know what will work for them.  If you don't, then hey, let's
> just delete the whole thing and see who screams :)

Please, no joking, I'm finding it hard enough to cope with this
disappointment as it is, ;)

Ian