On Tue, Jun 23, 2020 at 02:33:48AM -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
>
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
>
> Weeeelll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs. So in that sense,
> any patches would need to go thru you. So, your opinions do matter.

Sure, but "help, I'm abusing your code interface, so fix your code
interface and not my caller code" really isn't the best mantra :)

> > Anyway, as I have said before, my first guesses would be:
> >   - increase the granularity size of the "memory chunks", reducing
> >     the number of devices you create.
>
> This would mean finding every utility that relies on this behavior.
> That may be possible, although not easy, for distro or platform
> software, but it's hard to guess what user-related utilities may have
> been created by other consumers of those distros or that platform. In
> any case, removing an interface without warning is a hanging offense
> in many Linux circles.

I agree, so find out who uses it! You can search all debian tools
easily. You can ask any closed-source setup tools that are on your
platforms if they use it. You can "break it and see if anyone notices",
which is what we do all the time.

The "hanging offence" is "I'm breaking this even if you are using it!".

> >   - delay creating the devices until way after booting, or do it
> >     on a totally different path/thread/workqueue/whatever to
> >     prevent delay at booting
>
> This has been considered, but it again requires a full list of utilities relying on this interface and determining which of them may want to run before the devices are "loaded" at boot time. It may be few, or even zero, but it would be a much more disruptive change in the boot process than what we are suggesting.

Is that really the case? I strongly suggest you all do some research
here.

Oh, and wrap your email lines please...

> > And then there's always:
> >   - don't create them at all, only do so if userspace asks
> >     you to.
>
> If they are done in parallel on demand, you'll see the same problem (load average of 1000+, contention in the same spot.) You obviously won't hold up the boot, of course, but your utility and anything else running on the machine will take an unexpected pause ... for somewhere between 30 and 90 minutes. Seems equally unfriendly.

I agree, but it shouldn't be shutting down the whole box, other stuff
should run just fine, right? Is this platform really that "weak" that
it can't handle this happening in a single thread/cpu?

> A variant of this, which does have a positive effect, is to observe that coldplug during initramfs does seem to load up the memory device tree without incident. We do a second coldplug after we switch roots and this is the one that runs into timer issues. I have asked "those that should know" why there is a second coldplug. I can guess but would prefer to know to avoid that screaming option. If that second coldplug is unnecessary for the kernfs memory interfaces to work correctly, then that is an alternate, and perhaps even better solution. (It wouldn't change the fact that kernfs was not built for speed and this problem remains below the surface to trip up another.)
>
> However, nobody I've found can say that is safe, and I'm not fond of the 'see who screams' test solution.
>
> > You all have the userspace tools/users for this interface and know it
> > best to know what will work for them. If you don't, then hey, let's
> > just delete the whole thing and see who screams :)
>
> I guess I'm puzzled by why everyone seems offended by suggesting we change a mutex to a rw semaphore. In a vacuum, sure, but we have before and after numbers. Wouldn't the same cavalier logic apply? Why not change it and see who screams?

I am offended as a number of years ago this same user of kernfs/sysfs
did a lot of work to reduce the number of contentions in kernfs for this
same reason. After that work was done, "all was good". Now this comes
along again, blaming kernfs/sysfs, not the caller.

Memory is only going to get bigger over time, you might want to fix it
this way and then run away. But we have to maintain this for the next
20+ years, and you are not solving the root-problem here. It will come
back again, right?

> I haven't heard any criticism of the patch itself - I'm hearing criticism of the problem. This problem is not specific to memory devices. As we get larger systems, we'll see it elsewhere. We do already see a mild form of this when fibre finds 1000-2000 fibre disks and goes to add them in parallel. Small memory chunks introduce the problem at a level two orders of magnitude bigger, but eventually other devices will be subject to it too. Why not address this now?

1-2k devices are easy to handle, we handle 30k scsi devices today with
no problem at all, and have for 15+ years. We are "lucky" there that
the hardware is slower than kernfs/sysfs so that we are not the
bottleneck at all.

> 'Doctor, it hurts when I do this'
> 'Then don't do that'
>
> Funny as a joke. Less funny as a review comment.

Treat the system as a whole please, don't go for a short-term fix that
we all know is not solving the real problem here.

thanks,

greg k-h