Re: [PATCH 00/10] exposing knfsd opens to userspace

Andreas Dilger <adilger@xxxxxxxxx> · Fri, 26 Apr 2019 13:00:19 +0200

> On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@xxxxxxxx> wrote:
> 
> On Thu, Apr 25 2019, Andreas Dilger wrote:
> 
>> On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@xxxxxxxxxx> wrote:
>>> 
>>> From: "J. Bruce Fields" <bfields@xxxxxxxxxx>
>>> 
>>> The following patches expose information about NFSv4 opens held by knfsd
>>> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
>>> unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).
>>> 
>>> The approach is to add a new directory /proc/fs/nfsd/clients/ with
>>> subdirectories for each active NFSv4 client.  Each subdirectory has an
>>> "info" file with some basic information to help identify the client and
>>> an "opens" directory that lists the opens held by that client.
>>> 
>>> I got it working by cobbling together some poorly-understood code I
>>> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
>>> and tell me what I've got wrong, they're more than welcome, but at this
>>> stage I'm more curious for feedback on the interface.
>> 
>> Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
>> My understanding is that "complex" files are verboten in procfs and sysfs?
>> We've been going through a lengthy process to move files out of procfs
>> into sysfs and debugfs as a result (while trying to maintain some kind of
>> compatibility in the user tools), but if it is possible to use a separate
>> filesystem to hold all of the stats/parameters I'd much rather do that
>> than use debugfs (which has become root-access-only in newer kernels).
> 
> /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
> filesystem, originally created to replace the nfsd-specific systemcall.
> So the nfsd developers have a fair degree of latitude as to what can go
> in there.
> 
> But I *don't* think it is a good idea to follow this pattern.  Creating
> a separate control filesystem for every different module that thinks it
> has different needs doesn't scale well.  We could end up with dozens of
> tiny filesystems that all need to be mounted at just the right place.  I
> don't think that is healthy for Linux.
> 
> Nor do I think we should be stuffing stuff into debugfs that isn't
> really for debugging.  That isn't healthy either.
> 
> If sysfs doesn't meet our needs, then we need to raise that in
> appropriate fora and present a clear case and try to build consensus -
> because if we see a problem, then it is likely that others do too.

I definitely *do* see the restrictions of sysfs as being a problem, and I'd
guess NFS developers thought the same, since the "one value per file"
paradigm means that any kind of complex data needs to be split over
hundreds or thousands of files, which is very inefficient for userspace to
use.  Consider what would happen if /proc/slabinfo had to follow the sysfs
paradigm: on my system it would need about 225 directories (one per slab)
and 3589 separate files in total (one per value), all of which would need
to be read every second to implement "slabtop".  Running strace on "top"
shows it taking 0.25s of wall time to open and read the files for only 350
processes on my system, at 2 files per process ("stat" and "statm").  Those
files hold 44 and 7 values, respectively, so splitting them into one file
per value as sysfs requires would make this far worse.

I think it would make a lot more sense to have one file per item of interest,
and make it e.g. a well-structured YAML format ("name: value", with indentation
denoting a hierarchy/grouping of related items) so that it can be both human
and machine readable, easily parsed by scripts using bash or awk, rather than
having an explicit directory+file hierarchy.  Files like /proc/meminfo and
/proc/<pid>/status are already YAML-formatted (or almost so), so it isn't ugly
like XML encoding.
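
To give a rough idea of how little code such a format needs on the consumer
side, here is a minimal Python sketch of a parser for "name: value" lines
with indentation-based grouping.  The sample text and its field names
("client", "address", "opens") are purely hypothetical and are not taken
from the patch series; the parser also assumes space indentation and does
not attempt to handle full YAML.

```python
def parse_indented(text):
    """Parse 'name: value' lines into nested dicts, grouping by indentation.

    A line with an empty value ("name:") opens a group; more deeply
    indented lines that follow become members of that group.
    """
    root = {}
    stack = [(-1, root)]  # (indent of the group header, group dict)
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        name, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a "name: value" line
        # Close any groups that this line is not nested inside.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        name, value = name.strip(), value.strip()
        if value:
            parent[name] = value
        else:
            child = {}
            parent[name] = child
            stack.append((indent, child))
    return root


# Hypothetical sample: one flat line in the style of /proc/meminfo and
# one made-up group of related items.
sample = """\
MemTotal:       16309532 kB
client:
  address: 192.0.2.1
  opens: 3
"""

parsed = parse_indented(sample)
```

A single read of one file then yields every value for the item of interest,
instead of one open/read/close cycle per value.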

> This is all presumably in the context of Lustre and while lustre is
> out-of-tree we don't have a lot of leverage.  So I wouldn't consider
> pursuing anything here until we get back upstream.

Sure, except that is a catch-22.  We can't discuss what is needed until
the code is in the kernel, but we can't get it into the kernel until the
files it puts in /proc have been moved into /sys?

Cheers, Andreas
