On a related note, I have always wondered if there was any interest in
having something like /proc/PID/io just for tracking NFS client
throughput? The problem is that if you copy a file from NFS to a local
filesystem, there is no way to infer whether a process did an NFS
read/write (or any NFS IO at all). It is useful to be able to track
per-PID network IO, and things like cgroups (v1) do not provide an easy
way to do that. In our case, 99.9% of all network IO a render blade does
is NFS client traffic.

To your question, I can't say off-hand what the BPF equivalent is
(though I've tacked a very rough bpftrace sketch onto the end of this
mail), but we used systemtap to track per-process and per-file IO on
each render node. However, since we are only interested in IO that
results in actual network packets, we needed to exclude reads that were
satisfied from the page cache. We did that by watching
vfs.add_to_page_cache and naively assuming every hit resulted in 4k of
NFS reads over the network: if a page is only now being added to the
page cache, it cannot have been cached already, so the read must have
come over the network. The aggregate from all clients matched the
network traffic on our NFS servers pretty well, so this approach worked
for us. We could track all client file IO and correlate it with what the
server was doing over the network.

The systemtap code was something like the following, where files were
tracked by nfs.fop.open:

probe nfs.fop.open {
    pid = pid()
    filename = sprintf("%s", d_path(&$filp->f_path))
    if (filename =~ "/net/.*/data") {
        files[pid, ino] = filename
        if (!([pid, ino] in procinfo))
            procinfo[pid, ino] = sprintf("%s", proc())
    }
}

probe vfs.add_to_page_cache {
    pid = pid()
    if ([pid, ino] in files) {
        readpage[pid, ino] += 4096
        files_store[pid, ino] = sprintf("%s", files[pid, ino])
    }
}

But I should say that this no longer works in newer kernels since the
addition of folios, and I have not yet figured out a better way to track
NFS client reads while excluding page cache hits.

For the writes I was just using vfs.write and vfs.writev - I was not too
concerned about writeback delays.

probe vfs.write {
    pid = pid()
    if ([pid, ino] in files) {
        write[pid, ino] += bytes_to_write
        files_store[pid, ino] = sprintf("%s", files[pid, ino])
    }
}

I hope that helps. Being from the same industry, we obviously have
similar requirements... ;)

Daire

On Fri, 21 Jul 2023 at 23:46, <lars@xxxxxxxxx> wrote:
>
> Hello,
>
> I'm using BPF to do NFS operation accounting for user-space processes.
> I'd like to include the number of bytes read and written to each file a
> process opens over NFS.
>
> For write operations, I'm currently using an fexit probe on the
> nfs_writeback_done function, and my program appears to be getting the
> information I'm hoping for. But I can see that under some circumstances
> the actual operations are being done by kworker threads, and so the PID
> reported by the BPF program is for that kworker instead of the
> user-space process that requested the write.
>
> Is there a more appropriate function to probe for this information if I
> only want it triggered in the context of the user-space process that
> performed the write? If not, I'm wondering if there's enough
> information in a probe triggered in the kworker context to track down
> the user-space PID that initiated the writes.
>
> I didn't find anything related in the kernel's Documentation directory,
> and I'm not yet proficient enough with the vfs, nfs, and sunrpc code to
> find an appropriate function myself.
>
> If it matters, our infrastructure is all based on NFSv3.
>
> Thanks for any leads or documentation pointers!
> Lars
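
P.S. On the BPF side, a very rough, untested bpftrace sketch of the same
"count at the entry point, in the caller's context" idea might look
something like the below. It hooks nfs_file_read()/nfs_file_write() (the
NFS ->read_iter/->write_iter handlers in fs/nfs/file.c; this is an
assumption about your kernel, as the symbols may be inlined or renamed),
which run in the context of the process doing the IO rather than a
kworker, so pid is the one you want. Caveats: unlike the
vfs.add_to_page_cache trick above, the read counter includes reads
satisfied from the page cache, and the write counter measures bytes
accepted into the page cache rather than writeback completion.

// Rough sketch only: per-process NFS read/write byte counters.
// The read_iter/write_iter handlers return the byte count on success or
// a negative errno, so cast to signed before filtering.
kretprobe:nfs_file_read  /(int64)retval > 0/ { @nfs_read_bytes[pid, comm]  += retval; }
kretprobe:nfs_file_write /(int64)retval > 0/ { @nfs_write_bytes[pid, comm] += retval; }

// Dump and reset the counters every 10 seconds.
interval:s:10 {
    print(@nfs_read_bytes);
    print(@nfs_write_bytes);
    clear(@nfs_read_bytes);
    clear(@nfs_write_bytes);
}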