Re: parallel file create rates (+high latency)

On Mon, 24 Jan 2022 at 20:50, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>
> On Mon, Jan 24, 2022 at 08:10:07PM +0000, Daire Byrne wrote:
> > On Mon, 24 Jan 2022 at 19:38, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> > >
> > > On Sun, Jan 23, 2022 at 11:53:08PM +0000, Daire Byrne wrote:
> > > > I've been experimenting a bit more with high latency NFSv4.2 (200ms).
> > > > I've noticed a difference between the file creation rates when you
> > > > have parallel processes running against a single client mount creating
> > > > files in multiple directories compared to in one shared directory.
> > >
> > > The Linux VFS requires an exclusive lock on the directory while you're
> > > creating a file.
> >
> > Right. So when I mounted the same server/dir multiple times using
> > namespaces, all I was really doing was making the VFS *think* I wanted
> > locks on different directories even though the remote server directory
> > was actually the same?
>
> In that scenario the client-side locks are probably all different, but
> they'd all have to wait for the same lock on the server side, yes.

Yea, I was totally overthinking the problem. Thanks for setting me straight.

> > > So, if L is the time in seconds required to create a single file, you're
> > > never going to be able to create more than 1/L files per second, because
> > > there's no parallelism.
> >
> > And things like directory delegations can't help with this kind of
> > workload? You can't batch directories locks or file creates I guess.
>
> Alas, there are directory delegations specified in RFC 8881, but they
> are read-only, and nobody's implemented them.
>
> Directory write delegations could help a lot, if they existed.

Shame. And tackling that problem is way past my ability.
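
To put Bruce's 1/L bound in concrete terms, here's a minimal timing loop (paths and counts are just illustrative). Run against a 200ms mount, each create costs a full round trip, so a single directory tops out at about 1/0.2 = 5 creates/sec no matter how many processes you throw at it:

```shell
# Minimal sketch: time N serial creates in a single directory.
# Over a 200ms NFS mount each create waits one full round trip, so
# the rate is bounded by 1/L ~= 5 creates/sec. Run locally it just
# shows the loop serialising on the directory lock.
DIR=$(mktemp -d)    # stand-in for the shared (high-latency) directory
N=100
t0=$(date +%s%N)
for i in $(seq 1 "$N"); do
  : > "$DIR/f_$i"   # one synchronous create at a time
done
t1=$(date +%s%N)
echo "created $N files in $(( (t1 - t0) / 1000000 )) ms"
```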

> > > So, it's not surprising you'd get a higher rate when creating in
> > > multiple directories.
> > >
> > > Also, that lock's taken on both client and server.  So it makes sense
> > > that you might get a little more parallelism from multiple clients.
> > >
> > > So the usual advice is just to try to get that latency number as low as
> > > possible, by using a low-latency network and storage that can commit
> > > very quickly.  (An NFS server isn't permitted to reply to the RPC
> > > creating the new file until the new file actually hits stable storage.)
> > >
> > > Are you really seeing 200ms in production?
> >
> > Yea, it's just a (crazy) test for now. This is the latency between two
> > of our offices. Running batch jobs over this kind of latency with a
> > NFS re-export server doing all the caching works surprisingly well.
> >
> > It's just these file creations that are the deal breaker. A batch job
> > might create 100,000+ files in a single directory across many clients.
> >
> > Maybe many containerised re-export servers in round-robin with a
> > common cache is the only way to get more directory locks and file
> > creates in flight at the same time.
>
> ssh into the original server and create the files there?

That might work. Perhaps we can figure out the expected file outputs
and make that a local LAN task that runs first.
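
Something of this shape, run on the origin server itself, since local creates don't pay the 200ms round trip per directory operation (the directory and file names here are made up):

```shell
# Hypothetical pre-create step: if the batch job's output names are
# predictable, create them up front as a local/LAN task on the origin
# server, so the high-latency path never has to create into the
# directory at all. Names are illustrative.
JOBDIR=$(mktemp -d)   # stand-in for the job's output dir on the server
for frame in $(seq -w 1 200); do
  : > "$JOBDIR/frame_$frame.exr"   # pre-create each expected output
done
echo "pre-created $(ls "$JOBDIR" | wc -l) files"
```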

Actually, I can probably improve things by having each batch process
(client of the re-export server) create its output files in a unique
directory rather than putting all the files in one big shared
directory. There is still the slow creation of all those subdirs, but
that's an order of magnitude fewer creates in any one directory. The
files created across the subdirs can then be parallelised on the wire.
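
Roughly this shape (worker counts and names are illustrative): each worker takes the shared directory's lock once for its mkdir, and after that its creates contend only within its own directory:

```shell
# Sketch of the per-worker subdirectory scheme: the shared parent
# directory is only locked for a handful of mkdirs, and each worker's
# file creates then serialise only against its own directory, so they
# can overlap on the wire. All names are illustrative.
OUT=$(mktemp -d)      # stand-in for the re-exported output directory
for worker in 1 2 3 4; do
  (
    dir="$OUT/worker_$worker"
    mkdir -p "$dir"              # one create in the shared parent
    for i in $(seq 1 50); do
      : > "$dir/file_$i"         # contends only with this worker
    done
  ) &
done
wait
```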

> I've got no help, sorry.
>
> The client-side locking does seem redundant to some degree, but I don't
> know what to do about it.

Yea, it does seem like the server is the ultimate arbiter, and the
fact that multiple clients can achieve much higher rates of
parallelism does suggest that the per-client VFS locking is somewhat
redundant and limiting (in this super niche case).

I can have multiple round-robin re-export servers but then they all
have to fetch the same data multiple times.

I'll read up on containerised NFS, but it's not clear that you could
ever share a pagecache or fscache across multiple re-export instances
on the same server. The same goes for Neil's "namespaces" patch, I
think.

Thanks again for your help (and patience).

Daire


