On Thu, Apr 04, 2019 at 04:03:42PM -0400, Bradley C. Kuszmaul wrote:
> It would also be possible with our file system to preallocate inode
> numbers (inumbers).
>
> This isn't necessarily directly related to NFS, but one could
> imagine further extending NFS to allow a CREATE to happen entirely
> on the client by letting the client maintain a cache of preallocated
> inumbers.

So we'd need new protocol to allow clients to request inode numbers, and
I guess we'd also need VFS interfaces to allow our server to request them
from various filesystems. Naively, it sounds doable.

From what Jeff says, this isn't a requirement for correctness, it's an
optimization for the case where the client creates a file and then
immediately does a stat (or readdir?). Is that important?

(Rough sketches of the doubling scheme Bradley describes below and of
Jeff's return-early approach are appended at the end of this message.)

--b.

>
> Just for the fun of it, I'll tell you a little bit more about how we
> preallocate inumbers.
>
> For Oracle's File Storage Service (FSS), inumbers are cheap to
> allocate, and it's not a big deal if a few of them end up unused.
> Unused inode numbers don't use up any space. I would imagine that
> most B-tree-based file systems are like this. In contrast, in an
> ext-style file system, unused inumbers imply unused storage.
>
> Furthermore, FSS never reuses inumbers when files are deleted. It
> just keeps allocating new ones.
>
> There's a tradeoff between preallocating lots of inumbers to get
> better performance and potentially wasting the inumbers if the
> client were to crash just after getting a batch. If you only ask
> for one at a time, you don't get much performance, but if you ask
> for 1000 at a time, there's a chance that the client could start,
> ask for 1000 and then immediately crash, and then repeat the cycle,
> quickly using up many inumbers. Here's a 2-competitive algorithm to
> solve this problem (by "2-competitive" I mean that it's guaranteed
> to waste at most half of the inode numbers):
>
> * A client that has successfully created K files without crashing
> is allowed, when its preallocated cache of inumbers goes empty, to
> ask for another K inumbers.
>
> The worst-case lossage occurs if the client crashes just after
> getting K inumbers, and those inumbers go to waste. But we know
> that the client successfully created K files, so we are wasting at
> most half the inumbers.
>
> For a long-running client, each time it asks for another batch of
> inumbers, it doubles the size of the request. For the first file
> created, it does it the old-fashioned way. For the second file, it
> preallocates a single inumber. For the third file, it preallocates
> 2 inumbers. On the fifth file creation, it preallocates 4
> inumbers. And so forth.
>
> One obstacle to getting FSS to use any of these ideas is that we
> currently support only NFSv3. We need to get an NFSv4 server
> going, and then we'll be interested in doing the server work to
> speed up these kinds of metadata workloads.
>
> -Bradley
>
> On 4/4/19 11:22 AM, Chuck Lever wrote:
> >
> >>On Apr 4, 2019, at 11:09 AM, Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> >>
> >>On Wed, Apr 3, 2019 at 9:06 PM bfields@xxxxxxxxxxxx
> >><bfields@xxxxxxxxxxxx> wrote:
> >>>On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
> >>>>This proposal does look like it would be helpful. How does this
> >>>>kind of proposal play out in terms of actually seeing the light of
> >>>>day in deployed systems?
> >>>We need some people to commit to implementing it.
> >>>
> >>>We have 2-3 testing events a year, so ideally we'd agree to show up with
> >>>implementations at one of those to test and hash out any issues.
> >>>
> >>>We revise the draft based on any experience or feedback we get. If
> >>>nothing else, it looks like it needs some updates for v4.2.
> >>>
> >>>The on-the-wire protocol change seems small, and my feeling is that if
> >>>there's running code then documenting the protocol and getting it
> >>>through the IETF process shouldn't be a big deal.
> >>>
> >>>--b.
> >>>
> >>>>On 4/2/19 10:07 PM, bfields@xxxxxxxxxxxx wrote:
> >>>>>On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
> >>>>>>The create itself needs to be sync, but the attribute delegations mean
> >>>>>>that the client, not the server, is authoritative for the timestamps.
> >>>>>>So the client now owns the atime and mtime, and just sets them as part
> >>>>>>of the (asynchronous) delegreturn some time after you are done writing.
> >>>>>>
> >>>>>>Were you perhaps thinking about this earlier proposal?
> >>>>>>https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
> >>>>>That's it, thanks!
> >>>>>
> >>>>>Bradley is concerned about performance of something like untar on a
> >>>>>backend filesystem with particularly high-latency metadata operations,
> >>>>>so something like your unstable file creation proposal (or actual write
> >>>>>delegations) seems like it should help.
> >>>>>
> >>>>>--b.
> >>The serialized create with something like an untar is a
> >>performance-killer though.
> >>
> >>FWIW, I'm working on something similar right now for Ceph. If a Ceph
> >>client has adequate caps [1] for a directory and the dentry inode,
> >>then we should (in principle) be able to buffer up directory-morphing
> >>operations and flush them out to the server asynchronously.
> >>
> >>I'm starting with unlink (mostly because it's simpler), and am mainly
> >>just returning early when we do have the right caps -- after issuing
> >>the call but before the reply comes in. We should be able to do the
> >>same for link, rename and create too. Create will require the Ceph MDS
> >>to delegate out a range of inode numbers (and that bit hasn't been
> >>implemented yet).
> >>
> >>My thinking with all of this is that the buffering of directory-morphing
> >>operations is not as helpful as something like a pagecache
> >>write is, as we aren't that interested in merging operations that
> >>change the same dentry. However, being able to do them asynchronously
> >>should work really well. That should allow us to better parallelize
> >>create/link/unlink/rename on different dentries even when they are
> >>issued serially by a single task.
> >What happens if an asynchronous directory change fails (e.g. ENOSPC)?
> >
> >
> >>RFC5661 doesn't currently provide for writeable directory delegations,
> >>AFAICT, but they could eventually be implemented in a similar way.
> >>
> >>[1]: cephfs capabilities (aka caps) are like a delegation for a subset
> >>of inode metadata
> >>--
> >>Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
> >--
> >Chuck Lever
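
For concreteness, here is a rough user-space sketch of the doubling
preallocation scheme Bradley describes above. It is only an illustration:
server_alloc_inumbers() is a made-up stand-in for whatever RPC would hand
a client a batch of fresh inumbers, and the very first create is
simplified to a batch of one rather than "the old-fashioned way".

#include <stdint.h>
#include <stdio.h>

struct ino_cache {
	uint64_t next;		/* next preallocated inumber to hand out */
	uint64_t remaining;	/* inumbers left in the current batch */
	uint64_t created;	/* files successfully created so far */
};

/* Stand-in for the server-side allocator: returns the first of 'count'
 * fresh, never-reused inumbers (fake local state, for illustration only). */
static uint64_t server_alloc_inumbers(uint64_t count)
{
	static uint64_t next_free = 1000;
	uint64_t first = next_free;

	next_free += count;
	return first;
}

/* Hand out an inumber for a new file. When the cache runs dry, a client
 * that has created 'created' files may ask for that many more, so a
 * crash can waste at most as many inumbers as were already used. */
static uint64_t get_inumber(struct ino_cache *c)
{
	if (c->remaining == 0) {
		uint64_t batch = c->created ? c->created : 1;

		c->next = server_alloc_inumbers(batch);
		c->remaining = batch;
	}
	c->remaining--;
	c->created++;
	return c->next++;
}

int main(void)
{
	struct ino_cache cache = { 0 };

	/* Creating eight files requests batches of 1, 1, 2, and 4. */
	for (int i = 1; i <= 8; i++)
		printf("file %d gets inumber %llu\n", i,
		       (unsigned long long)get_inumber(&cache));
	return 0;
}

Each refill doubles the previous batch, and the worst case (crashing right
after a refill) throws away no more inumbers than have already been
consumed, which is the 2-competitive bound.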
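
And here is a similarly rough, purely user-space sketch of the
return-early pattern Jeff describes for buffered directory operations.
Nothing below is a real Ceph or kernel interface -- have_caps(),
async_unlink(), and flush_pending() are invented names. The point is only
that, with the right caps held, the caller queues the operation and
returns before the server replies, and a failure (Chuck's ENOSPC case)
would only be discovered later, when the queue is flushed.

#include <stdbool.h>
#include <stdio.h>

#define MAX_PENDING 16

struct pending_op {
	char name[64];		/* dentry the operation touches */
	int result;		/* "server's" answer, filled in at flush time */
};

static struct pending_op pending[MAX_PENDING];
static int npending;

/* Pretend we hold caps covering this directory. */
static bool have_caps(const char *dir)
{
	(void)dir;
	return true;
}

/* Issue an unlink. With caps we queue it and return immediately;
 * without caps we would fall back to waiting for the reply. */
static int async_unlink(const char *dir, const char *name)
{
	if (!have_caps(dir) || npending == MAX_PENDING)
		return -1;	/* synchronous fallback (not shown) */

	snprintf(pending[npending].name, sizeof(pending[npending].name),
		 "%s/%s", dir, name);
	pending[npending].result = 0;	/* optimistically assume success */
	npending++;
	return 0;		/* caller proceeds without waiting */
}

/* Later: process the replies. An error surfacing here, after the
 * caller has already moved on, is exactly the awkward case Chuck
 * asks about above. */
static void flush_pending(void)
{
	for (int i = 0; i < npending; i++)
		printf("reply for %s: %d\n", pending[i].name,
		       pending[i].result);
	npending = 0;
}

int main(void)
{
	/* Serially issued unlinks no longer serialize on server latency. */
	async_unlink("/dir", "a");
	async_unlink("/dir", "b");
	async_unlink("/dir", "c");
	flush_pending();
	return 0;
}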