On Tue, Apr 9, 2019 at 3:42 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> Cephfs, like most network filesystems, sucks badly at metadata-heavy
> workloads. The clients (kcephfs and libcephfs) always do synchronous
> calls to the MDS for directory-morphing operations (create, unlink, link
> and rename), and those RTT delays add up.
>
> In principle, cephfs is different in that if we have appropriate caps,
> we ought to be able to buffer up directory-morphing operations and
> eventually flush them out to the MDS prior to releasing those caps.
> While the biggest win from this approach is probably going to be in the
> create codepath, starting with unlink is a lot simpler.
>
> The idea here is that if we hold refs on the appropriate caps (Fx on the
> directory and Lx on the inode being unlinked), then we should be able to
> return from the syscall immediately after transmitting the unlink
> request, under the assumption that it will succeed. If the unlink does
> fail, then we'd report an error when the caller does an fsync on the
> parent directory.
>
> The series starts with some reorganization that allows the client to
> do async MDS requests, and then the last several patches add the ability
> to do an asynchronous unlink.
>
> For now, this is just an RFC series. I think we could probably take the
> first 7 or so patches for the next merge window, but the async unlink
> patches themselves should probably wait until the MDS better supports
> this.
>
> I did do a little performance testing with this, but it doesn't seem to
> improve things much if at all. Still, this is a good place to start with
> async MDS ops, and we may be able to improve things later.
>

So it turns out that, with some bugfixes, I now see about a 2x speedup
when removing a directory with 10000 files in it. I think that's enough
of a proof of concept that this approach is worthwhile, particularly once
we are able to create files asynchronously.

Simple test script:

--------------8<-----------------
#!/bin/sh

TESTDIR=/mnt/cephfs/test.$$

mkdir $TESTDIR
for i in `seq 1 10000`; do
    touch $TESTDIR/$i
done
time rm -r $TESTDIR
--------------8<-----------------

Testing on my crappy test rig:

Unpatched kernel:

$ ./test_unlink.sh
real	0m2.428s
user	0m0.011s
sys	0m0.131s

Patched kernel:

$ ./test_unlink.sh
real	0m1.272s
user	0m0.007s
sys	0m0.127s

...and the numbers were fairly consistent over multiple runs.

I pushed a tag to my repo if anyone wants to have a look, but I'll avoid
re-posting for now. This relies on some out-of-tree (and quite possibly
dangerous) MDS patches too, so it's probably not worth wider testing
just yet.

https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ceph-async-unlink-20190410

-- 
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
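
[Editor's note: a purely illustrative userspace sketch, not part of the
series. Given the semantics described above, where an asynchronous unlink
that later fails is reported when the caller fsyncs the parent directory,
an application that wants to confirm the unlink actually happened might do
something like the following (the unlink_and_confirm() wrapper is
hypothetical; only standard POSIX calls are used):]

/* unlink a name and surface any deferred error via fsync of the parent dir */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int unlink_and_confirm(const char *dir, const char *name)
{
	int dirfd, ret = 0;

	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return -1;

	if (unlinkat(dirfd, name, 0) < 0) {
		/* may return before the MDS has processed the request */
		ret = -1;
	} else if (fsync(dirfd) < 0) {
		/* a failed async unlink would be reported here */
		perror("fsync on parent dir");
		ret = -1;
	}

	close(dirfd);
	return ret;
}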