On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote:
> On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > > If you have a many-core machine, and have many threads all
> > > > wanting to briefly lock a given file (udev is known to do
> > > > this), you can get quite poor performance.
> > > >
> > > > When one thread releases a lock, it wakes up all other threads
> > > > that are waiting (classic thundering-herd) - one will get the
> > > > lock and the others go to sleep.
> > > > When you have few cores, this is not very noticeable: by the
> > > > time the 4th or 5th thread gets enough CPU time to try to
> > > > claim the lock, the earlier threads have claimed it, done what
> > > > was needed, and released.  With 50+ cores, the contention can
> > > > easily be measured.
> > > >
> > > > This patchset creates a tree of pending lock requests in which
> > > > siblings don't conflict and each lock request does conflict
> > > > with its parent.  When a lock is released, only requests which
> > > > don't conflict with each other are woken.
> > > >
> > > > Testing shows that lock-acquisitions-per-second is now fairly
> > > > stable even as the number of contending processes goes to
> > > > 1000.  Without this patch, locks-per-second drops off steeply
> > > > after a few tens of processes.
> > > >
> > > > There is a small cost to this extra complexity.
> > > > At 20 processes running a particular test on 72 cores, lock
> > > > acquisitions per second drop from 1.8 million to 1.4 million
> > > > with this patch.  For 100 processes, this patch still provides
> > > > 1.4 million while without this patch there are about 700,000.
> > > >
> > > > NeilBrown
> > > >
> > > > ---
> > > >
> > > > NeilBrown (4):
> > > >       fs/locks: rename some lists and pointers.
> > > >       fs/locks: allow a lock request to block other requests.
> > > >       fs/locks: change all *_conflict() functions to return bool.
> > > >       fs/locks: create a tree of dependent requests.
> > > >
> > > >
> > > >  fs/cifs/file.c                  |    2 -
> > > >  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
> > > >  include/linux/fs.h              |    5 +
> > > >  include/trace/events/filelock.h |   16 ++--
> > > >  4 files changed, 103 insertions(+), 62 deletions(-)
> > > >
> > >
> > > Nice work! I looked over this and I think it looks good.
> > >
> > > I made an attempt to fix this issue several years ago, but my
> > > method sucked as it ended up penalizing the unlocking task too
> > > much. This is much cleaner and should scale well overall, I
> > > think.
> >
> > I think I also took a crack at this at one point while I was at
> > UM/CITI and never got anything I was happy with.  Looks like good
> > work!
> >
> > I remember one main obstacle was that I felt like I never had a
> > good benchmark....
> >
> > How did you choose this workload and hardware?  Was it in fact
> > udev (booting a large machine?), or was there some other
> > motivation?
>
> Some details can be found here:
>
> https://github.com/systemd/systemd/pull/9551
>
> https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
> and comments further down. 8667 has been superseded by 9551.
>
> The original problem was that the symlink
> "/dev/disk/by-partlabel/primary" may be claimed by _many_ devices on
> big systems under certain distributions, which use older versions of
> parted for partition creation on GPT disk labels. I've seen systems
> with literally thousands of contenders for this symlink.
>
> We found that with current systemd, this can cause a boot-time race
> where the wrong device is eventually assigned the "best" contender
> for the symlink (e.g. a partition on a multipath member rather than
> a partition on the multipath map itself). I extended the udev test
> suite, creating a test that makes this race easily reproducible, at
> least on systems with many CPUs (the test host I used most had 72
> cores).
>
> I created a udev patch that would use systemd's built-in fcntl-based
> locking to avoid this race, but I found that it would massively slow
> down the system, and found the contention to be in the spin locks in
> posix_lock_common(). (I therefore added more patches to systemd to
> make the locking scale better, but that's irrelevant for the
> kernel-side discussion.)
>
> I further created an artificial test just for the scaling of
> fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
> scaling problem easily and do some quantitative experiments. My
> tests didn't use any byte ranges, only "full" locking of 0-byte
> files.

Thanks for the explanation!

I wonder whether there's also anything we could do to keep every
waiter from having to take the same spinlock.

--b.

> > Not that I'm likely to do it any time soon, but could you share
> > sufficient details for someone else to reproduce your results?
> >
> > --b.
>
> The udev test code can be found in the above links. It adds a new
> script "test/sd-script.py" that would be run after
> "test/sys-script.py", with a numeric argument indicating the number
> of contenders for the test link to be created, such as
> "python test/sd-script.py test 1000". The next step would be to run
> "test/udev-test.pl 152", e.g. under perf (152 is the test ID of the
> scaling test).
>
> Of course I can also share my other test program if you so desire.
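
For anyone who wants to reproduce the fcntl(F_OFD_SETLKW) contention
pattern described above without pulling in the udev test suite, a
minimal sketch along these lines should show the same thundering-herd
behaviour on a many-core machine. This is illustrative only, not
Martin's actual test program; the lock-file path, iteration count and
default process count are arbitrary choices:

/*
 * Illustrative sketch only -- not Martin's actual test program.
 * Assumes Linux with open file description locks (F_OFD_SETLKW).
 * Each forked child repeatedly takes and drops a whole-file write
 * lock on one shared 0-byte file, so all children contend on the
 * same lock.
 */
#define _GNU_SOURCE		/* for F_OFD_SETLKW */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERATIONS 10000

static void contend(const char *path)
{
	/* Open after fork() so each child has its own open file
	 * description; OFD locks conflict between descriptions. */
	int fd = open(path, O_RDWR | O_CREAT, 0644);
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,	/* 0 = to end of file, i.e. whole file */
	};
	int i;

	if (fd < 0) {
		perror("open");
		exit(1);
	}
	for (i = 0; i < ITERATIONS; i++) {
		fl.l_type = F_WRLCK;
		if (fcntl(fd, F_OFD_SETLKW, &fl) < 0) {
			perror("F_OFD_SETLKW");
			exit(1);
		}
		fl.l_type = F_UNLCK;
		fcntl(fd, F_OFD_SETLKW, &fl);
	}
	exit(0);
}

int main(int argc, char **argv)
{
	int nproc, i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <lockfile> [nprocs]\n", argv[0]);
		return 1;
	}
	nproc = argc > 2 ? atoi(argv[2]) : 10;

	for (i = 0; i < nproc; i++)
		if (fork() == 0)
			contend(argv[1]);
	for (i = 0; i < nproc; i++)
		wait(NULL);
	return 0;
}

Run it as, say, "./ofd-contend /tmp/testlock 100", e.g. under perf,
and compare lock acquisitions per second as the number of contenders
grows.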
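
To make the wakeup scheme from the cover letter more concrete (blocked
requests form a tree in which each request conflicts with its parent
and siblings don't conflict, so an unlock only has to wake the direct
children), here is a toy user-space model of that idea. It uses
simplified byte-range locks and is not the actual fs/locks.c code:

/*
 * Toy model of the wakeup scheme described in the cover letter --
 * NOT the actual fs/locks.c code.
 */
#include <stdbool.h>
#include <stdio.h>

struct request {
	long start, end;		/* locked byte range, inclusive */
	struct request *child;		/* first request blocked behind us */
	struct request *sibling;	/* next request at the same level */
};

/* Two exclusive byte-range locks conflict iff their ranges overlap. */
static bool conflicts(const struct request *a, const struct request *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/*
 * Queue @new behind @blocker: if it conflicts with a request that is
 * already blocked, push it further down the tree; otherwise it
 * becomes a sibling of the existing children.
 */
static void block_request(struct request *blocker, struct request *new)
{
	struct request *c;

	for (c = blocker->child; c; c = c->sibling) {
		if (conflicts(new, c)) {
			block_request(c, new);
			return;
		}
	}
	new->sibling = blocker->child;
	blocker->child = new;
}

/*
 * On unlock, wake only the direct children.  Each woken request keeps
 * its own subtree, so anything that would conflict with it stays
 * asleep until it, in turn, unlocks.
 */
static void unlock(struct request *owner)
{
	struct request *c;

	for (c = owner->child; c; c = c->sibling)
		printf("waking [%ld-%ld] (its subtree stays blocked)\n",
		       c->start, c->end);
}

int main(void)
{
	struct request holder = { 0, 99 };	/* currently holds 0-99 */
	struct request a = { 0, 9 };
	struct request b = { 10, 19 };
	struct request c = { 5, 14 };

	block_request(&holder, &a);	/* conflicts only with the holder */
	block_request(&holder, &b);	/* doesn't conflict with a: sibling */
	block_request(&holder, &c);	/* conflicts with a blocked request,
					 * so it is queued below it */
	unlock(&holder);		/* wakes a and b, but not c */
	return 0;
}

Releasing the 0-99 lock wakes only the two waiters whose ranges don't
overlap each other (0-9 and 10-19); the 5-14 request stays queued
behind one of them instead of being woken just to go back to sleep,
which is where the thundering herd goes away.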