On Thu, Apr 11, 2019 at 11:11:17AM +1000, Dave Chinner wrote: > On Mon, Apr 08, 2019 at 09:37:09AM -0700, Davidlohr Bueso wrote: > > On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote: > > > On Fri 05-04-19 08:17:30, Dave Chinner wrote: > > > > FYI, I'm working on a range lock implementation that should both > > > > solve the performance issue and the reader starvation issue at the > > > > same time by allowing concurrent buffered reads and writes to > > > > different file ranges. > > > > > > Are you aware of range locks Davidlohr has implemented [1]? It didn't get > > > merged because he had no in-tree user at the time (he was more aiming at > > > converting mmap_sem which is rather difficult). But the generic lock > > > implementation should be well usable. > > > > > > Added Davidlohr to CC. > > > > > > Honza > > > > > > [1] https://lkml.org/lkml/2017/3/7/22 > > > > fyi this was the latest version (had some naming updates per peterz). > > > > https://lkml.org/lkml/2018/2/4/232 > > No, I wasn't aware of these because they haven't ever been posted to > a list I subscribe to and they haven't been merged. I'll go have a > look at them over the next few days. Rather disappointing, to tell the truth. The implementation doesn't scale nearly as a rwsem for concurrent IO using shared access (i.e. XFS direct IO). That's the baseline we need to get close to for range locks to be a viable prospect. Fio randrw numbers on a single file on a pmem device on a 16p machine using 4kB AIO-DIO iodepth 128 w/ fio on 5.1.0-rc3: IOPS read/write (direct IO) fio processes rwsem rangelock 1 78k / 78k 75k / 75k 2 131k / 131k 123k / 123k 4 267k / 267k 183k / 183k 8 372k / 372k 177k / 177k 16 315k / 315k 135k / 135k So uncontended, non-overlapping range performance is not really even in the ballpark right now, unfortunately. To indicate that range locking is actually working, let's do buffered read/write, which takes the rwsem exclusive for writes and so will be bound by that: IOPS read/write (buffered IO) fio processes rwsem rangelock 1 57k / 57k 64k / 64k 2 61k / 61k 111k / 111k 4 61k / 61k 228k / 228k 8 55k / 55k 195k / 195k 16 15k / 15k 40k / 40k So the win for mixed buffered IO is huge but it's at the cost of direct IO performance. We can't make that tradeoff, so if this implementation cannot be substantially improved we really can'ti use it. FUndamentally, the problem is that the interval tree work is all done under a spinlock, so by the time we get to 16p, 70% of the 16p that is being burnt is on the spinlock, and only 30% of the CPU time is actually doing IO work: + 70.78% 2.50% [kernel] [k] do_raw_spin_lock + 69.72% 0.07% [kernel] [k] _raw_spin_lock_irqsave - 68.27% 68.10% [kernel] [k] __pv_queued_spin_lock_slowpath ..... + 15.61% 0.07% [kernel] [k] range_write_unlock + 15.39% 0.06% [kernel] [k] range_read_unlock + 15.11% 0.12% [kernel] [k] range_read_lock + 15.11% 0.13% [kernel] [k] range_write_lock ..... + 12.92% 0.11% [kernel] [k] range_is_locked FWIW, I'm not convinced about the scalability of the rb/interval tree, to tell you the truth. We got rid of the rbtree in XFS for cache indexing because the multi-level pointer chasing was just too expensive to do under a spinlock - it's just not a cache efficient structure for random index object storage. FWIW, I have basic hack to replace the i_rwsem in XFS with a full range read or write lock with my XFS range lock implementation so it just behaves like a rwsem at this point. It is not in any way optimised at this point. Numbers for same AIO-DIO test are: IOPS read/write (direct IO) processes rwsem DB rangelock XFS rangelock 1 78k / 78k 75k / 75k 74k / 74k 2 131k / 131k 123k / 123k 134k / 134k 4 267k / 267k 183k / 183k 246k / 246k 8 372k / 372k 177k / 177k 306k / 306k 16 315k / 315k 135k / 135k 240k / 240k Performance is much closer to rwsems, but the single lock critical section still causes serious amounts of cacheline contention on pure shared-read lock workloads like this. FWIW, the XFS rangelock numbers match what I've been seeing with userspace perf tests using unoptimised pthread mutexes - ~600,000 lock/unlock cycles a second over ~100 concurrent non-overlapping ranges seem to be the limitation before futex contention burns down the house AS it is, I've done all the work to make XFS use proper range locks on top of your patch - I'll go back tomorrow and modify the XFS range locks to use the same API as your lock implementation. I'm interested to see what falls out when they are used as real range locks rather than a glorified rwsem.... Cheers, Dave. PS: there's a double lock bug in range_is_locked()..... -- Dave Chinner david@xxxxxxxxxxxxx