On Wed, Apr 15, 2015 at 11:56:53AM -0700, Andrew Morton wrote:
> On Wed, 15 Apr 2015 12:22:56 -0600 Jens Axboe <axboe@xxxxxx> wrote:
>
> > Hi,
> >
> > This is a reposting of a patch that was originally in the blk-mq series.
> > It has huge upside on shared access to a multiqueue device doing
> > O_DIRECT, it's basically the scaling block that ends up killing
> > performance. A quick test here reveals that we spend 30% of all system
> > time just incrementing and decrementing inode->i_dio_count. For block
> > devices this isn't useful at all, as we don't need protection against
> > truncate. For that test case, performance increases about 3.6x (!!) by
> > getting rid of this inc/dec per IO.
> >
> > I've cleaned it up a bit since last time, integrating the checks in
> > inode_dio_done() and adding an inode_dio_begin() so that callers don't
> > need to know about this.
> >
> > We've been running a variant of this patch in the FB kernel for a while.
> > I'd like to finally get this upstream.
....
> Is there similar impact to direct-io-to-file? It would be nice to fix
> that up also. Many filesystems do something along the lines of
>
>	atomic_inc(i_dio_count);
>	wibble()
>	atomic_dec(i_dio_count);
>	__blockdev_direct_IO(...);
>
> and with your patch I think we could change them to
>
>	atomic_inc(i_dio_count);
>	wibble()
>	__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
>	atomic_dec(i_dio_count);

Can't do it quite that way. AIO requires the i_dio_count to be held
until IO completion for all outstanding IOs. i.e. the increment needs
to be in the submission path, the decrement needs to be in the
dio_complete() path, otherwise we have AIO DIO in progress without a
reference count we can wait on in truncate.

Yes, we might be able to pull it up to the filesystem level now that
dio_complete() is only ever called once per __blockdev_direct_IO()
call, so that may be a solution we can use via filesystem ->end_io
callbacks provided to __blockdev_direct_IO.

> which would halve the atomic op load.

XFS doesn't touch i_dio_count, so it would make no difference to it at
all, which is important, given the DIO rates I can drive through a
single file on XFS - it becomes rwsem cacheline bound on the shared IO
lock at about 2 million IOPS (random 4k read) to a single file.

FWIW, keep in mind that this i_dio_count originally came from XFS in
the first place, and was pushed into the DIO layer to solve all the
DIO vs extent manipulation problems other filesystems were having...

> But that's piling hack on top of hack. Can we change the
> do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
> increment i_dio_count"? ie: exclusion against truncate is wholly the
> caller's responsibility. That way, this awkward sharing of
> responsibility between caller and callee gets cleaned up and
> DIO_IGNORE_TRUNCATE goes away.
>
> inode_dio_begin() would be a good place to assert that i_mutex is
> held, btw.

Can't do that, either, as filesystems like XFS don't hold the i_mutex
during direct IO submission.

> This whole i_dio_count thing is pretty nasty, really. If you stand
> back and squint, it's basically an rwsem. I wonder if we can use an
> rwsem...

That doesn't avoid the atomic operations that limit performance.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
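
To make the pattern under discussion concrete, here is a minimal
user-space sketch of the i_dio_count scheme. The names (struct
fake_inode, dio_begin, dio_end, dio_wait) are made up for illustration
and are not the kernel helpers: a direct IO takes a reference at
submission, drops it when the IO completes (which for AIO can be long
after submission returns), and a truncate-style operation waits for
the count to drain.

/*
 * Hypothetical user-space model of i_dio_count; names are invented
 * and this is not the kernel implementation.
 */
#include <stdatomic.h>
#include <stdio.h>

struct fake_inode {
        atomic_int dio_count;   /* stand-in for inode->i_dio_count */
};

static void dio_begin(struct fake_inode *inode)
{
        /* submission path: one reference per in-flight direct IO */
        atomic_fetch_add_explicit(&inode->dio_count, 1,
                                  memory_order_acquire);
}

static void dio_end(struct fake_inode *inode)
{
        /* completion path (dio_complete() for AIO): drop the reference */
        atomic_fetch_sub_explicit(&inode->dio_count, 1,
                                  memory_order_release);
}

static void dio_wait(struct fake_inode *inode)
{
        /* truncate path: wait until all in-flight direct IO has completed */
        while (atomic_load_explicit(&inode->dio_count,
                                    memory_order_acquire))
                ;       /* the kernel sleeps on a waitqueue instead of spinning */
}

int main(void)
{
        struct fake_inode inode = { 0 };

        dio_begin(&inode);      /* IO submitted ...                       */
        dio_end(&inode);        /* ... and completed, possibly much later */
        dio_wait(&inode);       /* now a truncate could safely proceed    */
        printf("i_dio_count drained\n");
        return 0;
}

The two atomic operations per IO in this sketch are the per-IO cost
that the patch removes for block devices, and the wait-for-zero side
is why the decrement cannot happen before AIO completion.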