On Tue, Dec 08, 2015 at 03:52:52PM +0200, Avi Kivity wrote:
> >>>With the way the XFS allocator works, it fills AGs from lowest to
> >>>highest blocks, and if you free lots of space down low in the AG
> >>>then that tends to get reused before the higher offset free space.
> >>>Hence the way XFS allocates space in the above workload would
> >>>result in roughly 1/3rd of the LBA space associated with the
> >>>filesystem remaining unused. This is another allocator behaviour
> >>>designed for spinning disks (to keep the data on the faster outer
> >>>edges of drives) that maps very well to internal SSD
> >>>allocation/reclaim algorithms....
> >>
> >>Cool. So we'll keep fstrim usage to daily, or something similarly
> >>low.
> >
> >Well, it's something you'll need to monitor to determine what the
> >best frequency is, as even fstrim doesn't come for free (esp. if the
> >storage does not support queued TRIM commands).
>
> I was able to trigger a load where discard caused io_submit to sleep
> even on my super-fast nvme drive.
>
> The bad news is, disabling discard and running fstrim in parallel
> with this load also caused io_submit to sleep.

Well, yes. fstrim is not a magic bullet that /prevents/ discard from
interrupting your application's IO - it's just a method by which the
impact can be /somewhat controlled/, because it can be scheduled for
periods when the interruption will matter least (e.g. when load is
likely to be light, such as at 3am just before the nightly backups are
run).

Regardless, it sounds like your steady state load could be described
as "throwing as much IO as we possibly can at the device", and you
then hit "blocking trouble" when expensive maintenance operations like
TRIM need to be run.

I'm not sure this blocking can be prevented completely, because
preventing it assumes you have a device of infinite IO capacity. That
is, if you exceed the device's command queue depth and the IO
scheduler request queue depth, the block layer will block in the IO
scheduler waiting for a request queue slot to come free. Put simply:
if you overload the IO subsystem, it will block.

There's nothing we can do in the filesystem about this - this is the
way the block layer works, and it's architected this way to provide
the necessary feedback control for buffered write IO throttling and
other congestion control mechanisms in the kernel.

Sure, you can set the IO scheduler request queue depth to be really
deep to avoid blocking, but that simply increases your average and
worst-case IO latency in overload situations. At some point you have
to accept that the IO subsystem is overloaded and the application
driving it needs to back off. Something has to block when this
happens...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
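
As an aside, fstrim is essentially a thin wrapper around the FITRIM
ioctl, so a scheduled trim job boils down to something like the sketch
below (the mountpoint "/data" is illustrative):

/* Minimal FITRIM sketch -- roughly what fstrim(8) does.
 * Build: cc -o trim trim.c
 * The mountpoint "/data" is illustrative.
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FITRIM, struct fstrim_range */

int main(void)
{
        int fd = open("/data", O_RDONLY);   /* any fd on the fs */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct fstrim_range range = {
                .start  = 0,
                .len    = UINT64_MAX,  /* trim the whole filesystem */
                .minlen = 0,           /* let the fs pick the minimum */
        };

        /* This can run for a long time and emit a burst of discard
         * IO, which is exactly why it gets scheduled for quiet
         * periods. On return, range.len holds the bytes trimmed. */
        if (ioctl(fd, FITRIM, &range) < 0)
                perror("FITRIM");
        else
                printf("trimmed %llu bytes\n",
                       (unsigned long long)range.len);

        close(fd);
        return 0;
}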
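
The io_submit() sleeping that Avi reports can be reproduced with a
plain libaio write loop; a minimal sketch, assuming an illustrative
file path and sizes (link with -laio):

/* Sketch: AIO writes via libaio. io_submit() itself can sleep when
 * the block layer has no free request slots (e.g. while discards are
 * in flight).  Build: cc -O2 -o aio aio.c -laio
 * "/data/testfile" and the sizes are illustrative.
 */
#define _GNU_SOURCE             /* O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <libaio.h>

#define BLK   4096
#define DEPTH 64

int main(void)
{
        int fd = open("/data/testfile",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx = 0;
        int ret = io_setup(DEPTH, &ctx);
        if (ret < 0) { fprintf(stderr, "io_setup: %d\n", ret); return 1; }

        void *buf;
        if (posix_memalign(&buf, BLK, BLK))   /* O_DIRECT alignment */
                return 1;

        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        for (long long off = 0; off < 1LL << 30; off += BLK) {
                io_prep_pwrite(&cb, fd, buf, BLK, off);

                /* Nominally "async", but this call blocks once the
                 * device and scheduler queues are full -- the
                 * behaviour discussed above. */
                ret = io_submit(ctx, 1, cbs);
                if (ret != 1) { fprintf(stderr, "io_submit: %d\n", ret); break; }

                if (io_getevents(ctx, 1, 1, &ev, NULL) < 0)
                        break;
        }

        io_destroy(ctx);
        close(fd);
        return 0;
}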
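
And for the request queue depth mentioned above: on Linux it is
exposed per-device as the nr_requests sysfs knob. A trivial sketch of
inspecting and raising it (the device name and the value 512 are
illustrative, and the latency trade-off described above still
applies):

/* Sketch: read and raise the IO scheduler request queue depth.
 * Equivalent to: echo 512 > /sys/block/nvme0n1/queue/nr_requests
 * Device name and the value 512 are illustrative.
 */
#include <stdio.h>

int main(void)
{
        const char *path = "/sys/block/nvme0n1/queue/nr_requests";
        unsigned depth = 0;

        FILE *f = fopen(path, "r");
        if (!f) { perror("fopen"); return 1; }
        if (fscanf(f, "%u", &depth) == 1)
                printf("current nr_requests: %u\n", depth);
        fclose(f);

        f = fopen(path, "w");            /* needs root */
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "512");
        if (fclose(f) != 0)              /* write happens on flush */
                perror("fclose");
        return 0;
}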