On Thu, Oct 20, 2016 at 03:32:48PM -0700, Jared D. Cottrell wrote:
> We've been running our Ubuntu 14.04-based, SSD-backed databases with a
> weekly fstrim cron job, but have been finding more and more clusters

Command line for fstrim?

> that are locking all IO for a couple minutes as a result of the job.
> In theory, mounting with discard could be appropriate for our use case
> as file deletes are infrequent and handled in background threads.
> However, we read some dire warnings about using discard on this list
> (http://oss.sgi.com/archives/xfs/2014-08/msg00465.html) that make us
> want to avoid it.

discard is being improved - Christoph posted a patchset a few days ago
that solves many of the XFS-specific issues. It also tries to avoid the
various deficiencies of the underlying infrastructure as much as
possible.

> Is discard still to be avoided at all costs? Are the corruption and
> bricking problems mentioned still something to be expected even with
> the protection of Linux's built-in blacklist of broken SSD hardware?
> We happen to be using Amazon's in-chassis SSDs. I'm sure they use
> multiple vendors but I can't imagine they're taking short-cuts with
> cheap hardware.

Every so often we see a problem that manifests when discard is enabled,
and it goes away when it is turned off. Not just on XFS - there are
similar reports on the btrfs list. It's up to you to decide whether you
use it or not.

> If discard is still strongly discouraged, perhaps we can approach the
> problem from the other side: does the slow fstrim we're seeing sound
> like a known issue? After a bunch of testing and research, we've
> determined the following:
>
> Essentially, XFS looks to be iterating over every allocation group and
> issuing TRIMs for all free extents every time this ioctl is called.
> This, coupled with the facts that Linux's interface to the TRIM
> command is both synchronous and does not support a vectorized list of
> ranges (see: https://github.com/torvalds/linux/blob/3fc9d690936fb2e20e180710965ba2cc3a0881f8/block/blk-lib.c#L112),
> is leading to a large number of extraneous TRIM commands (each of
> which has been observed to be slow, see:
> http://oss.sgi.com/archives/xfs/2011-12/msg00311.html) being issued to
> the disk for ranges that both the filesystem and the disk know to be
> free. In practice, we have seen IO disruptions of up to 2 minutes. I
> realize that the duration of these disruptions may be controller
> dependent. Unfortunately, when running on a platform like AWS, one
> does not have the luxury of choosing specific hardware.

Many issues here, none of which have changed recently.

One of the common misconceptions about discard is that it will improve
performance. People are led to think the "empty drive" SSD performance
is what they should always get, as that is what the manufacturers
quote, not the performance once the drive has been completely written
once. They are also led to believe that running TRIM will restore their
drive to "empty drive" performance. This is not true - for most users,
the "overwritten" performance is what you'll get for the majority of
the life of an active drive, regardless of whether you use TRIM or not.

If you want an idea of how misleading the performance expectations
manufacturers set for their SSDs are, go have a look at the SSD
"performance consistency" tests that are run on all SSDs at
anandtech.com. e.g. Samsung's latest 960 Pro.
Quoted at 360,000 random 4k write iops, it can actually only sustain
25,000 random 4k write iops once the drive has been filled, which only
takes a few minutes to do:

http://www.anandtech.com/show/10754/samsung-960-pro-ssd-review/3

This matches what will happen in the few hours after a TRIM is run on
an SSD under constant write pressure, where the filesystem's used space
pattern at the time of the fstrim was significantly different to the
SSD's used space pattern. i.e. fstrim will free up used space in the
SSD, which means performance will go up and be fast (yay!), but as soon
as the "known free" area is exhausted it will fall into the steady
state where the garbage collection algorithm limits performance.

At this point, running fstrim again won't make any difference to
performance unless new areas of the block device address space have
been freed by the filesystem. This is because the SSD's record of "used
space" still closely matches the filesystem's view of free space. Hence
fstrim will fail to free any significant amount of space in the SSD it
could use to improve performance, and so the SSD remains in the slow
"garbage collection mode" to sustain ongoing writes.

IOWs, fstrim/discard will not restore any significant SSD performance
unless your application has a very dynamic filesystem usage pattern
(i.e. regularly fills and empties the filesystem). That doesn't seem to
be the situation your application is running in ("... our use case [...]
file deletes are infrequent ..."), so maybe you're best to just disable
fstrim altogether?

Put simply: fstrim needs to be considered similarly to online
defragmentation - it can be actively harmful to production workloads
when it is used unnecessarily or inappropriately.

> EXT4, on the other hand, tracks blocks that have been deleted since
> the previous FITRIM ioctl

ext4 tracks /block groups/, not blocks. Freeing a single 4k block in a
128MB block group will mark it for processing on the next fstrim run.
IOWs, if you are freeing blocks all over your filesystem between weekly
fstrim runs, ext4 will behave pretty much identically to XFS.

> and targets subsequent TRIMs to the
> appropriate block ranges (see:
> http://blog.taz.net.au/2012/01/07/fstrim-and-xfs/). In real-world
> tests this significantly reduces the impact of fstrim to the point
> that it is un-noticeable to the database / application.

IMO that's a completely meaningless benchmark/comparison. To start
with, nobody runs fstrim twice in a row on production systems, so
back-to-back behaviour is irrelevant to us. Also, every test is run on
different hardware, so the results simply cannot be compared to each
other. Now, if it were run on the same hardware, with some kind of
significant workload in between runs, it would be slightly more
meaningful. (*)

A lot of the "interwebs knowledge" around discard, fstrim, TRIM, SSD
performance, etc. that you find with google is really just cargo-cult
stuff. What impact fstrim is going to have on your SSDs is largely
workload dependent, and the reality is that a large number of workloads
don't have the dynamic allocation behaviour that allows regular usage
of fstrim to provide a meaningful, measurable and sustained performance
improvement.

So, with all that in mind, the first thing you need to do is gather
measurements to determine if SSD performance is actually improved after
running a weekly fstrim.
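A rough, untested sketch of one way to take that measurement: time a
burst of O_DIRECT random 4k writes against a scratch file on the
filesystem in question, once just before the weekly fstrim window and
once just after it. The scratch file path, file size and write count
below are made-up values, and a real tool like fio will give you far
better numbers - this only shows the shape of the measurement:

/*
 * Untested sketch: time NR_WRITES random 4k O_DIRECT writes to a
 * pre-sized scratch file and report the average write latency.
 * Run it immediately before and after the weekly fstrim and compare.
 * The path, file size and write count are made-up values.
 */
#define _GNU_SOURCE		/* O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define SCRATCH_FILE	"/data/trim-probe.dat"	/* hypothetical path */
#define FILE_SZ		(1024ULL * 1024 * 1024)	/* 1GiB scratch file */
#define IO_SZ		4096
#define NR_WRITES	16384

int main(void)
{
	struct timespec start, end;
	double elapsed;
	void *buf;
	long i;
	int fd;

	fd = open(SCRATCH_FILE, O_CREAT | O_RDWR | O_DIRECT, 0600);
	if (fd < 0 || posix_memalign(&buf, IO_SZ, IO_SZ) ||
	    ftruncate(fd, FILE_SZ) < 0) {
		perror("setup");
		return 1;
	}
	memset(buf, 0x5a, IO_SZ);

	srandom(42);		/* same offsets every run */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < NR_WRITES; i++) {
		off_t off = ((off_t)random() % (FILE_SZ / IO_SZ)) * IO_SZ;

		if (pwrite(fd, buf, IO_SZ, off) != IO_SZ) {
			perror("pwrite");
			return 1;
		}
	}
	fsync(fd);
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed = (end.tv_sec - start.tv_sec) +
		  (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%d x %d byte random writes in %.2fs, avg latency %.0fus\n",
	       NR_WRITES, IO_SZ, elapsed, elapsed * 1e6 / NR_WRITES);

	close(fd);
	return 0;
}

Keep the same scratch file around between runs so you're comparing
like with like, and repeat it over a few weeks so one noisy sample
doesn't mislead you.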
If there's no /significant/ change in IO latency or throughput, then
fstrim is not doing anything useful for you, and you can reduce the
frequency at which you run it, only run it in scheduled maintenance
windows, or simply stop using it.

If there is a significant improvement in IO performance as a result of
running fstrim, then we need to work out why your application is
getting stuck during fstrim. sysrq-w output captured while fstrim is
running and the application is blocking will tell us where the blocking
issue lies (it may not be XFS!), and along with the various information
about your system listed here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

we should be able to determine what is causing the blocking and hence
determine if it's fixable or not....

Cheers,

Dave.

(*) Even this test will probably still come out in favour of ext4
because of empty filesystem allocation patterns. i.e. ext4 allocation
is all nice and compact until you dirty all the block groups in the
filesystem; then the allocation patterns become scattered and
non-deterministic. At that point, typical data intensive workloads will
always dirty a significant proportion of the block groups in the
filesystem, and fstrim behaviour becomes much more like XFS.

XFS's behaviour does not change with workloads - it only changes as
free space patterns change. Hence it should be roughly consistent and
predictable for a given free space pattern, regardless of the workload
or the age of the filesystem.

-- 
Dave Chinner
david@xxxxxxxxxxxxx