On 9/5/17, 8:45 PM, "Dave Chinner" <david@xxxxxxxxxxxxx> wrote:

On Sun, Sep 03, 2017 at 10:02:41PM +0000, Richard Wareing wrote:
>
> > On Sep 3, 2017, at 1:56 AM, Christoph Hellwig
> > <hch@xxxxxxxxxxxxx> wrote:
> >
> > On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing
> > wrote:
> >> - Replaced rtdefault with rtdisable, this yields similar
> >> operational benefits when combined with the existing mkfs time
> >> setting of the inheritance flag on the root directory. Allows
> >> temporary disabling of real-time allocation without having to
> >> walk entire FS to remove flags (which could be time consuming).
> >> I still don't think it's super obvious to an admin the
> >> real-time flag was put there at mkfs time (vs. rtdefault being
> >> in mount flags), but this gets me half of what I'm after.
> >
> > I still don't understand this option. What is the use case of
> > dynamically switching on/off these default to the rt device?
> >
>
> Say you are in a bit of an emergency, and you need IOPs *now*
> (incident recovery), w/ rtdisable you could funnel the IO to the
> SSD

But it /doesn't do that/. It only disables new files from writing to
the rt device. All reads for data in the RT device and writes to
existing files still go to the RT device.

> without having to strip the inheritance bits from all the
> directories (which would require two walks....one to remove and
> one to add them all back). I think this is about having some
> options during incidents, and a "kill-switch" should the need
> arise.

And soon after the kill switch is triggered, your tiny data device
will go ENOSPC because changing that mount option effectively removed
TBs of free space from the filesystem. Then things will really start
going bad. So maybe you didn't think this through properly - the last
thing a typical user would expect is a filesystem reporting TBs of
free space to go ENOSPC and not being able to recover, regardless of
what mount options are present.
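The ENOSPC scenario above can be illustrated with a toy model. This
is only a sketch and not XFS code: the ToyFs class, its field names
and the block counts are all hypothetical. It assumes, as the message
implies, that the free space an admin sees still counts the realtime
device while new files can only be placed on the data device once
realtime allocation is disabled.

```python
from dataclasses import dataclass


@dataclass
class ToyFs:
    """Hypothetical two-device filesystem; sizes are in blocks."""
    data_free: int           # free blocks on the small data (SSD) device
    rt_free: int             # free blocks on the large realtime device
    rt_enabled: bool = True  # stands in for the proposed kill switch

    def total_free(self) -> int:
        # Assumption: reported free space counts both devices.
        return self.data_free + self.rt_free

    def allocate_new_file(self, blocks: int) -> str:
        # With rt allocation enabled, new files land on the rt device.
        if self.rt_enabled and self.rt_free >= blocks:
            self.rt_free -= blocks
            return "rt"
        # With it disabled, only the data device can take new files.
        if not self.rt_enabled and self.data_free >= blocks:
            self.data_free -= blocks
            return "data"
        return "ENOSPC"


# Kill switch thrown: tiny data device, huge realtime device.
fs = ToyFs(data_free=100, rt_free=1_000_000, rt_enabled=False)
results = [fs.allocate_new_file(40) for _ in range(3)]
print(results, fs.total_free())
# prints: ['data', 'data', 'ENOSPC'] 1000020
```

In this model the third file fails even though over a million blocks
are still counted as free, which is the surprise being described.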
And they'll be especially confused when they start looking at inodes
and seeing RT bits set all over the place...

It's just a recipe for confusion, unexpected behaviour and all I see
here is a support and triage nightmare. Not to mention FB will move on
to something else in a couple of years, and we get stuck having to
maintain it forever more (*cough* filestreams *cough*).

Fair enough, what are your thoughts on rtdefault, if I changed it to
*not* set the inheritance bits, but take over this responsibility in
their place? My thinking here is this integrates better than
inheritance bits w/ policy management systems such as Chef/Puppet.
Inheritance bits, on the other hand, don't really lend themselves to
machine level policies; they can be sprinkled about all over the FS,
and a walk would be required to enforce a machine wide policy.

Or instead of a mount option, would a sysfs option be acceptable?

My hope is we don't move on, but collaborate a bit more with the
open-source world on these sorts of problems instead of re-inventing
the proverbial FS wheel (and re-learning old lessons solved many moons
ago by FS developers). Trying to do my part now, show it can be done
and should be done.

> The other problem I see is accessibility and usability. By making
> these decisions buried in more generic XFS allocation mechanisms
> or fcntls, few developers are going to really understand how to
> safely use them (e.g. without blowing up their SSD's WAF or
> endurance).

The whole point of putting them into the XFS allocator as admin
policies is that *application developers don't need to know they
exist*.

I get you now: *admins* need to know, but application developers not
so much.

> Fallocation is a better understood notion, easier to
> use and has wider support amongst existing utilities.

Almost every application I've seen that uses fallocate does something
wrong and/or breaks a longevity or performance optimisation that
filesystems have been making for years.
fallocate is "easy to understand" but *difficult to use optimally*
because its behaviour is tightly bound to the filesystem allocator
algorithms. i.e. it's easy to defeat hidden filesystem optimisations
with fallocate, but it's difficult to understand a sub-optimal corner
case in the filesystem allocator that fallocate could be used to
avoid.

In reality, we don't want people using fallocate - the filesystem
algorithms should do the right thing so people don't need to modify
their applications. In cases like this, having the filesystem decide
automatically at first allocation what device to use is the right way
to integrate the functionality, not requiring users to use fallocate
to trigger such a decision and, as a side effect, preventing the
filesystem from making all the other optimisations they still want it
to make.

You make a good point here, on preventing the FS from making other
optimizations. I'm re-working this as you and others have suggested
(new version tomorrow). And xfs_fsr would be the home for code
migrating the file to the real-time device once it grows beyond some
tunable size.

> Keep in
> mind, we need our SSDs to last >3 years (w/ only a mere 70-80 TBW;
> 300TBW if we are lucky), so we want to design things such that
> application developers are less likely to step on land mines
> causing premature SSD failure.

Hmmm. I don't think the way you are using fallocate is doing what you
think it is doing.

That is, using fallocate to preallocate all files so you can direct
allocation to a different device means that delayed allocation is
turned off. Hence XFS cannot optimise allocation across multiple files
at writeback time. This means that writeback across multiple files
will be sprayed around disjointed preallocated regions. When using
delayed allocation, the filesystem will allocate the blocks for all
the files sequentially and so the block layer will merge them all into
one big contiguous IO.
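The difference between the two layouts described above can be shown
with a small toy model. This is a sketch under simplifying
assumptions, not the XFS allocator: the function names, the fixed
16-block reservation and the one-dimensional block numbering are all
made up for illustration. Each file has a few dirty blocks to write
back; the model places them either at the start of a region reserved
ahead of time or back to back at writeback time, and then counts how
many IOs remain after physically adjacent ranges are merged.

```python
def preallocated_layout(dirty_blocks, reserve):
    """Each file starts its own region of `reserve` blocks, fixed up front."""
    return [(i * reserve, n) for i, n in enumerate(dirty_blocks)]


def delayed_layout(dirty_blocks):
    """Blocks are assigned only at writeback, packing files back to back."""
    layout, next_free = [], 0
    for n in dirty_blocks:
        layout.append((next_free, n))
        next_free += n
    return layout


def merged_ios(layout):
    """Count IOs after merging (start, length) ranges that are adjacent."""
    ios, end = 0, None
    for start, length in sorted(layout):
        if start != end:  # gap before this range: a new IO is needed
            ios += 1
        end = start + length
    return ios


# Four files, each with four dirty blocks to write back.
dirty_blocks = [4, 4, 4, 4]
prealloc_ios = merged_ios(preallocated_layout(dirty_blocks, reserve=16))
delalloc_ios = merged_ios(delayed_layout(dirty_blocks))
print(prealloc_ios, delalloc_ios)
# prints: 4 1
```

With these made-up numbers the preallocated layout needs one IO per
file because of the unwritten gaps between regions, while the delayed
layout collapses into a single contiguous IO.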
IOWs, fallocate sprays write IO around because it decouples
allocation locality from temporal writeback locality, and this causes
non-contiguous write patterns which are a significant contributing
factor to write amplification in SSDs. In comparison, delayed
allocation results in large sequential IOs that minimise write
amplification in the SSD...

Hence the method you describe that "maximises SSD life" won't help -
if anything it's going to actively harm the SSD life when compared to
just letting the filesystem use delayed allocation and choose what
device to write to at that time....

Wrt SSDs you are completely correct on this, our fallocate calls were
intended to pay up front on the write path for more favorable
allocations which pay off during reads on HDDs. For SSDs this clearly
makes less sense, and it is an optimization we will need to make in
our code for the reasons you point out.

Hacking one-off high level controls into APIs like fallocate does not
work. Allocation policies need to be integrated into the filesystem
allocators for them to be effective and useful to administrators and
applications alike. fallocate is no substitute for the black magic
that filesystems do to optimise allocation and IO patterns....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx

Thanks for the great comments, suggestions & insights. Learning a
lot.

Richard