Re: [PATCH v2 0/3] XFS real-time device tweaks

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 6 Sep 2017 07:43:05 -0400

On Wed, Sep 06, 2017 at 06:54:41AM +0000, Richard Wareing wrote:
> 
> On 9/5/17, 8:45 PM, "Dave Chinner" <david@xxxxxxxxxxxxx> wrote:
> 
>     On Sun, Sep 03, 2017 at 10:02:41PM +0000, Richard Wareing wrote:
>     > 
>     > > On Sep 3, 2017, at 1:56 AM, Christoph Hellwig
>     > > <hch@xxxxxxxxxxxxx> wrote:
>     > > 
>     > > On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing
>     > > wrote:
>     > >> - Replaced rtdefault with rtdisable, this yields similar
>     > >> operational benefits when combined with the existing mkfs time
>     > >> setting of the inheritance flag on the root directory.  Allows
>     > >> temporary disabling of real-time allocation without having to
>     > >> walk entire FS to remove flags (which could be time consuming).
>     > >> I still don't think it's super obvious to an admin the
>     > >> real-time flag was put there at mkfs time (vs. rtdefault being
>     > >> in mount flags), but this gets me half of what I'm after.
>     > > 
>     > > I still don't understand this option.  What is the use case of
>     > > dynamically switching on/off these default to the rt device?
>     > > 
>     > 
>     > Say you are in a bit of an emergency, and you need IOPs *now*
>     > (incident recovery), w/ rtdisable you could funnel the IO to the
>     > SSD
>     
>     But it /doesn't do that/. It only disables new files from writing to
>     the rt device. All reads for data in the RT device and writes to
>     existing files still go to the RT device.
>     
>     
>     > without having to strip the inheritance bits from all the
>     > directories (which would require two walks....one to remove and
>     > one to add them all back).    I think this is about having some
>     > options during incidents, and a "kill-switch" should the need
>     > arise.
>     
>     And soon after the kill switch is triggered, your tiny data device
>     will go ENOSPC because changing that mount option effectively removed
>     TBs of free space from the filesystem. Then things will really start
>     going bad.
>     
>     So maybe you didn't think this through properly - the last thing a
>     typical user would expect is a filesystem reporting TBs of free
>     space to go ENOSPC and not being able to recover, regardless of what
>     mount options are present. And they'll be especially confused when
>     they start looking at inodes and seeing RT bits set all over the
>     place...
>     
>     It's just a recipe for confusion, unexpected behaviour and all I
>     see here is a support and triage nightmare. Not to mention FB will
>     move on to something else in a couple of years, and we get stuck
>     having to maintain it forever more (*cough* filestreams *cough*).
>     
> Fair enough, what are your thoughts on rtdefault, if I changed it to
> *not* set the inheritance bits, but take over this responsibility in
> their place?  My thinking here is this integrates better than
> inheritance bits w/ policy management systems such as Chef/Puppet.
> Inheritance bits, on the other hand, don't really lend themselves to
> machine level policies; they can be sprinkled about all over the FS,
> and a walk would be required to enforce a machine wide policy.
> 
> Or instead of a mount option, would a sysfs option be acceptable?
> 
> My hope is we don't move on, but collaborate a bit more with the
> open-source world on these sorts of problems instead of re-inventing
> the proverbial FS wheel (and re-learning old lessons solved many moons
> ago by FS developers).  Trying to do my part now, show it can be done
> and should be done.
> 

FWIW, I'm still a little confused as to the need for this mechanism.
What exactly is the use case for 1.) your specific environment and 2.)
to a traditional realtime user?

Something like rtdefault (or an rtro option for realtime readonly
behavior) seems a bit more generic to me if one wanted broad control
over the feature, but your fallocate mount thingy seems to already
accomplish that. I.e., if you made that thing set/clear RT on individual
files based purely on file size and you had a need to quickly disable
setting RT on new files, why can't you just remount without that option?
It seems to me you wouldn't need to care about the RT inherit flag
either way..?
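
For anyone following along who wants to see which of the two flags
(per-file realtime vs. the directory inherit flag) is actually set on a
given inode, here is a minimal sketch in Python. The flag values and
struct layout are taken from <linux/fs.h>; the hard-coded ioctl number
assumes the generic _IOR encoding used on x86 and arm64, and the helper
names are hypothetical, made up for this illustration.

```python
import fcntl
import os
import struct

# From <linux/fs.h>. The ioctl number assumes the generic _IOR('X', 31, ...)
# encoding (x86, arm64); other architectures may encode it differently.
FS_IOC_FSGETXATTR = 0x801C581F
FS_XFLAG_REALTIME = 0x00000001   # data lives on the realtime device
FS_XFLAG_RTINHERIT = 0x00000100  # new files in this directory inherit realtime

# struct fsxattr: five __u32 fields followed by 8 bytes of padding (28 bytes).
FSXATTR_FORMAT = "=5I8x"


def unpack_xflags(buf):
    """Return fsx_xflags, the first field of a packed struct fsxattr."""
    return struct.unpack(FSXATTR_FORMAT, buf)[0]


def describe_rt_flags(xflags):
    """Report which realtime-related flags are set in an xflags word."""
    return {
        "realtime": bool(xflags & FS_XFLAG_REALTIME),
        "rtinherit": bool(xflags & FS_XFLAG_RTINHERIT),
    }


def read_rt_flags(path):
    """Query a file or directory via FS_IOC_FSGETXATTR (hypothetical helper)."""
    buf = bytearray(struct.calcsize(FSXATTR_FORMAT))
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FS_IOC_FSGETXATTR, buf)  # kernel fills buf in place
    finally:
        os.close(fd)
    return describe_rt_flags(unpack_xflags(bytes(buf)))
```

Calling read_rt_flags() on a directory and on a file inside it would
show the inherit flag on the former and the realtime flag on the latter,
assuming the filesystem supports the ioctl at all.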

>     > The other problem I see is accessibility and usability.  By making
>     > these decisions buried in more generic XFS allocation mechanisms
>     > or fcntl's, few developers are going to really understand how to
>     > safely use them (e.g. without blowing up their SSD's WAF or
>     > endurance). 
>     
>     The whole point of putting them into the XFS allocator as admin
>     policies is that *applications developers don't need to know they
>     exist*.
>     
> I get you now: *admins* need to know, but application developers not so much.
> 
>     > Fallocation is a better understood notion, easier to
>     > use and has wider support amongst existing utilities.
>     
>     Almost every application I've seen that uses fallocate does
>     something wrong and/or breaks a longevity or performance
>     optimisation that filesystems have been making for years. 
>     
>     fallocate is "easy to understand" but *difficult to use optimally*
>     because its behaviour is tightly bound to the filesystem allocator
>     algorithms. i.e. it's easy to defeat hidden filesystem optimisations
>     with fallocate, but it's difficult to understand a sub-optimal
>     corner case in the filesystem allocator that fallocate could be used
>     to avoid.
>     
>     In reality, we don't want people using fallocate - the filesystem
>     algorithms should do the right thing so people don't need to modify
>     their applications. In cases like this, having the filesystem decide
>     automatically at first allocation what device to use is the right
>     way to integrate the functionality, not require users to use
>     fallocate to trigger such a decision and, as a side effect, prevent
>     the filesystem from making all the other optimisations they still
>     want it to make.
> 
> You make a good point here, on preventing the FS from making other
> optimizations.  I'm re-working this as you and others have suggested
> (new version tomorrow).
> 
> And xfs_fsr would be the home for code migrating the file to the
> real-time device once it grows beyond some tunable size.
> 

I pretty much agree with everything Dave says here, along with
Christoph's previous suggestion that this is better off in the allocator
than in the fallocate path. In the end, I think your current environment
won't know the difference because you fallocate everything up front
anyways (notwithstanding Dave's explanation as to why that might not be
the greatest idea, however). In fact, I think this would be much more
interesting overall if we could tier per-extent allocation rather than
per-file, but that of course is one of the limitations of using RT.
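
To make Dave's point about up-front fallocate concrete (his explanation
is quoted further down), here is a toy model in Python. It is only a
sketch under loud assumptions: it is not the XFS allocator, every file
dirties the same number of blocks in one writeback pass, and the
function names are invented for illustration. It counts how many
separate IOs remain once adjacent extents are merged.

```python
def merge_runs(extents):
    """Merge (start, length) extents that touch end-to-start, roughly as a
    block layer merging adjacent requests would, and return the runs."""
    runs = []
    for start, length in sorted(extents):
        if runs and runs[-1][0] + runs[-1][1] == start:
            # This extent begins exactly where the previous run ends.
            runs[-1] = (runs[-1][0], runs[-1][1] + length)
        else:
            runs.append((start, length))
    return runs


def preallocated_writeback(n_files, prealloc_blocks, dirty_blocks):
    """Each file's blocks were fixed at fallocate time, so one writeback
    pass touches the head of n_files separate preallocated regions."""
    return [(i * prealloc_blocks, dirty_blocks) for i in range(n_files)]


def delalloc_writeback(n_files, dirty_blocks):
    """Blocks are assigned at writeback time, so files flushed together
    are handed consecutive blocks (toy assumption)."""
    extents = []
    next_free = 0
    for _ in range(n_files):
        extents.append((next_free, dirty_blocks))
        next_free += dirty_blocks
    return extents


if __name__ == "__main__":
    prealloc = merge_runs(preallocated_writeback(8, 256, 16))
    delalloc = merge_runs(delalloc_writeback(8, 16))
    print(len(prealloc), "IOs with preallocation")
    print(len(delalloc), "IO with delayed allocation")
```

With 8 files, 256-block preallocations and 16 dirty blocks each, the
preallocated case stays as 8 disjoint IOs while the delayed allocation
case collapses into a single 128-block IO. The two cases only converge
in this model when every file dirties its whole preallocation.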

That said, while the implementation improvement makes sense, I'm still
not necessarily convinced that this has a place in the upstream realtime
feature. I'll grant you that I'm not terribly familiar with the
historical realtime use case.. Dave, do you see value in such a
heuristic as it relates to the realtime feature (not this tiering
setup)? Is there necessarily a mapping between a large file size and a
file that should be tagged realtime? E.g., I suppose somebody who is
using traditional realtime (i.e., no SSD) and has a mix of legitimate
realtime (streaming media) files and large sparse virt disk images or
something of that nature would need to know to not use this feature
(i.e., this requires documentation)..?

Brian

>     > Keep in
>     > mind, we need our SSDs to last >3 years (w/ only a mere 70-80 TBW;
>     > 300TBW if we are lucky), so we want to design things such that
>     > application developers are less likely to step on land mines
>     > causing premature SSD failure.
>     
>     Hmmm. I don't think the way you are using fallocate is doing what
>     you think it is doing.
>     
>     That is, using fallocate to preallocate all files so you can direct
>     allocation to a different device means that delayed allocation is
>     turned off. Hence XFS cannot optimise allocation across multiple
>     files at writeback time. This means that writeback across multiple
>     files will be sprayed around disjointed preallocated regions. When
>     using delayed allocation, the filesystem will allocate the blocks
>     for all the files sequentially and so the block layer will merge
>     them all into one big contiguous IO.
>     
>     IOWs, fallocate sprays write IO around because it decouples
>     allocation locality from temporal writeback locality and this causes
>     non-contiguous write patterns which are a significant contributing
>     factor to write amplification in SSDs.  In comparison, delayed
>     allocation results in large sequential IOs that minimise write
>     amplification in the SSD...
>     
>     Hence the method you describe that "maximises SSD life" won't help
>     - if anything it's going to actively harm the SSD life when
>     compared to just letting the filesystem use delayed allocation and
>     choose what device to write to at that time....
> 
> WRT SSDs you are completely correct on this: our fallocate calls were
> intended to pay up front on the write path for more favorable
> allocations which pay off during reads on HDDs.  For SSDs this clearly
> makes less sense, and it is an optimization we will need to make in our
> code for the reasons you point out.
> 
>     Hacking one-off high level controls into APIs like fallocate does
>     not work. Allocation policies need to be integrated into the
>     filesystem allocators for them to be effective and useful to
>     administrators and applications alike. fallocate is no substitute for
>     the black magic that filesystems do to optimise allocation and IO
>     patterns....
>     
>     Cheers,
>     
>     Dave.
>     -- 
>     Dave Chinner
>     david@xxxxxxxxxxxxx
>     
> 
> Thanks for the great comments, suggestions & insights.  Learning a lot.
> 
> Richard
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html