Re: [PATCH v2 0/3] XFS real-time device tweaks

> On Sep 3, 2017, at 1:56 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> 
> On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing wrote:
>> - Replaced rtdefault with rtdisable, this yields similar operational
>> benefits when combined with the existing mkfs time setting of the inheritance
>> flag on the root directory.  Allows temporary disabling of real-time allocation
>> without having to walk entire FS to remove flags (which could be time consuming).
>> I still don't think it's super obvious to an admin the real-time flag was put
>> there at mkfs time (vs. rtdefault being in mount flags), but this gets me
>> half of what I'm after.
> 
> I still don't understand this option.  What is the use case of
> dynamically switching on/off these default to the rt device?
> 

Say you are in a bit of an emergency and need IOPS *now* (incident recovery).  With rtdisable you could funnel the IO to the SSD without having to strip the inheritance bits from all the directories, which would require two walks: one to remove them and one to add them all back.  I think this is about having some options during incidents, and a "kill-switch" should the need arise.

>> - rtfallocmin no changes, need to think more about this.  Still a pretty big
>> fan of this option for reasons already stated; at least until a more elegant
>> solution such as preferred AGs (we'd need a tunable size for the "preferred"
>> AG, since our SSD partitions are a fraction of the size of a normal AG) can 
>> be implemented.  The only other idea I have is to make a new ioctl e.g. 
>> "norealtime", which causes the RT bits to stay cleared regardless of 
>> inheritance bits on the containing directory.  This would allow the 
>> "steering" of files to the data device (e.g. SSD); this is probably a safer 
>> design than defaulting to SSD and steering to the HDD via the realtime ioctl.  
> 
> Jens just added a nice new fcntl to declare the life time of write
> streams (and in theory can add other I/O hints).
> 
> How about a mount option that moves all I/O with a given hint
> to the RT device?  E.g. rt=longlife would direct I/O on a file
> with an rw hint of RWH_WRITE_LIFE_LONG or RWH_WRITE_LIFE_EXTREME to the
> RT subvolume as long as there aren't any previous extents.

You seem to trust application developers more than I do :).  The problem I see with using the lifetime or the allocation size as a hint is that a user could later append to the file and fill up the SSD.  A "norealtime" or fallocation request is a bit more explicit and high signal about the intent than the lifetime or allocation size alone.  It's possible that I happen to trickle writes into a file which may ultimately become very, very large (e.g. logging), or that I introduce a performance or buffering bug which triggers smaller writes (allocations) and altered write lifetimes.  With fallocmin this won't happen, as the relationship here is clear: you are declaring your intent to write a file of N bytes, and based on that we promote or demote you to the appropriate tier of storage.
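
To make the fallocmin relationship concrete, here is a minimal sketch in Python (purely illustrative, not the patch itself).  The names `Tier` and `pick_tier` are hypothetical, and two things are assumptions rather than anything stated in this mail: that requests at or above the threshold go to the RT device while smaller ones stay on the data device, and that a file with no fallocate call at all defaults to the RT device.

```python
from enum import Enum


class Tier(Enum):
    DATA = "data device (SSD in this setup)"
    REALTIME = "realtime device (HDD in this setup)"


def pick_tier(fallocate_bytes, rtfallocmin_bytes):
    """Pick a tier from the size the application declared via fallocate.

    Hypothetical helper. Assumption: sizes at or above the threshold are
    sent to the realtime device, smaller ones stay on the data device.
    """
    if fallocate_bytes is None:
        # Assumption: no fallocate means no declared intent, so the file
        # is kept off the SSD by default.
        return Tier.REALTIME
    if fallocate_bytes >= rtfallocmin_bytes:
        return Tier.REALTIME
    return Tier.DATA
```

The point of keying on the fallocate size is that the decision is made once, from an explicit declaration, rather than inferred from write patterns that can drift over the life of the file.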

The other problem I see is accessibility and usability.  If these decisions are buried in more generic XFS allocation mechanisms or fcntls, few developers are going to really understand how to use them safely (e.g. without blowing up their SSD's WAF or endurance).  Fallocation is a better understood notion, is easier to use, and has wider support amongst existing utilities.  Keep in mind that we need our SSDs to last >3 years (with a mere 70-80 TBW; 300 TBW if we are lucky), so we want to design things such that application developers are less likely to step on land mines that cause premature SSD failure.

Whatever the ultimate solution here, it should be designed such that it's relatively difficult to accidentally write data to the non-RT device (e.g. SSD in our case); intent must be clear and high signal.  Hence my similar "high-signal" bias in my first patchset with rtdefault: sure, the inheritance bits should be there if somebody mkfs'd, but if somehow they were removed, it could wind up costing tens of millions of dollars in reduced SSD write lifetime at our scale.  An explicit mount option makes me sleep better at night: things like chef/cfengine can enforce it through traditional policy mechanisms, and removing the behavior has a higher bar (remount + change chef/cfengine) than a trivial call to xfs_io.  From a production engineering/reliability standpoint the design decision is pretty clear: inheritance bits are nearly unenforceable with policy engines such as cfengine or chef (somebody/something could remove a bit buried in the FS and you'd find it only by walking the entire FS), and as a result they are bombs waiting to go off compared to the rtdefault flag.
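
As an illustration of why a mount option is cheap to enforce, a policy check only has to look at one line of the mount table instead of walking the FS.  The sketch below is Python and hypothetical throughout: `mount_has_option` is a made-up helper, the sample line is invented, and "rtdefault" is the option proposed in the first patchset, so whether and how it would actually show up in /proc/mounts is an assumption.

```python
def mount_has_option(mounts_text, mount_point, option):
    """Return True if the mount at mount_point lists the given option.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    found = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        # Later entries for the same mount point shadow earlier ones,
        # so keep the result from the last matching line.
        found = option in fields[3].split(",")
    return found


# Invented sample line; "rtdefault" is the proposed (hypothetical) option.
SAMPLE = "/dev/sda1 /data xfs rw,noatime,rtdefault 0 0\n"
print(mount_has_option(SAMPLE, "/data", "rtdefault"))
```

The equivalent check for inheritance bits would need to visit every directory, which is the enforcement gap described above.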

I'd ideally like to take things even further with fallocmin and fall back to the RT device should the non-RT device fill up (after subtracting some % of space for metadata); this brings the behavior more along the lines of a "preferred" device than a must-have one.
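
A rough sketch of that "preferred device" behavior, again in Python and purely illustrative.  `pick_device`, its parameters, and the 10% default reserve are all hypothetical; the mail only says "some %" and does not specify how free space would be accounted.

```python
def pick_device(request_bytes, rtfallocmin_bytes, data_free_bytes,
                data_total_bytes, metadata_reserve_pct=10):
    """Prefer the data device for small requests, fall back to RT when full.

    Hypothetical helper. Assumptions: requests at or above rtfallocmin go
    to the RT device, and a fixed percentage of the data device is held
    back for metadata.
    """
    if request_bytes >= rtfallocmin_bytes:
        return "realtime"
    # Space on the data device that file data may use after the reserve.
    reserve = data_total_bytes * metadata_reserve_pct // 100
    usable = data_free_bytes - reserve
    if request_bytes > usable:
        # Data device is effectively full: fall back rather than fail.
        return "realtime"
    return "data"
```

With a 1000-byte data device and the default reserve, a 10-byte request lands on the data device while 500 bytes are free, but falls back to the RT device once only 105 bytes remain.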


> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
