Re: xfs_extent_busy_flush vs. aio

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 7 Feb 2018 12:57:41 +1100

On Tue, Feb 06, 2018 at 04:10:12PM +0200, Avi Kivity wrote:
> 
> On 01/29/2018 11:56 PM, Dave Chinner wrote:
> > On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote:
> > > > There's many reasons this can happen, but the most common is the
> > > > working files in a directory (or subset of directories in the same
> > > > AG) have a combined space usage of larger than an AG ....
> > > That's certainly possible, even likely (one huge directory with all of the
> > > files).
> > > 
> > > This layout is imposed on us by the compatibility gods. Is there a way to
> > > tell XFS to change its policy of one-AG-per-directory?
> > mount with inode32. That rotors files around all AGs in a round
> > robin fashion instead of trying to keep directory locality for a
> > working set. i.e. it distributes the files evenly across the
> > filesystem.
> 
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s09.html
> says:
> 
> "When 32 bit inode numbers are used on a volume larger than 1TB in size,
> several changes occur.
> 
> A 100TB volume using 256 byte inodes mounted in the default inode32 mode has
> just one percent of its space available for allocating inodes.
> 
> XFS will reserve the first 1TB of disk space exclusively for inodes to
> ensure that the imbalance is no worse than this due to file data
> allocations."

s/exclusively//

> Does this mean that a 1.1TB disk has 1TB reserved for inodes and 0.1TB left
> over for data?

No, that would be silly.

> Or is it driven by the "one percent" which is mentioned
> above, so it would be 0.011TB?

No, you're inferring behavioural rules that don't exist from a
simple example.

Maximum number of inodes is controlled by min(imaxpct, free space).
For inode32, "free space" is what's in the first 32 bits of the inode
address space. For inode64, it's global free space.

To enable this, inode32 sets the AGs wholly within the first 32 bits
of the inode address space to be "metadata preferred" and "inode
capable".

Important things to note:

	- "first 32 bits of inode address space" means the range of
	  space that inode32 reserves for inodes changes according
	  to inode size. 256 byte inodes = 1TB, 2kB inodes = 8TB. If
	  the filesystem is smaller than this threshold, then it
	  will silently use the inode64 allocation policy until the
	  filesystem is grown beyond 32 bit inode address space
	  size.

	- "inode capable" means inodes can be allocated in the AG

	- "metadata preferred" means user data will not get
	  allocated in this AG unless all non-preferred AGs are full.
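
The thresholds in the first point follow from simple arithmetic. A
minimal sketch, assuming a simplified model in which 2^32 inode
numbers each address one inode-sized slot (the function name is
hypothetical and this is not how the kernel computes it):

```python
# Simplified model: a 32-bit inode number can address 2**32 inode-sized
# slots, so the covered range scales with the inode size.
INODE_NUMBER_BITS = 32
TIB = 1 << 40


def inode32_threshold_bytes(inode_size: int) -> int:
    """Size of the region reachable with 32-bit inode numbers (hypothetical helper)."""
    return (1 << INODE_NUMBER_BITS) * inode_size


for inode_size in (256, 512, 2048):
    tib = inode32_threshold_bytes(inode_size) // TIB
    print(f"{inode_size:>4} byte inodes -> {tib} TiB")
```

With 256 byte inodes this gives 1 TiB and with 2kB inodes 8 TiB,
matching the figures above.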

So, assuming 256 byte inodes, your 1.1TB fs will have an imaxpct of
~25%, allowing a maximum of 256GB of inodes or about a billion
inodes.  But once you put more than 0.1TB of data into the
filesystem, data will start filling up the inode capable AGs as
well, and then your limit for inodes looks just like inode64 (i.e.
dependent on free space).
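
The numbers in that paragraph can be checked with a rough sketch of
the min(imaxpct, free space) rule. The helper name and the 100 GiB
"free space left" figure are assumptions made up for illustration;
the 256GB imaxpct budget and 256 byte inode size are taken from the
text above:

```python
GIB = 1 << 30
TIB = 1 << 40


def max_inodes(imaxpct_bytes: int, free_bytes: int, inode_size: int) -> int:
    """Inode limit as min(imaxpct, free space), in whole inodes (hypothetical helper)."""
    return min(imaxpct_bytes, free_bytes) // inode_size


inode_size = 256
imaxpct_bytes = 256 * GIB  # ~25% budget quoted above

# Mostly empty filesystem: the inode capable AGs (about 1 TiB) are free,
# so the imaxpct budget is the limiting factor.
print(max_inodes(imaxpct_bytes, 1 * TIB, inode_size))

# Assumed later state: user data has spilled into the inode capable AGs
# and only 100 GiB remains free there, so free space is now the limit.
print(max_inodes(imaxpct_bytes, 100 * GIB, inode_size))
```

The first case works out to 2^30 inodes, i.e. the "about a billion"
above; the second shows the limit tracking free space once data has
filled the metadata preferred AGs.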

IOWs, inode32 limits where and how many inodes you can
create, not how much user data you can write into the filesystem.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx