Re: XFS on top of LVM span in AWS. Stripe or are AG's good enough?

Eric Sandeen <sandeen@xxxxxxxxxxx> · Wed, 17 Aug 2016 12:26:26 -0500

On 8/17/16 11:23 AM, Jeff Gibson wrote:
> Thanks for the great info guys.
> 
> Sorry to beat a dead horse here.  Just to be absolutely clear-
> 
>>> I guess what I'm trying to ask is - will XFS *indirectly* compensate
>>> if one subvolume is busier?  For example, if writes to a "slow"
>>> subvolume and resident AGs take longer to complete, will XFS tend to
>>> prefer to use other less-busy AGs more often (with the exception of
>>> locality) for writes?  What is the basic algorithm for determining
>>> where new data is written?  In load-balancer terms, does it
>>> round-robin, pick the least busy, etc?
>>  
>> xfs has no notion of fast vs slow regions.  See above for the basic
>> algorithm; it's round-robin for new directories, keep inodes and blocks
>> near their parent if possible.  

> So if one EBS LVM subvolume has subpar performance it will basically
> slow down writes to the whole XFS volume.  XFS doesn't have any
> notion of a queue per AG or any other mechanism for compensating
> uneven performance of AGs.

It will slow down writes to blocks in that block device.
If those blocks gate other IO (e.g. core metadata structures, maybe
the log), then it could conceivably have an fs-wide impact.

For example:

If file "foo" has 100 blocks allocated in a slow-responding volume,
writing to those 100 blocks would only slow down that write.

If the log is allocated in a slow-responding volume and a workload
is log-bound, then it could have an fs-wide impact.

Again, xfs has no fast/slow notion.  There is no compensation.
IO queues are below the filesystem; there is no IO queue per AG.

>> There are a few other smaller-granularity
>> heuristics related to stripe geometry as well.

> Oh, cool.  Since I'm considering stripe vs. linear for the LVM volume, I'd be very interested in what these are.

Simple things like allocating files on stripe boundaries if possible.

So if you have a 64k stripe unit, it would try (IIRC) to allocate files
(at least larger files; I don't remember the details for sure) on 64k
boundaries.

If you're keen to look at code, m_dalign is the stripe unit for the
fs. You'll find things like:

                /*
                 * Round up the allocation request to a stripe unit
                 * (m_dalign) boundary if the file size is >= stripe unit
                 * size, and we are allocating past the allocation eof.
                 *
                 * If mounted with the "-o swalloc" option the alignment is
                 * increased from the strip unit size to the stripe width.
                 */

or for inode allocation:

                /*
                 * Set the alignment for the allocation.
                 * If stripe alignment is turned on then align at stripe unit
                 * boundary.
                 * If the cluster size is smaller than a filesystem block
                 * then we're doing I/O for inodes in filesystem block size
                 * pieces, so don't need alignment anyway.
                 */
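
As a rough illustration of what those two comments describe, here is a
minimal userspace sketch. It is not the kernel code: the function names,
parameters, and units (filesystem blocks) are hypothetical, the size check
is assumed to be made against whichever alignment was chosen, and the final
fallback in the inode case (align to the inode cluster) is an assumption
that the quoted comment does not spell out.

```c
#include <stdint.h>

/* Round x up to the next multiple of align; align == 0 means "no alignment". */
static uint64_t round_up_to(uint64_t x, uint64_t align)
{
    if (align == 0)
        return x;
    return ((x + align - 1) / align) * align;
}

/*
 * File data (hypothetical helper): use the stripe width when "swalloc"
 * is set, otherwise the stripe unit, and only align when the file is at
 * least that large. All values are in filesystem blocks.
 */
static uint64_t align_eof_alloc(uint64_t end_fsb, uint64_t isize_fsb,
                                uint64_t sunit_fsb, uint64_t swidth_fsb,
                                int swalloc)
{
    uint64_t align = (swalloc && swidth_fsb) ? swidth_fsb : sunit_fsb;

    if (align == 0 || isize_fsb < align)
        return end_fsb;             /* small file or no stripe geometry */
    return round_up_to(end_fsb, align);
}

/*
 * Inode allocation (hypothetical helper): returns the alignment in
 * filesystem blocks for a new inode chunk.
 */
static uint32_t inode_alloc_alignment(int stripe_align, uint32_t sunit_fsb,
                                      uint32_t cluster_bytes,
                                      uint32_t block_bytes)
{
    if (stripe_align && sunit_fsb)
        return sunit_fsb;           /* align at stripe unit boundary */
    if (cluster_bytes <= block_bytes)
        return 1;                   /* block-sized inode IO: no alignment */
    return cluster_bytes / block_bytes; /* assumed: align to the cluster */
}
```

With a 64k stripe unit and 4k filesystem blocks, sunit_fsb is 16, so an
allocation ending at block 100 in a 100-block file would be rounded up to
block 112, or to 128 with swalloc and a 4-unit (64-block) stripe width.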

-Eric
