Re: [PATCH 1/3] xfs: Add rtdefault mount option

> On Sep 1, 2017, at 12:32 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> 
> On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
>> Thanks for the quick feedback Dave!  My comments are in-line below.
>> 
>> 
>>> On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> Hi Richard,
>>> 
>>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
> ...
>>>> add
>>>> support for the more sophisticated AG based block allocator to RT
>>>> (bitmapped version works well for us, but multi-threaded use-cases
>>>> might not do as well).
>>> 
>>> That's a great big can of worms - not sure we want to open it. The
>>> simplicity of the rt allocator is one of it's major benefits to
>>> workloads that require deterministic allocation behaviour...
>> 
>> Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).
>> 
> 
> Just a side point based on the discussion so far... I kind of get the
> impression that the primary reason for using realtime support here is
> for the simple fact that it's a separate physical device. That provides
> a basic mechanism to split files across fast and slow physical storage
> based on some up-front heuristic. The fact that the realtime feature
> uses a separate allocation algorithm is actually irrelevant (and
> possibly a problem in the future).
> 
> Is that an accurate assessment? If so, it makes me wonder whether it's
> worth thinking about if there are ways to get the same behavior using
> traditional functionality. This ignores Dave's question about how much
> of the performance actually comes from simply separating out the log,
> but for example suppose we had a JBOD block device made up of a
> combination of spinning and solid state disks via device-mapper with the
> requirement that a boundary from fast -> slow and vice versa was always
> at something like a 100GB alignment. Then if you formatted that device
> with XFS using 100GB AGs (or whatever to make them line up), and could
> somehow tag each AG as "fast" or "slow" based on the known underlying
> device mapping, could you potentially get the same results by using the
> same heuristics to direct files to particular sets of AGs rather than
> between two physical devices? Obviously there are some differences like
> metadata being spread across the fast/slow devices (though I think we
> had such a thing as metadata only AGs), etc. I'm just handwaving here to
> try and better understand the goal.
> 


Sorry, I forgot to clarify the origins of the performance wins here.  This is obviously very workload dependent (e.g. write/flush/inode-update heavy workloads benefit the most), but for our use case about ~65% of the IOP savings comes from the journal (~1/3 journal writes + slightly less than 1/3 from syncing metadata out of the journal; slightly less because some journal entries get canceled).  The remaining ~1/3 of the win comes from reading small files from the SSD vs. the HDDs (about 25-30% of our file population is <=256k, depending on the cluster).  To be clear, we don't split files: all data blocks of a file live either entirely on the SSD (e.g. small files <=256k) or entirely on the real-time HDD device.  The basic principle here is that larger files MIGHT see small IOPs (in our use case this happens to be rare, but not impossible), but small files always do, and when 25-30% of your population is small... that's a big chunk of your IOPs.

The AG-based approach could work, though it would be a very hard sell to use device-mapper; this isn't code we have ever used in our storage stack.  At our scale there are important operational reasons we need to keep the storage stack simple (fewer bugs to hit), so keeping the solution contained within XFS is a hard requirement for us.

Richard


> Brian
> 
>>> 
>>> Cheers,
>>> 
>>> Dave.
>>> -- 
>>> Dave Chinner
>>> david@xxxxxxxxxxxxx
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html






