On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
> On Wed, Nov 16, 2016 at 12:30:09PM +1100, Dave Chinner wrote:
> >On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
> >>On 11/15/2016 12:54 AM, Dave Chinner wrote:
> >>>On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
> >>>>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> >>>There have been 1.2 million inodes reclaimed from the cache, but
> >>>there have only been 20,000 dirty inode buffer writes. Yes, that's
> >>>written 440,000 dirty inodes - the inode write clustering is
> >>>capturing about 22 inodes per write - but the inode writeback load
> >>>is minimal at about 10 IO/s. XFS inode reclaim is not blocking
> >>>significantly on dirty inodes.
> >>
> >>I think our machines are different enough that we're not seeing the
> >>same problems. Or at least we're seeing different sides of the
> >>problem.
> >>
> >>We have 130GB of ram and on average about 300-500MB of XFS slab,
> >>total across all 15 filesystems. Your inodes are small and cuddly,
> >>and I'd rather have more than less. I see more with simoop than we
> >>see in prod, but either way it's a reasonable percentage of system
> >>ram considering the horrible things being done.
> >
> >So I'm running on 16GB RAM and have 100-150MB of XFS slab.
> >Percentage wise, the inode cache is a larger portion of memory than
> >in your machines. I can increase the number of files to increase it
> >further, but I don't think that will change anything.
>
> I think the way to see what I'm seeing would be to drop the number
> of IO threads (-T) and bump both -m and -M. Basically less inode
> working set and more memory working set.

If I increase -m/-M by any non-trivial amount, the test OOMs within a
couple of minutes of starting, even after cutting the number of IO
threads in half. I've managed to increase -m by 10% without OOM - I'll
keep trying to increase this part of the load as much as I can as I
refine the patchset I have.
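FWIW, the clustering figure quoted above is just the trace stats divided
out. A quick back-of-the-envelope sketch (the ~2000s sample window is my
inference from the ~10 IO/s figure, not a number taken from the trace):

```python
# Back-of-the-envelope check of the inode reclaim stats quoted above.
cluster_writes = 20_000   # dirty inode buffer writes
dirty_inodes = 440_000    # dirty inodes written back
io_rate = 10              # approximate inode writeback IO/s

# Inode write clustering: inodes captured per buffer write.
print(dirty_inodes / cluster_writes)   # 22.0

# Implied sample window for those writes at ~10 IO/s
# (an inference, not something measured directly).
print(cluster_writes / io_rate)        # 2000.0
```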
> >>Both patched (yours or mine) and unpatched, XFS inode reclaim is
> >>keeping up. With my patch in place, tracing during simoop does
> >>show more kswapd prio=1 scanning than unpatched, so I'm clearly
> >>stretching the limits a little more. But we've got 30+ days of
> >>uptime in prod on almost 60 machines. The oom rate is roughly in
> >>line with v3.10, and miles better than v4.0.
> >
> >IOWs, you have a workaround that keeps your production systems
> >running. That's fine for your machines that are running this load,
> >but it's not working well for any of the other loads I've looked
> >at. That is, removing the throttling from the XFS inode shrinker
> >causes instability and adverse reclaim of the inode cache in
> >situations where maintaining a working set in memory is required
> >for performance.
>
> We agree on all of this much more than not. Josef has spent a lot
> of time recently on shrinkers (w/btrfs but the ideas are similar),
> and I'm wrapping duct tape around workloads until the overall
> architecture is less fragile.
>
> Using slab for metadata in an FS like btrfs where dirty metadata is
> almost unbounded is a huge challenge in the current framework. Ext4
> is moving to dramatically bigger logs, so it would eventually have
> the same problems.

Your 8TB XFS filesystems will be using 2GB logs (unless the mkfs
settings were tweaked manually), so there's a huge amount of metadata
that 15x8TB XFS filesystems can pin in memory, too...

> >Indeed, one of the things I noticed with the simoop workload
> >running the shrinker patches is that it no longer kept either the
> >inode cache or the XFS metadata cache in memory long enough for the
> >du to run without requiring IO. i.e. the caches no longer maintained
> >the working set of objects needed to optimise a regular operation,
> >and the du scans took a lot longer.
>
> With simoop, du is supposed to do IO.
> It's crazy to expect to be
> able to scan all the inodes on a huge FS (or 15 of them) and keep it
> all in cache along with everything else hadoop does. I completely
> agree there are cases where having the working set in ram is valid,
> just simoop isn't one ;)

Sure, I was just pointing out that even simoop was seeing significant
changes in cache residency as a result of this change....

> >That's why removing the blocking from the shrinker causes the
> >overall work rate to go down - it results in the cache not
> >maintaining a working set of inodes, which increases the IO load,
> >and that then slows everything down.
>
> At least on my machines, it made the overall work rate go up. Both
> simoop and prod are 10-15% faster.

Ok, I'll see if I can tune the workload here to behave more like
this....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html