Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!

On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > For example, extent size hints just happen to skip delayed allocation.
> > > I don't recall the exact history, but I don't think this was always the
> > > case for extsz hints.
> > 
> > It wasn't, but extent size hints + delalloc never played nicely and
> > could corrupt or expose stale data and caused all sorts of problems
> > at ENOSPC because delalloc reservations are unaware of alignment
> > requirements for extent size hints. Hence to make extent size hints
> > work for buffered writes, I simply made them work the same way as
> > direct writes (i.e. immediate allocation w/ unwritten extents).
> > 
> > > So would the close time eofb trim be as
> > > problematic as for extsz hint files if the behavior of the latter
> > > changed back to using delayed allocation?
> > 
> > Yes, but if it's a write-once file, that doesn't matter. If it's
> > write-many, then we'd retain the post-eof blocks...
> > 
> > > I think a patch for that was
> > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > functionality which still had unresolved issues (IIRC).
> > 
> > *nod*
> > 
> > > From another angle, would a system that held files open for a
> > > significant amount of time relative to a one-time write such that close
> > > consistently occurred after writeback (and thus delalloc conversion) be
> > > susceptible to the same level of free space fragmentation as shown
> > > above?
> > 
> > If the file is held open for writing for a long time, we have to
> > assume that they are going to write again (and again) so we should
> > leave the EOF blocks there. If they are writing slower than the eofb
> > gc, then there's nothing more we can really do in that case...
> > 
> 
> I'm not necessarily sure that this condition is always a matter of
> writing too slow. It very well may be true, but I'm curious if there are
> parallel copy scenarios (perhaps under particular CPU/RAM configs)
> where we could end up doing a large number of one-time file writes and
> not doing the release time trim until an underlying extent (with
> post-eof blocks) has been converted in more cases than not.

I'm not sure I follow what sort of workload and situation you are
describing here. Are you talking about the effect of an EOFB gc pass
during ongoing writes?

> Perhaps this is more of a question of writeback behavior than anything
> in XFS, but if that can occur with files on the smaller side it's a
> clear path to free space fragmentation. I think it's at least worth
> considering whether we can improve things, but of course this requires
> significantly more thought and analysis to determine whether that's an
> actual problem and if the cure is worse than the disease. :P
> 
> > > (I'm not sure if/why anybody would actually do that.. userspace
> > > fs with an fd cache perhaps? It's somewhat beside the point,
> > > anyways...)).
> > 
> > *nod*
> > 
> > > More testing and thought is probably required. I _was_ wondering if we
> > > should consider something like always waiting as long as possible to
> > > eofb trim already converted post-eof blocks, but I'm not totally
> > > convinced that actually has value. For files that are not going to see
> > > any further appends, we may have already lost since the real post-eof
> > > blocks will end up truncated just the same whether that happens sooner
> > > or is deferred until inode reclaim.
> > 
> > If the writes are far enough apart, then we lose any IO optimisation
> > advantage of retaining post-eof blocks (induces seeks because
> > location of new writes is fixed ahead of time). Then it just becomes
> > a fragmentation avoidance mechanism. If the writes are slow enough,
> > fragmentation really doesn't matter a whole lot - it's when writes
> > are frequent and we trash the post-eof blocks quickly that it
> > matters.
> > 
> 
> It sounds like what you're saying is that it doesn't really matter
> either way at this point. There's no perf advantage to keeping the eof
> blocks in this scenario, but there's also no real harm in deferring the
> eofb trim of physical post-eof blocks because any future free space
> fragmentation damage has already been done (assuming no more writes come
> in).

Essentially, yes.

> The thought above was tip-toeing around the idea of (in addition to the
> one-time trim heuristic you mentioned above) never doing a release time
> trim of non-delalloc post-eof blocks.

Except we want to trim blocks in the cases where it's a write-once
file that has been fsync()d or written by O_DIRECT w/ really large
extent size hints....

It seems to me that once we start making exceptions to the general
rule, we end up in a world of corner cases and, most likely,
unreliable heuristics.
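
Just to make that concrete, here's a purely hypothetical sketch of
the sort of release-time policy we're talking about - none of this
exists in the tree, the struct and function names are made up:

        #include <stdbool.h>

        /* Hypothetical per-inode state we'd need to track to make the
         * release-time decision - not the real xfs_inode. */
        struct release_state {
                bool    write_once;     /* looks like a one-shot write */
                bool    fsynced;        /* data already forced to disk */
                bool    odirect;        /* written via O_DIRECT */
                bool    large_extsz;    /* big extent size hint in play */
                bool    delalloc_eof;   /* post-EOF blocks still delalloc */
        };

        static bool should_trim_post_eof(const struct release_state *st)
        {
                /* the write-once cases called out above */
                if (st->write_once &&
                    (st->fsynced || st->odirect || st->large_extsz))
                        return true;
                /* delalloc post-EOF blocks are still cheap to drop */
                if (st->delalloc_eof)
                        return true;
                /* converted post-EOF blocks: leave them for background gc */
                return false;
        }

Every one of those branches is another special case for userspace to
accidentally (or deliberately) defeat.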

> > IOWs, speculative prealloc beyond EOF is not just about preventing
> > fragmentation - it also helps minimise the per-write CPU overhead of
> > delalloc space accounting. (i.e. allows faster write rates into
> > cache). IOWs, for anything more than really small files, we want
> > to be doing speculative delalloc on the first time the file is
> > written to.
> > 
> 
> Ok, this paper refers to CPU overhead as it contributes to lack of
> scalability.

Well, those were the experiments being performed. I'm using
it as an example of how per-write overhead is actually important to
throughput. Ignore the "global lock caused overall throughput
issues" because we don't have that problem any more, and instead
look at it as a demonstration of "anything that slows down a write()
reduces per-thread throughput".

> It certainly makes sense that a delayed allocation buffered
> write does more work than a non-alloc write, but this document isn't
> necessarily claiming that as evidence of measurably faster or slower
> single threaded buffered writes. Rather, that overhead becomes
> measurable with enough parallel buffered writes contending on a global
> allocation critical section (section 5.1 "Spinlocks in Hot Paths").

It's just an example of how increasing per-write overhead impacts
performance. The same can be said for all the memory reclaim
mods - they also reduced per-write overhead because memory
allocation for the page cache was much, much faster and required
much less CPU, so each write spent less time allocating pages to
cache the data to be written. Hence, faster writes....

> Subsequently, section 6.1.2 addresses that particular issue via the
> introduction of per-cpu accounting. It looks like there is further
> discussion around the general link between efficiency and performance,
> which makes sense, but I don't think that draws a definite conclusion
> that speculative preallocation needs to be introduced immediately on
> sequential buffered writes.

Sure, because the post-EOF preallocation that was already happening
wasn't the problem, and we couldn't easily address the per-page
block mapping overhead with generic code changes that would be
useful in any circumstance (the Linux community at the time was hostile
towards bringing anything useful to XFS into the core kernel code).

And this was only a small machine (24 CPUs) so we had to address the
locking problem because increasing speculative prealloc wouldn't
help solve the global lock contention problems on the 2048p machines
that SGI was shipping at the time....

> What I think it suggests is that we need to
> consider the potential scalability impact of any prospective change in
> speculative preallocation behavior (along with the other tradeoffs
> associated with preallocation) because less aggressive preallocation
> means more buffered write overhead.

It's not a scalability problem now because of the fixes made back
then - it's now just a per-write thread latency and throughput
issue.

What is not obvious from the paper is that I designed the XFS
benchmarks so that they wrote each file into a separate directory to
place the files in different AGs so that they didn't fragment due to
interleaving writes, i.e. so we could directly control the number of
sequential write streams going to the disks.

We had to do that to minimise the seek overhead of the disks so
they'd run at full bandwidth rather than being seek bound - we had
to keep the 256 disks at >97% bandwidth utilisation to get to
10GB/s, so even one extra seek per disk per second would have put
that out of reach.
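
(Back of the envelope, taking ~40MB/s of streaming bandwidth and
~10ms per seek as ballpark figures for disks of that era:

        256 disks x ~40MB/s   ~= 10.2GB/s aggregate
        1 extra seek/disk/sec ~= 10ms/s lost ~= ~1% of a disk's time
                              ~= ~100MB/s of aggregate bandwidth

which is enough to drop below the >97% utilisation we needed.)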

IOWs, we'd carefully taken anything to do with fragmentation out of
the picture entirely, both for direct and buffered writes. We were
just focussed on write efficiency and maximising total throughput
and that largely made post-EOF preallocation irrelevant to us.

> BTW for historical context.. was speculative preallocation a thing when
> this paper was written?

Yes. Speculative prealloc goes way back to the 90s on Irix. It
was first made configurable in XFS via the biosize mount option
added with v3 superblocks in 1997, but the initial Linux port only
allowed up to 64k.

In 2005, the Linux mount option allowed biosize to be extended to
1GB, which made sense because >4GB allocation groups (mkfs enabled
them late 2003) were now starting to be widely used and so users
were reporting new large AG fragmentation issues that had never been
seen before, i.e. it was now practical to have contiguous multi-GB
extents in files and the delalloc code was struggling to create
them, so having EOF-prealloc be able to make use of that capability
was needed....

And then auto-tuning made sense because more and more people were
having to use the mount option in more general workloads to avoid
fragmentation.

And ever since then it's been about tweaking the algorithms to stop
userspace from defeating it and re-introducing fragmentation
vectors...

> The doc suggests per-page block allocation needs
> to be reduced on XFS,

Yup. It still stands - that was the specific design premise that
led to the iomap infrastructure we now have, i.e. do block mapping
once per write() call, not once per page per write() call. Early
prototypes got us a 10-20% increase in single threaded buffered write
throughput on slow CPUs, and with the current code some high
throughput buffered write cases either went up by 30-40% or CPU
usage went down by that amount.
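
Roughly, the shape of that change in illustrative C - this is not the
actual iomap code, map_range() and friends are invented for the
example:

        #include <stdint.h>
        #include <stdio.h>

        #define PAGE_SIZE 4096u

        static unsigned int mapping_calls;  /* count the expensive lookups */

        /* Stand-in for the extent lookup / delalloc reservation - the
         * per-call overhead that used to be paid once per page. */
        static void map_range(uint64_t pos, uint64_t len)
        {
                (void)pos;
                (void)len;
                mapping_calls++;
        }

        /* iomap-style buffered write: one mapping call per write(), then
         * the per-page work is only the copy into the page cache. */
        static void buffered_write(uint64_t pos, uint64_t len)
        {
                map_range(pos, len);            /* once per write() */
                for (uint64_t off = 0; off < len; off += PAGE_SIZE) {
                        /* copy one page worth of data at pos + off */
                }
        }

        int main(void)
        {
                buffered_write(0, 1024 * 1024);         /* 256 pages copied... */
                printf("mapping calls: %u\n", mapping_calls); /* ...one lookup */
                return 0;
        }

A per-page mapping scheme would do that lookup 256 times for the same
write() call.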

> but also indicates that XFS had minimal
> fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
> fragmentation avoidance to delayed allocation (not necessarily
> preallocation).

Right, because none of the other filesystems had delayed allocation
and they were interleaving concurrent allocations at write() time,
i.e. they serialised write() for long periods of time while they did
global fs allocation, whilst XFS had a low overhead delalloc path
with a tiny global critical section. I.e. delalloc allowed us to
defer allocation to writeback where it isn't so throughput critical
and better allocation decisions can be made.

IOWs, delalloc was the primary driver of both the fragmentation
and scalability differences reported in the paper. That XFS did
small amounts of in-memory-only prealloc beyond EOF during delalloc
was irrelevant - it's just part of the delalloc mechanism in this
context....
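
To put that split in code form (sketch only, made-up names, not the
real XFS code paths):

        #include <stdbool.h>
        #include <stdint.h>

        static int64_t fs_free_blocks = 1 << 20;   /* toy global counter */

        /* write() path: just reserve space.  No AG selection, no btree
         * updates - the "allocation" exists only in memory, so the
         * global critical section is tiny. */
        static bool delalloc_reserve(int64_t nblocks)
        {
                if (fs_free_blocks < nblocks)
                        return false;           /* ENOSPC */
                fs_free_blocks -= nblocks;      /* per-cpu/atomic in reality */
                return true;
        }

        /* writeback path: the whole dirty range (plus any post-EOF
         * speculative blocks) is visible now, so pick a location and
         * carve out one large contiguous extent - off the write() fast
         * path, where allocation cost doesn't hurt throughput. */
        static void delalloc_convert(uint64_t offset, int64_t nblocks)
        {
                (void)offset;
                (void)nblocks;
                /* real extent allocation and metadata updates go here */
        }

        int main(void)
        {
                if (delalloc_reserve(16))       /* cheap, at write() time */
                        delalloc_convert(0, 16); /* deferred to writeback */
                return 0;
        }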

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx


