Re: agcount for 2TB, 4TB and 8TB drives

On 10/16/2017 01:00 AM, Dave Chinner wrote:
On Sun, Oct 15, 2017 at 12:36:03PM +0300, Avi Kivity wrote:

On 10/15/2017 01:42 AM, Dave Chinner wrote:
On Fri, Oct 13, 2017 at 11:13:24AM +0300, Avi Kivity wrote:
On 10/11/2017 01:55 AM, Dave Chinner wrote:
On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
On 10/10/2017 01:03 AM, Dave Chinner wrote:
On 10/09/2017 02:23 PM, Dave Chinner wrote:
On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
Sure, that might be the IO concurrency the SSD sees and handles, but
you very rarely require that much allocation parallelism in the
workload. Only a small amount of the IO submission path is actually
allocation work, so a single AG can provide plenty of async IO
parallelism before an AG is the limiting factor.
Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
AGs don't issue IO. Applications issue IO, the filesystem allocates
space from AGs according to the write IO that passes through it.
What I meant was I/O in order to satisfy an allocation (read from
the free extent btree or whatever), not the application's I/O.
Once you're in the per-AG allocator context, it is single threaded
until the allocation is complete. We do things like btree block
readahead to minimise IO wait times, but we can't completely hide
things like metadata read IO wait time when it is required to make
progress.
I see, thanks. Will RWF_NOWAIT detect the need to do I/O for the
free space btree, or just contention? (I expect the latter from the
patches I've seen, but perhaps I missed something).
No, it checks at a high level whether allocation is needed (i.e. IO
into a hole) and if allocation is needed, it punts the IO
immediately to the background thread and returns to userspace. i.e.
it never gets near the allocator to begin with....
Interesting, that's both good and bad. Good, because we avoided a
potential stall. Bad, because if the stall would not actually have
happened (lock not contended, btree nodes cached) then we got punted
to the helper thread which is a more expensive path.
Avoiding latency has costs in complexity, resources and CPU time.
That's why we've never ended up with a fully generic async syscall
interface in the kernel - every time someone tries, it dies the
death of complexity.

RWF_NOWAIT is simple, easy to maintain and has, in most cases, no
observable overhead.

There is no observable overhead in the kernel, but there will be some for the application. As soon as we cross a hint boundary, writes start to fail, and the application needs to move them to a helper thread and re-submit them. These duplicate submissions continue until the helper thread is able to respond and the first write manages to allocate the space.

Without RWF_NOWAIT, there are two possibilities: either you get lucky and the first write to cross the boundary doesn't block, or you get unlucky and you stall. There's no doubt that RWF_NOWAIT is a lot better, but it does cause the system to do some more work. I guess it can be amortized away with larger hints.
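
For illustration, a rough sketch of that submit-or-punt pattern, using the synchronous pwritev2() entry point just to show the RWF_NOWAIT semantics (depending on kernel version the flag may only be honoured on the AIO submission path); punt_to_helper() is a hypothetical stand-in for the helper-thread hand-off:

#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008   /* per-IO non-blocking flag, kernels >= 4.13 */
#endif

/* Hypothetical helper: hands the write to a background thread that
 * retries it without RWF_NOWAIT and may block on allocation there. */
extern void punt_to_helper(int fd, const struct iovec *iov, int iovcnt,
                           off_t off);

static ssize_t submit_write(int fd, const struct iovec *iov, int iovcnt,
                            off_t off)
{
        ssize_t ret = pwritev2(fd, iov, iovcnt, off, RWF_NOWAIT);

        if (ret >= 0)
                return ret;             /* fast path: completed without blocking */

        if (errno == EAGAIN || errno == EOPNOTSUPP) {
                /* The write would block (e.g. it crossed a hint boundary and
                 * needs allocation), or RWF_NOWAIT is unsupported here: take
                 * the slower helper-thread path and retry there. */
                punt_to_helper(fd, iov, iovcnt, off);
                return 0;
        }

        return ret;                     /* real IO error */
}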

In fact, we don't even need to try the write: we know that every
32MB/128k = 256 writes we will hit an allocation. Perhaps we can
fallocate() the next 32MB chunk while writing to the previous one.
fallocate will block *all* IO and mmap faults on that file, not just
the ones that require allocation. fallocate creates a complete IO
submission pipeline stall, punting all new IO submissions to the
background worker where they will block until fallocate completes.

Ok, I'll stay away from it, except at close time, to remove unused extents.

IOWs, in terms of overhead, IO submission efficiency and IO pipeline
bubbles, fallocate is close to the worst thing you can possibly do.
Extent size hints are far more efficient and less intrusive than
manually using fallocate from userspace.
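
For reference, a rough sketch of setting such a hint programmatically; the 32MB figure mirrors the example used earlier in this thread, and the same thing can be done from the shell with xfs_io -c "extsize 32m" <file> (on older systems the equivalent definitions live in <xfs/xfs_fs.h>). The hint is normally set right after creating the file, before any extents are allocated:

#include <sys/ioctl.h>
#include <linux/fs.h>           /* struct fsxattr, FS_IOC_FS[GS]ETXATTR */

/* Ask XFS to allocate space for this file in 'bytes'-sized chunks (the
 * extent size hint), so most writes never need to enter the allocator. */
static int set_extsize_hint(int fd, unsigned int bytes)
{
        struct fsxattr fsx;

        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;

        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = bytes;        /* in bytes, e.g. 32 << 20 for 32MB */

        return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}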

If fallocate() is fast enough, writes will never block/fail. If
it's not, then we'll block/fail, but the likelihood is reduced. We
can even increase the chunk size if we see we're getting blocked.
If you call fallocate, other AIO writes will always get blocked
because fallocate creates an IO submission barrier. fallocate might
be fast, but it's also a total IO submission serialisation point and
so has a much more significant effect on IO submission latency when
compared to doing allocation directly in the IO path via extent size
hints...

Got it.

Even better would be if XFS would detect the sequential write and
start allocating ahead of it.
That's what delayed allocation does with buffered IO. We
specifically do not do that with direct IO because it's direct IO
and we only do exactly what the IO the user submits requires us to
do.

As it is, I'm not sure that it would gain us anything over extent
size hints because they are effectively doing exactly the same thing
(i.e. allocate ahead) on every write that hits a hole beyond
EOF when extending the file....

If I understand correctly, you do get momentary serialization when you cross a hint boundary, while with allocate ahead, you would not.

