Re: sleeps and waits during io_submit

On 12/01/2015 06:29 PM, Brian Foster wrote:
On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:

On 12/01/2015 06:01 PM, Brian Foster wrote:
On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
On 12/01/2015 04:56 PM, Brian Foster wrote:
On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
On 12/01/2015 03:11 PM, Brian Foster wrote:
On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
On 11/30/2015 06:14 PM, Brian Foster wrote:
On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
On 11/30/2015 04:10 PM, Brian Foster wrote:
...
The agsize/agcount mkfs-time heuristics change depending on the type of
storage. A single AG can be up to 1TB and if the fs is not considered
"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
default up to 4TB. If a stripe unit is set, the agsize/agcount is
adjusted depending on the size of the overall volume (see
xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
We'll experiment with this.  Surely it depends on more than the amount of
storage?  If you have a high op rate you'll be more likely to excite
contention, no?

Sure. The absolute optimal configuration for your workload probably
depends on more than storage size, but mkfs doesn't have that
information. In general, it tries to use the most reasonable
configuration based on the storage and expected workload. If you want to
tweak it beyond that, indeed, the best bet is to experiment with what
works.
We will do that.
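
For reference, the geometry mkfs actually chose can be read back from any open fd on the filesystem via the XFS_IOC_FSGEOMETRY ioctl; a minimal sketch, assuming the xfsprogs development headers are available and "/mnt/test" is the mount point:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>		/* struct xfs_fsop_geom, XFS_IOC_FSGEOMETRY */

int main(int argc, char **argv)
{
	struct xfs_fsop_geom geo;
	const char *path = argc > 1 ? argv[1] : "/mnt/test";
	int fd = open(path, O_RDONLY);

	if (fd < 0 || ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) < 0) {
		perror("XFS_IOC_FSGEOMETRY");
		return 1;
	}
	printf("agcount=%u agsize=%u blocks (%llu MB each) sunit=%u swidth=%u\n",
	       geo.agcount, geo.agblocks,
	       (unsigned long long)geo.agblocks * geo.blocksize >> 20,
	       geo.sunit, geo.swidth);
	close(fd);
	return 0;
}

xfs_info on the mount point reports the same numbers, and mkfs.xfs -d agcount=N is the knob to experiment with.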

Are those locks held around I/O, or just CPU operations, or a mix?
I believe it's a mix of modifications and I/O, though it looks like some
of the I/O cases don't necessarily wait on the lock. E.g., the AIL
pushing case will trylock and defer to the next list iteration if the
buffer is busy.

Ok.  For us sleeping in io_submit() is death because we have no other thread
on that core to take its place.

The above is with regard to metadata I/O, whereas io_submit() is
obviously for user I/O.
Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
async tasks?  I don't mind them blocking each other as long as they let my
io_submit alone.

Yeah, it can trigger metadata reads, force the log (the stale buffer
example) or push the AIL (wait on log space). Metadata changes made
directly via your I/O request are logged/committed via transactions,
which are generally processed asynchronously from that point on.

  io_submit() can probably block in a variety of
places afaict... it might have to read in the inode extent map, allocate
blocks, take inode/ag locks, reserve log space for transactions, etc.
Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
if somebody else has to do it.

I'm not following... if the fs needs to read in the inode extent map to
prepare for an allocation, what else can the thread do but wait? Are you
suggesting the request kick off whatever the blocking action happens to
be asynchronously and return with an error such that the request can be
retried later?
Not quite, it should be invisible to the caller.

That is, the code called by io_submit() (file_operations::write_iter, it
seems to be called today) can kick off this operation and have it continue
from where it left off.
Isn't that generally what happens today?
You tell me.  According to $subject, apparently not enough.  Maybe we're
triggering it more often, or we suffer more when it does trigger (the latter
probably more likely).

The original mail describes looking at the sched:sched_switch tracepoint
which on a quick look, appears to fire whenever a cpu context switch
occurs. This likely triggers any time we wait on an I/O or a contended
lock (among other situations I'm sure), and it signifies that something
else is going to execute in our place until this thread can make
progress.

For us, nothing else can execute in our place; we usually have exactly one thread per logical core. So we are heavily dependent on io_submit() not sleeping.

The case of a contended lock is, to me, less worrying. It can be reduced by using more allocation groups, since the allocation group is apparently the shared resource under contention.

The case of waiting for I/O is much more worrying, because I/O latencies are much higher. But it seems like most of the DIO path does not trigger locking around I/O (and we are careful to avoid the ones that do, like writing beyond EOF).

(sorry for repeating myself, I have the feeling we are talking past each other and want to be on the same page)
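
For concreteness, a minimal libaio sketch of the kind of submission path being discussed (the file name, sizes and alignment values are made up, not taken from the application): the intent is that io_submit() only queues the I/O and completions are reaped later, so any sleep inside io_submit() stalls the whole core.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <libaio.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd, ret;

	ret = io_setup(128, &ctx);
	if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

	fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* O_DIRECT wants sector-aligned buffers, lengths and offsets */
	if (posix_memalign(&buf, 4096, 128 * 1024)) return 1;
	memset(buf, 0, 128 * 1024);

	io_prep_pwrite(&cb, fd, buf, 128 * 1024, 0);
	ret = io_submit(ctx, 1, cbs);		/* should only queue, never sleep */
	if (ret != 1) { fprintf(stderr, "io_submit: %s\n", strerror(-ret)); return 1; }

	ret = io_getevents(ctx, 1, 1, &ev, NULL);	/* reaped later from the reactor loop */
	if (ret == 1)
		printf("write completed, res=%ld\n", (long)ev.res);

	close(fd);
	io_destroy(ctx);
	return 0;
}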


  We submit an I/O which is
asynchronous in nature and wait on a completion, which causes the cpu to
schedule and execute another task until the completion is set by I/O
completion (via an async callback). At that point, the issuing thread
continues where it left off. I suspect I'm missing something... can you
elaborate on what you'd do differently here (and how it helps)?
Just apply the same technique everywhere: convert locks to trylock +
schedule a continuation on failure.
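
As an illustration of the shape being proposed, a user-space sketch only (the kernel primitives would differ, e.g. mutex_trylock()/down_trylock() plus a work item or the existing completion callbacks; all names below are made up):

#include <pthread.h>
#include <stdio.h>

/* Shows the shape of "trylock + schedule a continuation on failure". */
struct continuation {
	void (*fn)(void *arg);		/* the rest of the operation */
	void *arg;
	struct continuation *next;
};

static struct continuation *deferred;	/* would be a per-lock (per-AG) queue */

static void defer(struct continuation *c)
{
	/* park the continuation; the current lock holder runs it on unlock */
	c->next = deferred;
	deferred = c;
}

/* Instead of sleeping in mutex_lock()/down(), try the lock and defer. */
static int run_or_defer(pthread_mutex_t *lock, struct continuation *c)
{
	if (pthread_mutex_trylock(lock) == 0) {
		c->fn(c->arg);			/* fast path: run inline */
		pthread_mutex_unlock(lock);
		return 0;
	}
	defer(c);				/* slow path: return to caller immediately */
	return -1;
}

static pthread_mutex_t ag_lock = PTHREAD_MUTEX_INITIALIZER;

static void do_alloc(void *arg)
{
	printf("allocation for request %ld ran inline\n", (long)arg);
}

int main(void)
{
	struct continuation c = { do_alloc, (void *)1L, NULL };

	if (run_or_defer(&ag_lock, &c))
		printf("lock busy, continuation deferred\n");
	/* a real unlock path would drain 'deferred' here instead of dropping it */
	return 0;
}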

I'm certainly not an expert on the kernel scheduling, locking and
serialization mechanisms, but my understanding is that most things
outside of spin locks are reschedule points. For example, the
wait_for_completion() calls XFS uses to wait on I/O boil down to
schedule_timeout() calls. Buffer locks are implemented as semaphores and
down() can end up in the same place.

But, for the most part, XFS seems to be able to avoid sleeping. The call to __blockdev_direct_IO only launches the I/O, so any locking is only around cpu operations and, unless there is contention, won't cause us to sleep in io_submit().

Trying to follow the code, it looks like xfs_get_blocks_direct (and __blockdev_direct_IO's get_block parameter in general) is synchronous, so we're just relying on everything already being in cache; if it isn't, we block right there. I really hope I'm misreading this and that some other magic is happening elsewhere.

Brian

Seastar (the async user framework which we use to drive xfs) makes writing
code like this easy, using continuations; but of course from ordinary
threaded code it can be quite hard.

btw, there was an attempt to make ext[34] async using this method, but I
think it was ripped out.  Yes, the mortal remains can still be seen with
'git grep EIOCBQUEUED'.

It sounds to me that first and foremost you want to make sure you don't
have however many parallel operations you typically have running
contending on the same inodes or AGs. Hint: creating files under
separate subdirectories is a quick and easy way to allocate inodes under
separate AGs (the agno is encoded into the upper bits of the inode
number).
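
For that kind of testing, something as simple as the sketch below (directory and file names made up) is enough: XFS tends to place each new directory in a different AG and allocates a file's inode near its parent directory, so one subdirectory per shard spreads the inodes across AGs:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	char path[64];
	int shard, fd;

	for (shard = 0; shard < 8; shard++) {
		snprintf(path, sizeof(path), "shard-%d", shard);
		mkdir(path, 0755);		/* one directory per shard */

		snprintf(path, sizeof(path), "shard-%d/commitlog-0", shard);
		fd = open(path, O_WRONLY | O_CREAT, 0644);
		if (fd < 0) { perror(path); return 1; }
		/* 'ls -i' afterwards shows the inode numbers; the agno sits in
		 * the upper bits, so big jumps between shards mean separate AGs */
		close(fd);
	}
	return 0;
}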
Unfortunately our directory layout cannot be changed.  And doesn't this
require having agcount == O(number of active files)?  That is easily in the
thousands.

I think Glauber's O(nr_cpus) comment is probably the more likely
ballpark, but really it's something you'll probably just need to test to
see how far you need to go to avoid AG contention.

I'm primarily throwing the subdir thing out there for testing purposes.
It's just an easy way to create inodes in a bunch of separate AGs so you
can determine whether/how much it really helps with modified AG counts.
I don't know enough about your application design to really comment on
that...
We have O(cpus) shards that operate independently.  Each shard writes 32MB
commitlog files (that are pre-truncated to 32MB to allow concurrent writes
without blocking); the files are then flushed and closed, and later removed.
In parallel there are sequential writes and reads of large files (using 128kB
buffers), as well as random reads.  Files are immutable (append-only), and
if a file is being written, it is not concurrently read.  In general files
are not shared across shards.  All I/O is async and O_DIRECT.  open(),
truncate(), fdatasync(), and friends are called from a helper thread.

As far as I can tell it should be a very friendly load for XFS and SSDs.
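
A sketch of that per-file setup step, as it might run from the helper thread (the file name is made up; 32MB is the size from the description above). The posix_fallocate() call corresponds to the "preallocate" idea mentioned just below: with the blocks reserved up front, the O_DIRECT writes only convert unwritten extents rather than allocate.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define COMMITLOG_SIZE (32 * 1024 * 1024)	/* 32MB, per the description above */

int main(void)
{
	int err;
	int fd = open("commitlog-0.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	if (fd < 0) { perror("open"); return 1; }

	/* pre-truncate so concurrent in-range O_DIRECT writes never extend EOF */
	if (ftruncate(fd, COMMITLOG_SIZE) < 0) { perror("ftruncate"); return 1; }

	/* optional: reserve the blocks too; later writes then convert unwritten
	 * extents instead of allocating */
	err = posix_fallocate(fd, 0, COMMITLOG_SIZE);
	if (err) { fprintf(stderr, "posix_fallocate: %s\n", strerror(err)); return 1; }

	close(fd);
	return 0;
}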

  Reducing the frequency of block allocation/frees might also be
another help (e.g., preallocate and reuse files,
Isn't that discouraged for SSDs?

Perhaps, if you're referring to the fact that the blocks are never freed
and thus never discarded..? Are you running fstrim?
mount -o discard.  And yes, overwrites are supposedly more expensive than
trimming old data + allocating new data, but if you compare that with the work
XFS has to do, perhaps the tradeoff is bad.

Ok, my understanding is that '-o discard' is not recommended in favor of
periodic fstrim for performance reasons, but that may or may not still
be the case.
I understand that most SSDs have queued trim these days, but maybe I'm
optimistic.




