Re: sleeps and waits during io_submit

Glauber Costa <glauber@xxxxxxxxxxxx> · Tue, 1 Dec 2015 14:07:41 -0500

Hi Brian,

>
> Either way, the extents have to be read in at some point and I'd expect
> that cpu to schedule onto some other task while that thread waits on I/O
> to complete (read-ahead could also be a factor here, but I haven't
> really dug into how that is triggered for buffers).
>

Being a datastore, we expect to run practically alone on any box we're
on. That means there is no other task to run. If io_submit
blocks, the system blocks. The assumption that blocking will just
yield the processor to another thread makes sense in the general case,
where you assume more than one application running and/or more than
one thread within the same application.

From our user's perspective, however, every time that happens we can't
make progress. It doesn't really matter where it blocks.

If io_submit returns without blocking, we can still push more work,
even though the kernel is still not ready to proceed. If it blocks,
we're dead.
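
To put rough numbers on that, here is a toy model of a single submitting
thread. It is only a sketch: the function name and all the microsecond
figures are made up for illustration, and it assumes the device can work
on every queued request in parallel, which real hardware only
approximates.

```python
def time_to_drain(n_requests, submit_us, device_us, blocked_us=0):
    """Time (in us) until the last of n_requests completes, for ONE submitter.

    submit_us  -- CPU cost of a submit call that returns immediately
    device_us  -- device latency per request (requests overlap on the device)
    blocked_us -- extra time the thread sleeps inside each submit call
    All values are hypothetical inputs, not measurements.
    """
    clock = 0       # time as seen by the single submitting thread
    last_done = 0   # completion time of the latest request so far
    for _ in range(n_requests):
        # The thread cannot do anything else while it is inside the call.
        clock += submit_us + blocked_us
        # Once submitted, the request proceeds on the device in parallel.
        last_done = max(last_done, clock + device_us)
    return last_done


# Non-blocking submit: device latency is paid once, overlapped.
print(time_to_drain(100, submit_us=5, device_us=100))                  # 600
# Each submit sleeps 100us: the sleeps are serialized on our only thread.
print(time_to_drain(100, submit_us=5, device_us=100, blocked_us=100))  # 10600
```

With nothing else to schedule, the time spent sleeping in the second case
is simply lost.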

> Brian
>
>> >Brian
>> >
>> >>>>Seastar (the async user framework which we use to drive xfs) makes writing
>> >>>>code like this easy, using continuations; but of course from ordinary
>> >>>>threaded code it can be quite hard.
>> >>>>
>> >>>>btw, there was an attempt to make ext[34] async using this method, but I
>> >>>>think it was ripped out.  Yes, the mortal remains can still be seen with
>> >>>>'git grep EIOCBQUEUED'.
>> >>>>
>> >>>>>>>It sounds to me that first and foremost you want to make sure you don't
>> >>>>>>>have however many parallel operations you typically have running
>> >>>>>>>contending on the same inodes or AGs. Hint: creating files under
>> >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under
>> >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode
>> >>>>>>>number).
>> >>>>>>Unfortunately our directory layout cannot be changed.  And doesn't this
>> >>>>>>require having agcount == O(number of active files)?  That is easily in the
>> >>>>>>thousands.
>> >>>>>>
>> >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely
>> >>>>>ballpark, but really it's something you'll probably just need to test to
>> >>>>>see how far you need to go to avoid AG contention.
>> >>>>>
>> >>>>>I'm primarily throwing the subdir thing out there for testing purposes.
>> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you
>> >>>>>can determine whether/how much it really helps with modified AG counts.
>> >>>>>I don't know enough about your application design to really comment on
>> >>>>>that...
>> >>>>We have O(cpus) shards that operate independently.  Each shard writes 32MB
>> >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>> >>>>without blocking); the files are then flushed and closed, and later removed.
>> >>>>In parallel there are sequential writes and reads of large files (using 128kB
>> >>>>buffers), as well as random reads.  Files are immutable (append-only), and
>> >>>>if a file is being written, it is not concurrently read.  In general files
>> >>>>are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>> >>>>truncate(), fdatasync(), and friends are called from a helper thread.
>> >>>>
>> >>>>As far as I can tell it should be a very friendly load for XFS and SSDs.
>> >>>>
>> >>>>>>>  Reducing the frequency of block allocation/frees might also be
>> >>>>>>>another help (e.g., preallocate and reuse files,
>> >>>>>>Isn't that discouraged for SSDs?
>> >>>>>>
>> >>>>>Perhaps, if you're referring to the fact that the blocks are never freed
>> >>>>>and thus never discarded..? Are you running fstrim?
>> >>>>mount -o discard.  And yes, overwrites are supposedly more expensive than
>> >>>>trim old data + allocate new data, but compared with the work XFS has to
>> >>>>do, perhaps the tradeoff is bad.
>> >>>>
>> >>>Ok, my understanding is that '-o discard' is discouraged in favor of
>> >>>periodic fstrim for performance reasons, but that may or may not still
>> >>>be the case.
>> >>I understand that most SSDs have queued trim these days, but maybe I'm
>> >>optimistic.
>> >>
>>
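
On the hint quoted above that the agno is encoded into the upper bits of
the inode number: a minimal sketch of how one could check which AG a file
landed in. The shift is the sum of the superblock fields sb_agblklog and
sb_inopblog; the function name and the geometry values below are
assumptions for illustration and have to be replaced with the values of
the filesystem under test. The inode number itself can come from
os.stat(path).st_ino.

```python
def ino_to_agno(ino, agblklog, inopblog):
    """Return the allocation group number encoded in an XFS inode number."""
    # The low (agblklog + inopblog) bits locate the inode inside its AG;
    # everything above them is the AG number.
    return ino >> (agblklog + inopblog)


# Hypothetical geometry: agblklog=22, inopblog=3 (8 inodes per block).
AGBLKLOG, INOPBLOG = 22, 3

ino_a = (0 << 25) | 131   # an inode in AG 0
ino_b = (5 << 25) | 131   # same AG-relative position, but in AG 5

print(ino_to_agno(ino_a, AGBLKLOG, INOPBLOG),
      ino_to_agno(ino_b, AGBLKLOG, INOPBLOG))   # 0 5
```

Running this over the files each shard creates should show whether the
subdirectory trick actually spreads them across AGs.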

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs