On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
> 
> 
> On 12/01/2015 06:01 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> >>
> >>On 12/01/2015 04:56 PM, Brian Foster wrote:
> >>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> >>>>On 12/01/2015 03:11 PM, Brian Foster wrote:
> >>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> >>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote:
> >>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote:
> >>>>>...
> >>>>>>>The agsize/agcount mkfs-time heuristics change depending on the type of
> >>>>>>>storage. A single AG can be up to 1TB and if the fs is not considered
> >>>>>>>"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
> >>>>>>>default up to 4TB. If a stripe unit is set, the agsize/agcount is
> >>>>>>>adjusted depending on the size of the overall volume (see
> >>>>>>>xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
> >>>>>>We'll experiment with this. Surely it depends on more than the amount of
> >>>>>>storage? If you have a high op rate you'll be more likely to excite
> >>>>>>contention, no?
> >>>>>>
> >>>>>Sure. The absolute optimal configuration for your workload probably
> >>>>>depends on more than storage size, but mkfs doesn't have that
> >>>>>information. In general, it tries to use the most reasonable
> >>>>>configuration based on the storage and expected workload. If you want to
> >>>>>tweak it beyond that, indeed, the best bet is to experiment with what
> >>>>>works.
> >>>>We will do that.
> >>>>
> >>>>>>>>Are those locks held around I/O, or just CPU operations, or a mix?
> >>>>>>>I believe it's a mix of modifications and I/O, though it looks like some
> >>>>>>>of the I/O cases don't necessarily wait on the lock. E.g., the AIL
> >>>>>>>pushing case will trylock and defer to the next list iteration if the
> >>>>>>>buffer is busy.
> >>>>>>>
> >>>>>>Ok. For us sleeping in io_submit() is death because we have no other thread
> >>>>>>on that core to take its place.
> >>>>>>
> >>>>>The above is with regard to metadata I/O, whereas io_submit() is
> >>>>>obviously for user I/O.
> >>>>Won't io_submit() also trigger metadata I/O? Or is that all deferred to
> >>>>async tasks? I don't mind them blocking each other as long as they let my
> >>>>io_submit alone.
> >>>>
> >>>Yeah, it can trigger metadata reads, force the log (the stale buffer
> >>>example) or push the AIL (wait on log space). Metadata changes made
> >>>directly via your I/O request are logged/committed via transactions,
> >>>which are generally processed asynchronously from that point on.
> >>>
> >>>>> io_submit() can probably block in a variety of
> >>>>>places afaict... it might have to read in the inode extent map, allocate
> >>>>>blocks, take inode/ag locks, reserve log space for transactions, etc.
> >>>>Any chance of changing all that to be asynchronous? Doesn't sound too hard,
> >>>>if somebody else has to do it.
> >>>>
> >>>I'm not following... if the fs needs to read in the inode extent map to
> >>>prepare for an allocation, what else can the thread do but wait? Are you
> >>>suggesting the request kick off whatever the blocking action happens to
> >>>be asynchronously and return with an error such that the request can be
> >>>retried later?
> >>Not quite, it should be invisible to the caller.
> >>
> >>That is, the code called by io_submit() (file_operations::write_iter, it
> >>seems to be called today) can kick off this operation and have it continue
> >>from where it left off.
> >>
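For concreteness, here is a minimal sketch of the submission path under
discussion as seen from the application side: a single libaio O_DIRECT
write, where the whole concern is what may sleep inside the io_submit()
call itself. The file name, buffer size, queue depth and error handling
are made up for illustration (build with -laio); this is not code from
either side of the thread.

#define _GNU_SOURCE             /* O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

#define BUF_SZ  (128 * 1024)    /* 128kB buffers, as in the workload described below */

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd;

        if (io_setup(64, &ctx) < 0)
                errx(1, "io_setup failed");

        fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
                err(1, "open");

        if (posix_memalign(&buf, 4096, BUF_SZ))
                errx(1, "posix_memalign failed");
        memset(buf, 0, BUF_SZ);

        /*
         * This is the call that must not sleep in the one-thread-per-core
         * model: if the filesystem has to read an extent map, allocate
         * blocks, take AG locks or wait on log space here, the whole
         * submitting core stalls.
         */
        io_prep_pwrite(&cb, fd, buf, BUF_SZ, 0);
        if (io_submit(ctx, 1, cbs) != 1)
                errx(1, "io_submit failed");

        /* Reap the completion (normally driven from the shard's event loop). */
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                errx(1, "io_getevents failed");

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
}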
> >Isn't that generally what happens today?
> 
> You tell me. According to $subject, apparently not enough. Maybe we're
> triggering it more often, or we suffer more when it does trigger (the latter
> probably more likely).
> 

The original mail describes looking at the sched:sched_switch tracepoint
which, on a quick look, appears to fire whenever a cpu context switch
occurs. This likely triggers any time we wait on an I/O or a contended
lock (among other situations, I'm sure), and it signifies that something
else is going to execute in our place until this thread can make
progress.

> > We submit an I/O which is
> >asynchronous in nature and wait on a completion, which causes the cpu to
> >schedule and execute another task until the completion is set by I/O
> >completion (via an async callback). At that point, the issuing thread
> >continues where it left off. I suspect I'm missing something... can you
> >elaborate on what you'd do differently here (and how it helps)?
> 
> Just apply the same technique everywhere: convert locks to trylock +
> schedule a continuation on failure.
> 

I'm certainly not an expert on the kernel scheduling, locking and
serialization mechanisms, but my understanding is that most things
outside of spin locks are reschedule points. For example, the
wait_for_completion() calls XFS uses to wait on I/O boil down to
schedule_timeout() calls. Buffer locks are implemented as semaphores and
down() can end up in the same place.

Brian

> >
> >>Seastar (the async user framework which we use to drive xfs) makes writing
> >>code like this easy, using continuations; but of course from ordinary
> >>threaded code it can be quite hard.
> >>
> >>btw, there was an attempt to make ext[34] async using this method, but I
> >>think it was ripped out. Yes, the mortal remains can still be seen with
> >>'git grep EIOCBQUEUED'.
> >>
> >>>>>It sounds to me that first and foremost you want to make sure you don't
> >>>>>have however many parallel operations you typically have running
> >>>>>contending on the same inodes or AGs. Hint: creating files under
> >>>>>separate subdirectories is a quick and easy way to allocate inodes under
> >>>>>separate AGs (the agno is encoded into the upper bits of the inode
> >>>>>number).
> >>>>Unfortunately our directory layout cannot be changed. And doesn't this
> >>>>require having agcount == O(number of active files)? That is easily in the
> >>>>thousands.
> >>>>
> >>>I think Glauber's O(nr_cpus) comment is probably the more likely
> >>>ballpark, but really it's something you'll probably just need to test to
> >>>see how far you need to go to avoid AG contention.
> >>>
> >>>I'm primarily throwing the subdir thing out there for testing purposes.
> >>>It's just an easy way to create inodes in a bunch of separate AGs so you
> >>>can determine whether/how much it really helps with modified AG counts.
> >>>I don't know enough about your application design to really comment on
> >>>that...
> >>We have O(cpus) shards that operate independently. Each shard writes 32MB
> >>commitlog files (that are pre-truncated to 32MB to allow concurrent writes
> >>without blocking); the files are then flushed and closed, and later removed.
> >>In parallel there are sequential writes and reads of large files using 128kB
> >>buffers, as well as random reads. Files are immutable (append-only), and
> >>if a file is being written, it is not concurrently read. In general files
> >>are not shared across shards. All I/O is async and O_DIRECT. open(),
> >>truncate(), fdatasync(), and friends are called from a helper thread.
> >>
> >>As far as I can tell it should be a very friendly load for XFS and SSDs.
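To make the subdirectory hint above concrete, here is a minimal,
hypothetical sketch of a per-shard directory layout for testing; the
shard count and names are invented, and the 32MB pre-truncation just
mirrors the commitlog description above.

#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define NR_SHARDS       8       /* e.g. O(cpus), per the discussion above */

int main(void)
{
        char path[64];
        int i, fd;

        for (i = 0; i < NR_SHARDS; i++) {
                /*
                 * One directory per shard: as noted above, files created
                 * under separate subdirectories are a quick way to get
                 * inodes allocated in separate AGs.
                 */
                snprintf(path, sizeof(path), "shard-%d", i);
                mkdir(path, 0755);      /* EEXIST is fine */

                snprintf(path, sizeof(path), "shard-%d/commitlog-0", i);
                fd = open(path, O_WRONLY | O_CREAT, 0644);
                if (fd < 0) {
                        perror("open");
                        continue;
                }

                /* Pre-truncate to 32MB, as described above. */
                if (ftruncate(fd, 32 * 1024 * 1024) < 0)
                        perror("ftruncate");
                close(fd);
        }
        return 0;
}

Whether this actually spreads the inodes across enough AGs is easy to
check from the inode numbers themselves, since the agno is encoded in
the upper bits as mentioned above.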
> >>
> >>>>> Reducing the frequency of block allocation/frees might also be
> >>>>>another help (e.g., preallocate and reuse files,
> >>>>Isn't that discouraged for SSDs?
> >>>>
> >>>Perhaps, if you're referring to the fact that the blocks are never freed
> >>>and thus never discarded..? Are you running fstrim?
> >>mount -o discard. And yes, overwrites are supposedly more expensive than
> >>trim old data + allocate new data, but maybe if you compare it with the work
> >>XFS has to do, perhaps the tradeoff is bad.
> >>
> >Ok, my understanding is that '-o discard' is not recommended in favor of
> >periodic fstrim for performance reasons, but that may or may not still
> >be the case.
> 
> I understand that most SSDs have queued trim these days, but maybe I'm
> optimistic.
> 
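For reference, the periodic fstrim alternative mentioned above comes
down to the FITRIM ioctl that fstrim(8) issues against the mount point.
A minimal, hypothetical example follows; the mount point path is made
up.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FITRIM, struct fstrim_range */

int main(void)
{
        struct fstrim_range range = {
                .start = 0,
                .len = UINT64_MAX,      /* trim the whole filesystem */
                .minlen = 0,
        };
        int fd = open("/mnt/xfs", O_RDONLY);    /* hypothetical mount point */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ioctl(fd, FITRIM, &range) < 0) {
                perror("FITRIM");
        } else {
                /* On return, len is updated to the number of bytes trimmed. */
                printf("trimmed %llu bytes\n", (unsigned long long)range.len);
        }

        close(fd);
        return 0;
}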