On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> > I'd say the first thing to fix is the ordering problem on the
> > completion queue. XFS needs more than just offset based ordering, it
> > needs ioend type based ordering, too. All the size updates should be
> > processed before the unwritten extent conversions, hence removing
> > the nesting of transactions....
> 
> Generally speaking this makes sense if we need to retain the
> preallocated size update transactions for some reason. One thing to
> note is that we'd be putting on-disk size updates ahead of other ioend
> completions on background writeback. I suspect that might not matter
> much for unwritten ioends since we'd just expose unwritten extents
> after a crash, but the effect on cow writeback completion is not as
> clear to me.

The old code (pre-queue+merge) ran completions in random order (i.e.
whatever order the workqueue and scheduler produced), so I don't
think this is a concern.

My concern is that this is a fairly major change of behaviour for IO
completion, we know this is a performance-sensitive path, and we
shouldn't make sweeping architectural changes without a set of
performance regression tests demonstrating that the change does not
have unexpected, adverse side effects.

Hence my suggestion that we fix the deadlock first, then do the
investigation work to determine whether removing the pre-IO
transaction reservation is a benefit or not. We often do the simple,
low-risk bug fix first, then the more extensive rework that may have
unexpected side effects, so that we have a less risky backport
candidate for stable distros. I don't see this case as any different.

> For one, it looks like a cow ioend can both require a transaction
> allocation for fork remap as well as have an append transaction
> already attached, so we'd probably have to tweak how individual
> ioends are processed as opposed to just ordering them differently. I
> also thought cow blocks don't necessarily have to cover shared (or
> even existing) blocks in the data fork due to preallocation, so we'd
> probably need to investigate things like whether this makes it
> possible to put an on-disk update ahead of a cow remap that lands
> somewhere in the range between the in-core inode size and the
> (smaller) on-disk inode size, and if so, whether that could result
> in problematic behavior.

These things all used to happen before we (recently) added all the
per-inode queuing and merging code to IO completion, so there's no
real evidence that completing them in different orders is a problem.
If someone runs fsync, they'll all get completed before fsync
returns, so AFAICT this is just unnecessary complication of a
relatively simple mechanism.

> I'm not sure this is worth the associated complexity if there's
> opportunity to remove the need for most of these transactions in the
> first place.

Hm? Please keep in mind that I'm not saying "we can't remove this
code", as Christoph has implied. Last time this came up, like the
"use unwritten extents for delalloc" proposal, I asked for
performance/benchmark results showing that the change doesn't
introduce excessive overhead or unexpected regressions. Again, I'm
asking for that work to be done *before* we make a significant change
in behaviour, because that change in behaviour is not necessary to
fix the bug.
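FWIW, to make the "size updates before unwritten conversions"
ordering concrete, here's a minimal sketch of what a type-aware sort
comparator could look like. This is hypothetical, not the current
code - iomap_sort_ioends() today orders purely by io_offset, and the
function name and exact list_sort() comparator prototype here are
assumptions:

	/*
	 * Hypothetical list_sort() comparator: complete plain
	 * overwrite ioends (which may only need an on-disk size
	 * update) before unwritten extent conversions, then order by
	 * file offset within each class.
	 */
	static int
	xfs_ioend_type_compare(
		void			*priv,
		struct list_head	*a,
		struct list_head	*b)
	{
		struct iomap_ioend	*ia;
		struct iomap_ioend	*ib;

		ia = container_of(a, struct iomap_ioend, io_list);
		ib = container_of(b, struct iomap_ioend, io_list);

		/* non-unwritten ioends sort first */
		if ((ia->io_type == IOMAP_UNWRITTEN) !=
		    (ib->io_type == IOMAP_UNWRITTEN))
			return ia->io_type == IOMAP_UNWRITTEN ? 1 : -1;

		if (ia->io_offset < ib->io_offset)
			return -1;
		if (ia->io_offset > ib->io_offset)
			return 1;
		return 0;
	}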
> @@ -182,12 +185,10 @@ xfs_end_ioend(
>  		error = xfs_reflink_end_cow(ip, offset, size);
>  	else if (ioend->io_type == IOMAP_UNWRITTEN)
>  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> -	else
> -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
>  
>  done:
> -	if (ioend->io_private)
> -		error = xfs_setfilesize_ioend(ioend, error);
> +	if (ioend->io_flags & IOMAP_F_APPEND)
> +		error = xfs_setfilesize(ip, offset, size);
>  	iomap_finish_ioends(ioend, error);
>  	memalloc_nofs_restore(nofs_flag);
>  }
> @@ -221,16 +222,28 @@ xfs_end_io(
>  	struct iomap_ioend	*ioend;
>  	struct list_head	tmp;
>  	unsigned long		flags;
> +	xfs_off_t		maxendoff;
>  
>  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
>  	list_replace_init(&ip->i_ioend_list, &tmp);
>  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
>  
>  	iomap_sort_ioends(&tmp);
> +
> +	/* XXX: track max endoff manually? */
> +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> +	    xfs_ioend_is_append(ioend)) {
> +		ioend->io_flags |= IOMAP_F_APPEND;
> +		maxendoff = ioend->io_offset + ioend->io_size;
> +	}
> +
>  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
>  			io_list))) {
>  		list_del_init(&ioend->io_list);
>  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
>  		xfs_end_ioend(ioend);
>  	}
>  }

So now when I run a workload that is untarring a large tarball full
of small files, we have as many transaction reserve operations
running concurrently as there are IO completions queued. Right now,
the single-threaded writeback (bdi-flusher) runs reservations
serially, so we are introducing a large amount of concurrency to
reservations here.

IOWs, instead of throttling active reservations before we submit the
IO, we end up with attempted reservations bounded only by the number
of kworkers the completion workqueue can throw at the system. Then we
might have tens of active filesystems doing the same thing, each with
their own set of workqueues and kworkers...

Yes, we can make "lots of IO to a single inode" have less overhead,
but we do not want to do that at the expense of bad behaviour when we
have "single IOs to lots of inodes". That's the concern I have here,
and characterising and measuring that impact is going to take a fair
amount of work, especially on large machines with slow disks...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
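P.S. For reference, the serialisation being traded away here comes
from the append transaction being reserved at IO submission time,
from the writeback context. A rough sketch of that shape, simplified
from xfs_setfilesize_trans_alloc() in fs/xfs/xfs_aops.c (the
freeze-protection/lockdep handling is elided, so treat the details
as assumptions):

	/*
	 * Reserve the file size update transaction at submit time.
	 * Because this runs in the writeback (bdi-flusher) context,
	 * a single thread serialises these reservations across all
	 * the ioends it submits; completion then only has to commit
	 * the already-reserved transaction stashed in io_private.
	 */
	static int
	xfs_setfilesize_trans_alloc(
		struct iomap_ioend	*ioend)
	{
		struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
		struct xfs_trans	*tp;
		int			error;

		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts,
					0, 0, 0, &tp);
		if (error)
			return error;

		/* hand the transaction off to IO completion */
		ioend->io_private = tp;
		return 0;
	}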