On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote: > On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote: > > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote: ... > > > @@ -182,12 +185,10 @@ xfs_end_ioend( > > error = xfs_reflink_end_cow(ip, offset, size); > > else if (ioend->io_type == IOMAP_UNWRITTEN) > > error = xfs_iomap_write_unwritten(ip, offset, size, false); > > - else > > - ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private); > > > > done: > > - if (ioend->io_private) > > - error = xfs_setfilesize_ioend(ioend, error); > > + if (ioend->io_flags & IOMAP_F_APPEND) > > + error = xfs_setfilesize(ip, offset, size); > > iomap_finish_ioends(ioend, error); > > memalloc_nofs_restore(nofs_flag); > > } > > @@ -221,16 +222,28 @@ xfs_end_io( > > struct iomap_ioend *ioend; > > struct list_head tmp; > > unsigned long flags; > > + xfs_off_t maxendoff; > > > > spin_lock_irqsave(&ip->i_ioend_lock, flags); > > list_replace_init(&ip->i_ioend_list, &tmp); > > spin_unlock_irqrestore(&ip->i_ioend_lock, flags); > > > > iomap_sort_ioends(&tmp); > > + > > + /* XXX: track max endoff manually? */ > > + ioend = list_last_entry(&tmp, struct iomap_ioend, io_list); > > + if (((ioend->io_flags & IOMAP_F_SHARED) || > > + (ioend->io_type != IOMAP_UNWRITTEN)) && > > + xfs_ioend_is_append(ioend)) { > > + ioend->io_flags |= IOMAP_F_APPEND; > > + maxendoff = ioend->io_offset + ioend->io_size; > > + } > > + > > while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend, > > io_list))) { > > list_del_init(&ioend->io_list); > > iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private); > > + ASSERT(ioend->io_offset + ioend->io_size <= maxendoff); > > xfs_end_ioend(ioend); > > } > > } > > So now when I run a workload that is untarring a large tarball full > of small files, we have as many transaction reserve operations > runnning concurrently as there are IO completions queued. > This patch has pretty much no effect on a typical untar workload because it is dominated by delalloc -> unwritten extent conversions that already require completion time transaction reservations. Indeed, a quick test to untar a kernel source tree produces no setfilesize events at all. I'm not sure we have many situations upstream where append transactions are used outside of perhaps cow completions (which already have a completion time transaction allocation for fork remaps) or intra-block file extending writes (that thus produce an inode size change within a mapped, already converted block). Otherwise a truncate down should always remove post-eof blocks and speculative prealloc originates from delalloc, so afaict those should follow the same general sequence. Eh? As it is, I think the performance concern is overstated but I'm happy to run any tests to confirm or deny that if you want to make more concrete suggestions. This patch is easy to test and pretty much survived an overnight regression run (outside of one or two things I have to look into..). I'm happy to adjust the approach from there, but I also think if it proves necessary there are fallback options (based around the original suggestion in my first mail) to preserve current submission time (serial) append transaction reservation that don't require to categorize/split or partially process the pending ioend list. Brian > Right now, we the single threaded writeback (bdi-flusher) runs > reservations serially, so we are introducing a large amount of > concurrency to reservations here. IOWs, instead of throttling active > reservations before we submit the IO we end up with attempted > reservations only being bounded by the number of kworkers the > completion workqueue can throw at the system. Then we might have > tens of active filesystems doing the same thing, each with their own > set of workqueues and kworkers... > > Yes, we can make "lots of IO to a single inode" have less overhead, > but we do not want to do that at the expense of bad behaviour when > we have "single IOs to lots of inodes". That's the concern I have > here, and that's going to take a fair amount of work to characterise > and measure the impact, especially on large machines with slow > disks... > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >