On Tue, Mar 30, 2021 at 10:42:48AM -0400, Brian Foster wrote:
> On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> > On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> ...
> > > @@ -182,12 +185,10 @@ xfs_end_ioend(
> > >  		error = xfs_reflink_end_cow(ip, offset, size);
> > >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> > >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > > -	else
> > > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> > >  
> > >  done:
> > > -	if (ioend->io_private)
> > > -		error = xfs_setfilesize_ioend(ioend, error);
> > > +	if (ioend->io_flags & IOMAP_F_APPEND)
> > > +		error = xfs_setfilesize(ip, offset, size);
> > >  	iomap_finish_ioends(ioend, error);
> > >  	memalloc_nofs_restore(nofs_flag);
> > >  }
> > > @@ -221,16 +222,28 @@ xfs_end_io(
> > >  	struct iomap_ioend	*ioend;
> > >  	struct list_head	tmp;
> > >  	unsigned long		flags;
> > > +	xfs_off_t		maxendoff;
> > >  
> > >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> > >  	list_replace_init(&ip->i_ioend_list, &tmp);
> > >  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> > >  
> > >  	iomap_sort_ioends(&tmp);
> > > +
> > > +	/* XXX: track max endoff manually? */
> > > +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > > +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > > +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > > +	    xfs_ioend_is_append(ioend)) {
> > > +		ioend->io_flags |= IOMAP_F_APPEND;
> > > +		maxendoff = ioend->io_offset + ioend->io_size;
> > > +	}
> > > +
> > >  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> > >  			io_list))) {
> > >  		list_del_init(&ioend->io_list);
> > >  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > > +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> > >  		xfs_end_ioend(ioend);
> > >  	}
> > >  }
> > 
> > So now when I run a workload that is untarring a large tarball full
> > of small files, we have as many transaction reserve operations
> > running concurrently as there are IO completions queued.
> 
> This patch has pretty much no effect on a typical untar workload because
> it is dominated by delalloc -> unwritten extent conversions that already
> require completion time transaction reservations. Indeed, a quick test
> to untar a kernel source tree produces no setfilesize events at all.

Great, so it's not the obvious case because of the previous
delalloc->unwritten change. What you need to do now is find out which
workload is generating so many setfilesize completions despite
delalloc->unwritten, so we can understand what workloads this change
will actually impact.

> I'm not sure we have many situations upstream where append transactions
> are used outside of perhaps cow completions (which already have a
> completion time transaction allocation for fork remaps) or intra-block
> file extending writes (that thus produce an inode size change within a
> mapped, already converted block). Otherwise a truncate down should
> always remove post-eof blocks, and speculative prealloc originates from
> delalloc, so afaict those should follow the same general sequence.

Eh?

> As it is, I think the performance concern is overstated, but I'm happy
> to run any tests to confirm or deny that if you want to make more
> concrete suggestions.

As I said: I'm happy to change the code as long as we understand what
workloads it will impact and by how much.
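
To illustrate the intra-block extending write case you mention: a
reproducer along these lines should still hit the setfilesize
completion path even with delalloc->unwritten conversion in place,
because only the first write to each block needs conversion -
subsequent extending writes land in an already-written block and only
move EOF. This is just an untested sketch; the scratch file path and
the write/block sizes are arbitrary:

/*
 * Sketch: repeatedly extend a file with sub-block appends, forcing
 * writeback after each one. With 512 byte appends into 4k blocks,
 * the first append to each block should go through delalloc ->
 * unwritten conversion at completion, while the following appends
 * into the same (now converted) block should each queue an ioend
 * that only needs a file size update, i.e. the setfilesize path.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char	buf[512];
	int	fd, i;

	memset(buf, 'x', sizeof(buf));

	/* hypothetical scratch file on an XFS mount */
	fd = open("/mnt/scratch/appendfile",
		  O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < 1024; i++) {
		if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write");
			exit(1);
		}
		/* force writeback and an ioend completion per append */
		if (fsync(fd) < 0) {
			perror("fsync");
			exit(1);
		}
	}
	close(fd);
	return 0;
}

Whether that sort of pattern shows up in real workloads often enough
to matter is exactly the question, though.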
We don't know what workload is generating so many setfilesize
transactions yet, so we can't make educated guesses about the wider
impact this change will have. We also don't have numbers from typical
workloads, just one data point, so nobody actually knows what the
impact is.

> This patch is easy to test and pretty much survived an overnight
> regression run (outside of one or two things I have to look into...).
> I'm happy to adjust the approach from there, but I also think that if
> it proves necessary there are fallback options (based around the
> original suggestion in my first mail) to preserve the current
> submission time (serial) append transaction reservation that don't
> require categorizing/splitting or partially processing the pending
> ioend list.

IO path behaviour changes require more than functional regression
tests. A certain amount of documented performance regression testing
is required, too. The argument being made here is that "this won't
affect performance", so all I'm asking for is the evidence that shows
this assertion to be true.

This is part of the reason I suggested breaking this up into separate
bug fix and removal patches - a pure bug fix doesn't need performance
regression testing. Further, keeping the bug fix separate from the
behaviour change mitigates the risk of finding an unexpected
performance regression when the behaviour changes. Combining the bug
fix with a significant change of behaviour doesn't leave us a simple
way of addressing such a regression...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx