On Tue, Jan 31, 2017 at 06:36:17PM -0800, Darrick J. Wong wrote: > Christoph Hellwig pointed out that there's a potentially nasty race when > performing simultaneous nearby directio cow writes: > > "Thread 1 writes a range from B to c > > " B --------- C > p > > "a little later thread 2 writes from A to B > > " A --------- B > p > > [editor's note: the 'p' denote cowextsize boundaries, which I added to > make this more clear] > > "but the code preallocates beyond B into the range where thread > "1 has just written, but ->end_io hasn't been called yet. I'm confused by the description. Wouldn't thread 1 have already allocated from B-C in the COW fork in order to send a write I/O? How could thread 2 then "preallocate into a range" of the COW fork that has been allocated by thread 1? (Also, wouldn't the ioend offset+size limit a COW remap to the range that is covered by the completed write?) Brian > "But once ->end_io is called thread 2 has already allocated > "up to the extent size hint into the write range of thread 1, > "so the end_io handler will splice the unintialized blocks from > "that preallocation back into the file right after B." > > We can avoid this race by ensuring that thread 1 cannot accidentally > remap the blocks that thread 2 allocated (as part of speculative > preallocation) as part of t2's write preparation in t1's end_io handler. > The way we make this happen is by taking advantage of the unwritten > extent flag as an intermediate step. > > Recall that when we begin the process of writing data to shared blocks, > we create a delayed allocation extent in the CoW fork: > > D: --RRRRRRSSSRRRRRRRR--- > C: ------DDDDDDD--------- > > When a thread prepares to CoW some dirty data out to disk, it will now > convert the delalloc reservation into an /unwritten/ allocated extent in > the cow fork. The da conversion code tries to opportunistically > allocate as much of a (speculatively prealloc'd) extent as possible, so > we may end up allocating a larger extent than we're actually writing > out: > > D: --RRRRRRSSSRRRRRRRR--- > U: ------UUUUUUU--------- > > Next, we convert only the part of the extent that we're actively > planning to write to normal (i.e. not unwritten) status: > > D: --RRRRRRSSSRRRRRRRR--- > U: ------UURRUUU--------- > > If the write succeeds, the end_cow function will now scan the relevant > range of the CoW fork for real extents and remap only the real extents > into the data fork: > > D: --RRRRRRRRSRRRRRRRR--- > U: ------UU--UUU--------- > > This ensures that we never obliterate valid data fork extents with > unwritten blocks from the CoW fork. > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > --- > v2: We don't update anything on disk for the CoW unwritten extent > conversion, so don't allocate a transaction. > --- > fs/xfs/xfs_aops.c | 6 +++ > fs/xfs/xfs_iomap.c | 2 - > fs/xfs/xfs_reflink.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++--- > fs/xfs/xfs_reflink.h | 2 + > fs/xfs/xfs_trace.h | 8 +++ > 5 files changed, 125 insertions(+), 10 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 631e7c0..1ff9df7 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -481,6 +481,12 @@ xfs_submit_ioend( > struct xfs_ioend *ioend, > int status) > { > + /* Convert CoW extents to regular */ > + if (!status && ioend->io_type == XFS_IO_COW) { > + status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode), > + ioend->io_offset, ioend->io_size); > + } > + > /* Reserve log space if we might write beyond the on-disk inode size. */ > if (!status && > ioend->io_type != XFS_IO_UNWRITTEN && > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 1aa3abd..767208f 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -685,7 +685,7 @@ xfs_iomap_write_allocate( > int nres; > > if (whichfork == XFS_COW_FORK) > - flags |= XFS_BMAPI_COWFORK; > + flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC; > > /* > * Make sure that the dquots are there. > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c > index 07593a3..a2e1fff 100644 > --- a/fs/xfs/xfs_reflink.c > +++ b/fs/xfs/xfs_reflink.c > @@ -82,11 +82,22 @@ > * mappings are a reservation against the free space in the filesystem; > * adjacent mappings can also be combined into fewer larger mappings. > * > + * As an optimization, the CoW extent size hint (cowextsz) creates > + * outsized aligned delalloc reservations in the hope of landing out of > + * order nearby CoW writes in a single extent on disk, thereby reducing > + * fragmentation and improving future performance. > + * > + * D: --RRRRRRSSSRRRRRRRR--- (data fork) > + * C: ------DDDDDDD--------- (CoW fork) > + * > * When dirty pages are being written out (typically in writepage), the > - * delalloc reservations are converted into real mappings by allocating > - * blocks and replacing the delalloc mapping with real ones. A delalloc > - * mapping can be replaced by several real ones if the free space is > - * fragmented. > + * delalloc reservations are converted into unwritten mappings by > + * allocating blocks and replacing the delalloc mapping with real ones. > + * A delalloc mapping can be replaced by several unwritten ones if the > + * free space is fragmented. > + * > + * D: --RRRRRRSSSRRRRRRRR--- > + * C: ------UUUUUUU--------- > * > * We want to adapt the delalloc mechanism for copy-on-write, since the > * write paths are similar. The first two steps (creating the reservation > @@ -101,13 +112,29 @@ > * Block-aligned directio writes will use the same mechanism as buffered > * writes. > * > + * Just prior to submitting the actual disk write requests, we convert > + * the extents representing the range of the file actually being written > + * (as opposed to extra pieces created for the cowextsize hint) to real > + * extents. This will become important in the next step: > + * > + * D: --RRRRRRSSSRRRRRRRR--- > + * C: ------UUrrUUU--------- > + * > * CoW remapping must be done after the data block write completes, > * because we don't want to destroy the old data fork map until we're sure > * the new block has been written. Since the new mappings are kept in a > * separate fork, we can simply iterate these mappings to find the ones > * that cover the file blocks that we just CoW'd. For each extent, simply > * unmap the corresponding range in the data fork, map the new range into > - * the data fork, and remove the extent from the CoW fork. > + * the data fork, and remove the extent from the CoW fork. Because of > + * the presence of the cowextsize hint, however, we must be careful > + * only to remap the blocks that we've actually written out -- we must > + * never remap delalloc reservations nor CoW staging blocks that have > + * yet to be written. This corresponds exactly to the real extents in > + * the CoW fork: > + * > + * D: --RRRRRRrrSRRRRRRRR--- > + * C: ------UU--UUU--------- > * > * Since the remapping operation can be applied to an arbitrary file > * range, we record the need for the remap step as a flag in the ioend > @@ -296,6 +323,66 @@ xfs_reflink_reserve_cow( > return 0; > } > > +/* Convert part of an unwritten CoW extent to a real one. */ > +STATIC int > +__xfs_reflink_convert_cow( > + struct xfs_inode *ip, > + struct xfs_bmbt_irec *imap, > + xfs_fileoff_t offset_fsb, > + xfs_filblks_t count_fsb, > + struct xfs_defer_ops *dfops) > +{ > + struct xfs_bmbt_irec irec = *imap; > + xfs_fsblock_t first_block; > + int nimaps = 1; > + > + xfs_trim_extent(&irec, offset_fsb, count_fsb); > + trace_xfs_reflink_convert_cow(ip, &irec); > + if (irec.br_blockcount == 0) > + return 0; > + return xfs_bmapi_write(NULL, ip, irec.br_startoff, irec.br_blockcount, > + XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT, &first_block, > + 0, &irec, &nimaps, dfops); > +} > + > +/* Convert all of the unwritten CoW extents in a file's range to real ones. */ > +int > +xfs_reflink_convert_cow( > + struct xfs_inode *ip, > + xfs_off_t offset, > + xfs_off_t count) > +{ > + struct xfs_bmbt_irec got; > + struct xfs_defer_ops dfops; > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK); > + xfs_fileoff_t offset_fsb; > + xfs_fileoff_t end_fsb; > + xfs_extnum_t idx; > + bool found; > + int error; > + > + offset_fsb = XFS_B_TO_FSBT(mp, offset); > + end_fsb = XFS_B_TO_FSB(mp, offset + count); > + xfs_ilock(ip, XFS_ILOCK_EXCL); > + > + /* Convert all the extents to real from unwritten. */ > + for (found = xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got); > + found && got.br_startoff < end_fsb; > + found = xfs_iext_get_extent(ifp, ++idx, &got)) { > + if (got.br_state == XFS_EXT_NORM) > + continue; > + error = __xfs_reflink_convert_cow(ip, &got, offset_fsb, > + end_fsb - offset_fsb, &dfops); > + if (error) > + break; > + } > + > + /* Finish up. */ > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > + return error; > +} > + > /* Allocate all CoW reservations covering a range of blocks in a file. */ > static int > __xfs_reflink_allocate_cow( > @@ -328,6 +415,7 @@ __xfs_reflink_allocate_cow( > goto out_unlock; > ASSERT(nimaps == 1); > > + /* Make sure there's a CoW reservation for it. */ > error = xfs_reflink_reserve_cow(ip, &imap, &shared); > if (error) > goto out_trans_cancel; > @@ -337,14 +425,16 @@ __xfs_reflink_allocate_cow( > goto out_trans_cancel; > } > > + /* Allocate the entire reservation as unwritten blocks. */ > xfs_trans_ijoin(tp, ip, 0); > error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount, > - XFS_BMAPI_COWFORK, &first_block, > + XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, &first_block, > XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), > &imap, &nimaps, &dfops); > if (error) > goto out_trans_cancel; > > + /* Finish up. */ > error = xfs_defer_finish(&tp, &dfops, NULL); > if (error) > goto out_trans_cancel; > @@ -389,10 +479,13 @@ xfs_reflink_allocate_cow_range( > if (error) { > trace_xfs_reflink_allocate_cow_range_error(ip, error, > _RET_IP_); > - break; > + goto out; > } > } > > + /* Convert the CoW extents to regular. */ > + error = xfs_reflink_convert_cow(ip, offset, count); > +out: > return error; > } > > @@ -641,6 +734,16 @@ xfs_reflink_end_cow( > > ASSERT(!isnullstartblock(got.br_startblock)); > > + /* > + * Don't remap unwritten extents; these are > + * speculatively preallocated CoW extents that have been > + * allocated but have not yet been involved in a write. > + */ > + if (got.br_state == XFS_EXT_UNWRITTEN) { > + idx--; > + goto next_extent; > + } > + > /* Unmap the old blocks in the data fork. */ > xfs_defer_init(&dfops, &firstfsb); > rlen = del.br_blockcount; > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h > index aa6a4d6..1583c47 100644 > --- a/fs/xfs/xfs_reflink.h > +++ b/fs/xfs/xfs_reflink.h > @@ -30,6 +30,8 @@ extern int xfs_reflink_reserve_cow(struct xfs_inode *ip, > struct xfs_bmbt_irec *imap, bool *shared); > extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip, > xfs_off_t offset, xfs_off_t count); > +extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset, > + xfs_off_t count); > extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset, > struct xfs_bmbt_irec *imap); > extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip, > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index 69c5bcd..d3d11905 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -3089,6 +3089,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class, > __field(xfs_fileoff_t, lblk) > __field(xfs_extlen_t, len) > __field(xfs_fsblock_t, pblk) > + __field(int, state) > ), > TP_fast_assign( > __entry->dev = VFS_I(ip)->i_sb->s_dev; > @@ -3096,13 +3097,15 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class, > __entry->lblk = irec->br_startoff; > __entry->len = irec->br_blockcount; > __entry->pblk = irec->br_startblock; > + __entry->state = irec->br_state; > ), > - TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu", > + TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu st %d", > MAJOR(__entry->dev), MINOR(__entry->dev), > __entry->ino, > __entry->lblk, > __entry->len, > - __entry->pblk) > + __entry->pblk, > + __entry->state) > ); > #define DEFINE_INODE_IREC_EVENT(name) \ > DEFINE_EVENT(xfs_inode_irec_class, name, \ > @@ -3242,6 +3245,7 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared); > DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc); > DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found); > DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc); > +DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow); > > DEFINE_RW_EVENT(xfs_reflink_reserve_cow); > DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range); > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html