On 1/30/25 3:43 PM, Olga Kornievskaia wrote: > On Mon, Jan 13, 2025 at 9:50 AM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote: >> >> On 1/4/25 5:00 PM, Chuck Lever wrote: >>> On 12/30/24 10:33 AM, Chuck Lever wrote: >>>> On 12/29/24 8:52 PM, Trond Myklebust wrote: >>>>> On Sun, 2024-12-29 at 17:37 -0500, Chuck Lever wrote: >>>>>> On 12/19/24 3:15 PM, Chuck Lever wrote: >>>>>>> On 12/18/24 4:02 PM, Chuck Lever wrote: >>>>>>>> Hi - >>>>>>>> >>>>>>>> I'm testing the NFSD support for attribute delegation, and seeing >>>>>>>> these >>>>>>>> two new fstests failures: generic/647 and generic/729. Both tests >>>>>>>> emit >>>>>>>> this error message: >>>>>>>> >>>>>>>> mmap-rw-fault: pread /media/test/mmap-rw-fault.tmp (O_DIRECT): >>>>>>>> 0 != >>>>>>>> 4096: Bad address >>>>>>>> >>>>>>>> This is 100% reproducible with the new patches applied to the >>>>>>>> server, >>>>>>>> and 100% not reproducible when they are not applied on the >>>>>>>> server. >>>>>>>> >>>>>>>> The failure is due to pread64() (on the client) returning EFAULT. >>>>>>>> On >>>>>>>> the wire, the passing test does: >>>>>>>> >>>>>>>> SETATTR (size = 0) >>>>>>>> WRITE (offset = 4096, len = 4096) >>>>>>>> READ (offset = 0, len = 8192) >>>>>>>> READ (offset = 4096, len = 4096) >>>>>>>> SETATTR (size = 0) >>>>>>>> [ continues until test passes ] >>>>>>>> >>>>>>>> The failing test does: >>>>>>>> >>>>>>>> SETATTR (size = 0) >>>>>>>> WRITE (offset = 4096, len = 4096) >>>>>>>> [ the failed pread64 seems to occur here ] >>>>>>>> CLOSE >>>>>>>> >>>>>>>> In other words, in the failing case, the client does not emit >>>>>>>> READs >>>>>>>> to pull in the changed file content. >>>>>>>> >>>>>>>> The test is using O_DIRECT so I function-traced >>>>>>>> nfs_direct_read_schedule_iovec(). In the passing case, this >>>>>>>> function >>>>>>>> generates the usual set of NFS READs on the wire and returns >>>>>>>> successfully. >>>>>>>> >>>>>>>> In the failing case, iov_iter_get_pages_alloc2() invokes >>>>>>>> get_user_pages_fast(), and that appears to fail immediately: >>>>>>>> >>>>>>>> mmap-rw-fault-623256 [016] 175303.310394: >>>>>>>> funcgraph_entry: >>>>>>>>> get_user_pages_fast() { >>>>>>>> mmap-rw-fault-623256 [016] 175303.310395: >>>>>>>> funcgraph_entry: >>>>>>>>> gup_fast_fallback() { >>>>>>>> mmap-rw-fault-623256 [016] 175303.310395: funcgraph_entry: >>>>>>>> 0.262 >>>>>>>> us | __pte_offset_map(); >>>>>>>> mmap-rw-fault-623256 [016] 175303.310395: funcgraph_entry: >>>>>>>> 0.142 >>>>>>>> us | __rcu_read_unlock(); >>>>>>>> mmap-rw-fault-623256 [016] 175303.310396: funcgraph_entry: >>>>>>>> 7.824 >>>>>>>> us | __gup_longterm_locked(); >>>>>>>> mmap-rw-fault-623256 [016] 175303.310404: funcgraph_exit: >>>>>>>> 8.967 us >>>>>>>>> } >>>>>>>> mmap-rw-fault-623256 [016] 175303.310404: funcgraph_exit: >>>>>>>> 9.224 us >>>>>>>>> } >>>>>>>> mmap-rw-fault-623256 [016] 175303.310404: >>>>>>>> funcgraph_entry: >>>>>>>>> kvfree() { >>>>>>>> >>>>>>>> My guess is the cached inode file size is still zero. >>>>>>> >>>>>>> Confirmed: in the failing case, the read fails because the cached >>>>>>> file size is still zero. In the passing case, the cached file size >>>>>>> is >>>>>>> 8192 before the read. >>>>>>> >>>>>>> During the test, the client truncates the file, then performs an >>>>>>> NFS >>>>>>> WRITE to the server, extending the size of the file. When an >>>>>>> attribute >>>>>>> delegation is in effect, that size extension isn't reflected in the >>>>>>> cached value of i_size -- the client ensures that INVALID_SIZE is >>>>>>> always clear. >>>>>>> >>>>>>> But perhaps the NFS client is relying on the client's VFS to >>>>>>> maintain >>>>>>> i_size...? The NFS client has its own direct I/O implementation, so >>>>>>> perhaps an i_size update is missing there. >>>>>> >>>>>> Because the client never retrieves the file's size from the server >>>>>> during either the passing or failing cases, this appears to be a >>>>>> client >>>>>> bug. >>>>>> >>>>>> The bug is in nfs_writeback_update_inode() -- if mtime is delegated, >>>>>> it >>>>>> skips the file extension check, and the file size cached on the >>>>>> client >>>>>> remains zero after the WRITE completes. >>>>>> >>>>>> The culprit is commit e12912d94137 ("NFSv4: Add support for delegated >>>>>> atime and mtime attributes"). If I remove the hunk that this commit >>>>>> adds to nfs_writeback_update_inode(), both generic/647 and >>>>>> generic/729 >>>>>> pass. >>>>>> >>>>>> >>>>> >>>>> I'm confused... If O_DIRECT is set on open(), then the NFSv4.x (x>0) >>>>> client will set NFS4_SHARE_WANT_NO_DELEG. Furthermore, it should not >>>>> set either NFS4_SHARE_WANT_DELEG_TIMESTAMPS or >>>>> NFS4_SHARE_WANT_OPEN_XOR_DELEGATION. >>>> >>>> Examining wire captures... >>>> >>>> >>>> In the passing test (done with NFSv4.1 to the same server), there is >>>> indeed an OPEN with OPEN4_SHARE_ACCESS_WANT_NO_DELEG followed by the >>>> WRITE to offset 4096 for 4096 bytes. The client returns this OPEN state >>>> ID immediately (via CLOSE). >>>> >>>> Then an OPEN that returns both an OPEN state ID and a WRITE delegation. >>>> The client uses the delegation state ID for reading, enabling the test >>>> to pass. >>> >>> The above is not correct. Upon closer examination, the delegation state >>> ID is used for the direct WRITE in this case, even though an OPEN state >>> ID is available. >>> >>> But since nfs_have_delegated_mtime() returns false, >>> nfs_writeback_update_inode() proceeds to update the cached file size. >>> >>> >>>> There are three OPENs on the wire during the failing test. >>>> >>>> The first two set OPEN4_SHARE_ACCESS_WANT_NO_DELEG. For those, the >>>> server returns an OPEN stateid, delegation type OPEN_DELEGATE_NONE_EXT, >>>> and WND4_NOT_WANTED is set. >>>> >>>> The third OPEN appears to request any kind of open. The share_access >>>> field contains the raw value 00300003. The rightmost "3" is SHARE_BOTH. >>>> I assume the leftmost "3" means WANT_DELEG_TIMESTAMPS and OPEN_XOR; >>>> wireshark doesn't currently recognize those bits. >>>> >>>> NFSD returns an OPEN_DELEGATE_WRITE_ATTRS_DELEG in response to this >>>> request, with a delegation state ID and no OPEN state ID. >>>> >>>> The client uses this delegation state ID for subsequent write >>>> operations. The write completions fail to extend the cached file >>>> size due to the presence of the delegation. >>> >>> Here again the client presents the delegation state ID during the WRITE. >>> But since the write delegation also permits delegated time stamps, >>> nfs_writeback_update_inode() skips the file size update. >>> >>> In both cases, nfs4_select_rw_stateid() is choosing a delegation >>> state ID for a direct WRITE. In the this case, it's choosing the >>> delegation state ID because it has no OPEN state ID (due to >>> OPEN_XOR_DELEG being set). >>> >>> nfs4_map_atomic_open_share() seems to be selecting the wrong >>> bits to enable for this test... ? >>> >> >> The test application opens the target file without O_DIRECT, performs >> one or two chores, then closes the file. It then opens the same file >> with O_DIRECT. >> >> The problem arises because, on close(2), the client caches the open >> state acquired by the first (non-O_DIRECT) open(2). It emits neither a >> CLOSE nor a DELEGRETURN. >> >> For the subsequent O_DIRECT open(2), a fresh OPEN is not emitted. The >> client uses the cached open state for direct WRITEs -- in fact, it uses >> the delegation state ID from that cached open state. > > The use of a delegation stateid doesn't seem to be a problem for when > this test is run on 4.1. > > I think you had to correct before when the problem was with how the > client deals with updating the attributes (now that it got a delegated > attribute ability). > > The problematic chunk in the mentioned patch is the following: > > @@ -1514,6 +1516,13 @@ void nfs_writeback_update_inode(struct > nfs_pgio_header *hdr) > struct nfs_fattr *fattr = &hdr->fattr; > struct inode *inode = hdr->inode; > > + if (nfs_have_delegated_mtime(inode)) { > + spin_lock(&inode->i_lock); > + nfs_set_cache_invalid(inode, NFS_INO_INVALID_BLOCKS); > + spin_unlock(&inode->i_lock); > + return; > + } > + > spin_lock(&inode->i_lock); > nfs_writeback_check_extend(hdr, fattr); > nfs_post_op_update_inode_force_wcc_locked(inode, fattr); > > > I'm not sure if reverting this chuck is the correct solution but it > does fix the problem. I also observed resolution when that logic was removed, but I'm sure that removal has undesirable side-effects. As Trond states above: > If O_DIRECT is set on open(), then the NFSv4.x (x>0) > client will set NFS4_SHARE_WANT_NO_DELEG. nfs_writeback_update_inode() assumes that a direct write will be using an open state ID, not a delegation state ID. Thus IMO the NFS client selects a delegation state ID for direct I/O in error. The application has closed the original file descriptor, but the client has cached the open and delegation state IDs. The client's open logic chooses to use the cached delegation state ID. I've opened https://bugzilla.kernel.org/show_bug.cgi?id=219738 to document this issue. >> Note that this happens independent of the minor version. For attribute >> delegation, the client's write completion logic sees the active >> delegation and skips updating the local file size. >> >> Seems like nfs4_opendata_find_nfs4_state() needs to recognize that an >> O_DIRECT OPEN cannot use cached open state that was created by a non >> O_DIRECT_OPEN ?? Or nfs4_set_rw_stateid() needs to recognize that an >> I/O is direct, and choose only an open state ID. But that means there >> needs to be a guarantee that an open state ID is available. >> >> Guidance would be appreciated. >> >> >> -- >> Chuck Lever >> -- Chuck Lever