On Mon, Jan 13, 2025 at 9:50 AM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote: > > On 1/4/25 5:00 PM, Chuck Lever wrote: > > On 12/30/24 10:33 AM, Chuck Lever wrote: > >> On 12/29/24 8:52 PM, Trond Myklebust wrote: > >>> On Sun, 2024-12-29 at 17:37 -0500, Chuck Lever wrote: > >>>> On 12/19/24 3:15 PM, Chuck Lever wrote: > >>>>> On 12/18/24 4:02 PM, Chuck Lever wrote: > >>>>>> Hi - > >>>>>> > >>>>>> I'm testing the NFSD support for attribute delegation, and seeing > >>>>>> these > >>>>>> two new fstests failures: generic/647 and generic/729. Both tests > >>>>>> emit > >>>>>> this error message: > >>>>>> > >>>>>> mmap-rw-fault: pread /media/test/mmap-rw-fault.tmp (O_DIRECT): > >>>>>> 0 != > >>>>>> 4096: Bad address > >>>>>> > >>>>>> This is 100% reproducible with the new patches applied to the > >>>>>> server, > >>>>>> and 100% not reproducible when they are not applied on the > >>>>>> server. > >>>>>> > >>>>>> The failure is due to pread64() (on the client) returning EFAULT. > >>>>>> On > >>>>>> the wire, the passing test does: > >>>>>> > >>>>>> SETATTR (size = 0) > >>>>>> WRITE (offset = 4096, len = 4096) > >>>>>> READ (offset = 0, len = 8192) > >>>>>> READ (offset = 4096, len = 4096) > >>>>>> SETATTR (size = 0) > >>>>>> [ continues until test passes ] > >>>>>> > >>>>>> The failing test does: > >>>>>> > >>>>>> SETATTR (size = 0) > >>>>>> WRITE (offset = 4096, len = 4096) > >>>>>> [ the failed pread64 seems to occur here ] > >>>>>> CLOSE > >>>>>> > >>>>>> In other words, in the failing case, the client does not emit > >>>>>> READs > >>>>>> to pull in the changed file content. > >>>>>> > >>>>>> The test is using O_DIRECT so I function-traced > >>>>>> nfs_direct_read_schedule_iovec(). In the passing case, this > >>>>>> function > >>>>>> generates the usual set of NFS READs on the wire and returns > >>>>>> successfully. > >>>>>> > >>>>>> In the failing case, iov_iter_get_pages_alloc2() invokes > >>>>>> get_user_pages_fast(), and that appears to fail immediately: > >>>>>> > >>>>>> mmap-rw-fault-623256 [016] 175303.310394: > >>>>>> funcgraph_entry: > >>>>>>> get_user_pages_fast() { > >>>>>> mmap-rw-fault-623256 [016] 175303.310395: > >>>>>> funcgraph_entry: > >>>>>>> gup_fast_fallback() { > >>>>>> mmap-rw-fault-623256 [016] 175303.310395: funcgraph_entry: > >>>>>> 0.262 > >>>>>> us | __pte_offset_map(); > >>>>>> mmap-rw-fault-623256 [016] 175303.310395: funcgraph_entry: > >>>>>> 0.142 > >>>>>> us | __rcu_read_unlock(); > >>>>>> mmap-rw-fault-623256 [016] 175303.310396: funcgraph_entry: > >>>>>> 7.824 > >>>>>> us | __gup_longterm_locked(); > >>>>>> mmap-rw-fault-623256 [016] 175303.310404: funcgraph_exit: > >>>>>> 8.967 us > >>>>>>> } > >>>>>> mmap-rw-fault-623256 [016] 175303.310404: funcgraph_exit: > >>>>>> 9.224 us > >>>>>>> } > >>>>>> mmap-rw-fault-623256 [016] 175303.310404: > >>>>>> funcgraph_entry: > >>>>>>> kvfree() { > >>>>>> > >>>>>> My guess is the cached inode file size is still zero. > >>>>> > >>>>> Confirmed: in the failing case, the read fails because the cached > >>>>> file size is still zero. In the passing case, the cached file size > >>>>> is > >>>>> 8192 before the read. > >>>>> > >>>>> During the test, the client truncates the file, then performs an > >>>>> NFS > >>>>> WRITE to the server, extending the size of the file. When an > >>>>> attribute > >>>>> delegation is in effect, that size extension isn't reflected in the > >>>>> cached value of i_size -- the client ensures that INVALID_SIZE is > >>>>> always clear. > >>>>> > >>>>> But perhaps the NFS client is relying on the client's VFS to > >>>>> maintain > >>>>> i_size...? The NFS client has its own direct I/O implementation, so > >>>>> perhaps an i_size update is missing there. > >>>> > >>>> Because the client never retrieves the file's size from the server > >>>> during either the passing or failing cases, this appears to be a > >>>> client > >>>> bug. > >>>> > >>>> The bug is in nfs_writeback_update_inode() -- if mtime is delegated, > >>>> it > >>>> skips the file extension check, and the file size cached on the > >>>> client > >>>> remains zero after the WRITE completes. > >>>> > >>>> The culprit is commit e12912d94137 ("NFSv4: Add support for delegated > >>>> atime and mtime attributes"). If I remove the hunk that this commit > >>>> adds to nfs_writeback_update_inode(), both generic/647 and > >>>> generic/729 > >>>> pass. > >>>> > >>>> > >>> > >>> I'm confused... If O_DIRECT is set on open(), then the NFSv4.x (x>0) > >>> client will set NFS4_SHARE_WANT_NO_DELEG. Furthermore, it should not > >>> set either NFS4_SHARE_WANT_DELEG_TIMESTAMPS or > >>> NFS4_SHARE_WANT_OPEN_XOR_DELEGATION. > >> > >> Examining wire captures... > >> > >> > >> In the passing test (done with NFSv4.1 to the same server), there is > >> indeed an OPEN with OPEN4_SHARE_ACCESS_WANT_NO_DELEG followed by the > >> WRITE to offset 4096 for 4096 bytes. The client returns this OPEN state > >> ID immediately (via CLOSE). > >> > >> Then an OPEN that returns both an OPEN state ID and a WRITE delegation. > >> The client uses the delegation state ID for reading, enabling the test > >> to pass. > > > > The above is not correct. Upon closer examination, the delegation state > > ID is used for the direct WRITE in this case, even though an OPEN state > > ID is available. > > > > But since nfs_have_delegated_mtime() returns false, > > nfs_writeback_update_inode() proceeds to update the cached file size. > > > > > >> There are three OPENs on the wire during the failing test. > >> > >> The first two set OPEN4_SHARE_ACCESS_WANT_NO_DELEG. For those, the > >> server returns an OPEN stateid, delegation type OPEN_DELEGATE_NONE_EXT, > >> and WND4_NOT_WANTED is set. > >> > >> The third OPEN appears to request any kind of open. The share_access > >> field contains the raw value 00300003. The rightmost "3" is SHARE_BOTH. > >> I assume the leftmost "3" means WANT_DELEG_TIMESTAMPS and OPEN_XOR; > >> wireshark doesn't currently recognize those bits. > >> > >> NFSD returns an OPEN_DELEGATE_WRITE_ATTRS_DELEG in response to this > >> request, with a delegation state ID and no OPEN state ID. > >> > >> The client uses this delegation state ID for subsequent write > >> operations. The write completions fail to extend the cached file > >> size due to the presence of the delegation. > > > > Here again the client presents the delegation state ID during the WRITE. > > But since the write delegation also permits delegated time stamps, > > nfs_writeback_update_inode() skips the file size update. > > > > In both cases, nfs4_select_rw_stateid() is choosing a delegation > > state ID for a direct WRITE. In the this case, it's choosing the > > delegation state ID because it has no OPEN state ID (due to > > OPEN_XOR_DELEG being set). > > > > nfs4_map_atomic_open_share() seems to be selecting the wrong > > bits to enable for this test... ? > > > > The test application opens the target file without O_DIRECT, performs > one or two chores, then closes the file. It then opens the same file > with O_DIRECT. > > The problem arises because, on close(2), the client caches the open > state acquired by the first (non-O_DIRECT) open(2). It emits neither a > CLOSE nor a DELEGRETURN. > > For the subsequent O_DIRECT open(2), a fresh OPEN is not emitted. The > client uses the cached open state for direct WRITEs -- in fact, it uses > the delegation state ID from that cached open state. The use of a delegation stateid doesn't seem to be a problem for when this test is run on 4.1. I think you had to correct before when the problem was with how the client deals with updating the attributes (now that it got a delegated attribute ability). The problematic chunk in the mentioned patch is the following: @@ -1514,6 +1516,13 @@ void nfs_writeback_update_inode(struct nfs_pgio_header *hdr) struct nfs_fattr *fattr = &hdr->fattr; struct inode *inode = hdr->inode; + if (nfs_have_delegated_mtime(inode)) { + spin_lock(&inode->i_lock); + nfs_set_cache_invalid(inode, NFS_INO_INVALID_BLOCKS); + spin_unlock(&inode->i_lock); + return; + } + spin_lock(&inode->i_lock); nfs_writeback_check_extend(hdr, fattr); nfs_post_op_update_inode_force_wcc_locked(inode, fattr); I'm not sure if reverting this chuck is the correct solution but it does fix the problem. > > Note that this happens independent of the minor version. For attribute > delegation, the client's write completion logic sees the active > delegation and skips updating the local file size. > > Seems like nfs4_opendata_find_nfs4_state() needs to recognize that an > O_DIRECT OPEN cannot use cached open state that was created by a non > O_DIRECT_OPEN ?? Or nfs4_set_rw_stateid() needs to recognize that an > I/O is direct, and choose only an open state ID. But that means there > needs to be a guarantee that an open state ID is available. > > Guidance would be appreciated. > > > -- > Chuck Lever >