On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote: > On 9/8/21 12:26 AM, Jeff Layton wrote: > > On Fri, 2021-09-03 at 16:15 +0800, xiubli@xxxxxxxxxx wrote: > > > From: Xiubo Li <xiubli@xxxxxxxxxx> > > > > > > When truncating the file, it will leave the MDS to handle that, > > > but the MDS won't update the file contents. So when the fscrypt > > > is enabled, if the truncate size is not aligned to the block size, > > > the kclient will round up the truancate size to the block size and > > > leave the the last block untouched. > > > > > > The opaque fscrypt_file field will be used to tricker whether the > > > last block need to do the rmw to truncate the a specified block's > > > contents, we can get which block needs to do the rmw by round down > > > the fscrypt_file. > > > > > > In kclient side, there is not need to do the rmw immediately after > > > the file is truncated. We can defer doing that whenever the kclient > > > will update that block in late future. And before that any kclient > > > will check the fscrypt_file field when reading that block, if the > > > fscrypt_file field is none zero that means the related block needs > > > to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE)) > > > in pagecache or readed data buffer. > > > > > s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/ > > > > Yes, on x86_64 they are equal, but that's not the case on all arches. > > Also, we are moving toward a pagecache that may hold larger pages on > > x86_64 too. > > > Okay. > > > Once the that block contents are updated and writeback > > > kclient will reset the fscrypt_file field in MDS side, then 0 means > > > no need to care about the truncate stuff any more. > > > > > I'm a little unclear on what the fscrypt_file actually represents here. > > > > I had proposed that we just make the fscrypt_file field hold the > > "actual" i_size and we'd make the old size field always be a rounded- > > up version of the size. The MDS would treat that as an opaque value > > under Fw caps, and the client could use that field to determine i_size. > > That has a side benefit too -- if the client doesn't support fscrypt, > > it'll see the rounded-up sizes which are close enough and don't violate > > any POSIX rules. > > > > In your version, fscrypt_file also holds the actual size of the inode, > > but sometimes you're zeroing it out, and I don't understand why: > > I think I forgot to fix this after I adapt to multiple ftruncates case, > this patch is not correctly handling the "actual" file size. > > I just want the fscrypt_file field always to hold the offset from which > the contents needed to be zeroed, and the range should be [fscrypt_file, > round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)). > > In single ftruncate case the fscrypt_file should equal to the "actual" > file size. Then the "req->r_args.setattr.size = attr->ia_size" and > "req->r_args.setattr.old_size = isize", no need to round up in > __ceph_setattr() in kclient side, and leave the MDS to do that, but we > need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time. > I'm really not a fan of pushing this logic into the MDS. Why does it need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all? > > But in multiple ftruncates case if the second ftruncate has a larger > size, the fscrypt_file won't be changed and the ceph backend will help > zero the extended part, but we still need to zero the contents in the > first ftruncate. If the second ftruncate has a smaller size, the > fscrypt_file need to be updated and always keeps the smaller size. > I don't get it. Maybe you can walk me through a concrete example of how racing ftruncates are a problem? Suppose we have a file and client1 truncates it down from a large size to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and does a RMW on the last (4k) block. client2 comes along at the same time and truncates it up to 13k. client2 issues a SETATTR to extend the file to 16k and does a RMW on the last block too (which would presumably already be all zeroes anyway). Maybe I'm missing something obvious, but I don't see how any of this is an issue? Where does the race come in? > > > 1) What benefit does this extra complexity provide? > > Just in case for multiple ftruncates case as above. > > > 2) once you zero out that value, how do you know what the real i_size > > is? I assume the old size field will hold a rounded-up size (mostly for > > the benefit of the MDS), so you don't have a record of the real i_size > > at that point. > > As above mentioned, the MDS will help do the round up instead and the > "setattr.size" and "setattr.old_size" won't do the round up any more in > kclient side. > I really don't understand what this is fixing and why you need to push the rounding logic into the MDS. fscrypt is largely a client side feature, and I think it'd be best to keep as much of the logic in the client as possible. > > > > Note that there is no reason you can't store _more_ than just a size in > > fscrypt_file. For example, If you need extra flags to indicate > > truncation/rmw state, then you could just make it hold a struct that has > > the size and a flags field. > > Yeah, a struct is a good idea, something likes: > > struct fscrypt_file { > > u64 truncate_offset; > > u8 fscrypt_block_size; // in kbytes > > ... > > }; > > > > > > > There has one special case and that is when there have 2 ftruncates > > > are called: > > > > > > 1) If the second's size equals to the first one, do nothing about > > > the fscrypt_file. > > > 2) If the second's size is smaller than the first one, then we need > > > to update the fscrypt_file with new size. > > > 3) If the second's size is larger than the first one, then we must > > > leave what the fscrypt_file is. Because we always need to truncate > > > more. > > > > > > Add one CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE flag in the cap reqeust > > > to tell the MDS to reset the scrypt_file field once the specified > > > block has been updated, so there still need to adapt to this in the > > > MDS PR. > > > > > > And also this patch just assume that all the buffer and none buffer > > > read/write related enscrypt/descrypt work has been done. > > > > > > Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx> > > > --- > > > fs/ceph/addr.c | 19 ++++++++++++++- > > > fs/ceph/caps.c | 24 +++++++++++++++++++ > > > fs/ceph/file.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++--- > > > fs/ceph/inode.c | 27 ++++++++++++++++++--- > > > fs/ceph/super.h | 12 ++++++++-- > > > 5 files changed, 135 insertions(+), 9 deletions(-) > > > > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c > > > index 6d3f74d46e5b..9f1dd2fc427d 100644 > > > --- a/fs/ceph/addr.c > > > +++ b/fs/ceph/addr.c > > > @@ -212,6 +212,7 @@ static bool ceph_netfs_clamp_length(struct netfs_read_subrequest *subreq) > > > static void finish_netfs_read(struct ceph_osd_request *req) > > > { > > > struct ceph_fs_client *fsc = ceph_inode_to_client(req->r_inode); > > > + struct ceph_inode_info *ci = ceph_inode(req->r_inode); > > > struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0); > > > struct netfs_read_subrequest *subreq = req->r_priv; > > > int num_pages; > > > @@ -234,8 +235,15 @@ static void finish_netfs_read(struct ceph_osd_request *req) > > > > > > netfs_subreq_terminated(subreq, err, true); > > > > > > + /* FIXME: This should be done after descryped */ > > > + if (req->r_result > 0) > > > + ceph_try_to_zero_truncate_block_off(ci, osd_data->alignment, > > > + osd_data->length, > > > + osd_data->pages); > > > + > > The 3rd arg to ceph_try_to_zero_truncate_block_off is end_pos, but here > > you're passing in the length of the OSD write. Doesn't that need to > > added to the pos of the write? > > Yeah, it should be the page_off instead. > > > > > > I'm also a little unclear as to why you need to adjust the truncation > > offset at this point. > > It won't, the ceph_try_to_zero_truncate_block_off() will only zero the > readed pages in range of [fscrypt_off, round_up(fscrypt_file > +CEPH_FSCRYPT_BLOCK_SIZE)). > > > > > num_pages = calc_pages_for(osd_data->alignment, osd_data->length); > > > ceph_put_page_vector(osd_data->pages, num_pages, false); > > > + > > > iput(req->r_inode); > > > } > > > > > > @@ -555,8 +563,10 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc) > > > req->r_end_latency, len, err); > > > > > > ceph_osdc_put_request(req); > > > - if (err == 0) > > > + if (err == 0) { > > > + ceph_reset_truncate_block_off(ci, page_off, len); > > > err = len; > > > + } > > > > > > if (err < 0) { > > > struct writeback_control tmp_wbc; > > > @@ -661,10 +671,17 @@ static void writepages_finish(struct ceph_osd_request *req) > > > (u64)osd_data->length); > > > total_pages += num_pages; > > > for (j = 0; j < num_pages; j++) { > > > + u64 page_off; > > > + > > > page = osd_data->pages[j]; > > > BUG_ON(!page); > > > WARN_ON(!PageUptodate(page)); > > > > > > + page_off = osd_data->alignment + j * PAGE_SIZE; > > > + if (rc >= 0) > > > + ceph_reset_truncate_block_off(ci, page_off, > > > + page_off + PAGE_SIZE); > > > + > > > if (atomic_long_dec_return(&fsc->writeback_count) < > > > CONGESTION_OFF_THRESH( > > > fsc->mount_options->congestion_kb)) > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c > > > index d628dcdbf869..a211ab4c3f7a 100644 > > > --- a/fs/ceph/caps.c > > > +++ b/fs/ceph/caps.c > > > @@ -1425,6 +1425,9 @@ static vthatoid __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap, > > > } > > > } > > > } > > > + if (ci->i_truncate_block_off < 0) > > > + flags |= CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE;It sounds like you want to do something quite different from what I originally proposed. > > > + > > > arg->flags = flags; > > > arg->encrypted = IS_ENCRYPTED(inode); > > > if (ci->fscrypt_auth_len && > > > @@ -3155,6 +3158,27 @@ void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci, int had) > > > __ceph_put_cap_refs(ci, had, PUT_CAP_REFS_NO_CHECK); > > > } > > > > > > +/* > > > + * Clear the i_truncate_block_off and fscrypt_file > > > + * if old last encrypted block has been updated. > > > + */ > > > +void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci, > > > + u64 start_pos, u64 end_pos) > > > +{ > > > + if (ci->i_truncate_block_off > 0 && > > > + ci->i_truncate_block_off >= start_pos && > > > + ci->i_truncate_block_off < end_pos) > > > + ci->i_truncate_block_off = 0; > > > +} > > > + > > > +void ceph_reset_truncate_block_off(struct ceph_inode_info *ci, > > > + u64 start_pos, u64 end_pos) > > > +{ > > > + spin_lock(&ci->i_ceph_lock); > > > + __ceph_reset_truncate_block_off(ci, start_pos, end_pos); > > > + spin_unlock(&ci->i_ceph_lock); > > > +} > > > + > > > /* > > > * Release @nr WRBUFFER refs on dirty pages for the given @snapc snap > > > * context. Adjust per-snap dirty page accounting as appropriate. > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c > > > index 6e677b40410e..cfa4cbe08c10 100644 > > > --- a/fs/ceph/file.c > > > +++ b/fs/ceph/file.c > > > @@ -885,10 +885,34 @@ static void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len) > > > } > > > } > > > > > > +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci, > > > + u64 start_pos, u64 end_pos, > > > + struct page **pages) > > > +{ > > > + u64 zoff, zlen; > > > + > > > + spin_lock(&ci->i_ceph_lock); > > > + if (ci->i_truncate_block_off >= start_pos && > > > + ci->i_truncate_block_off < end_pos) { > > > + zoff = ci->i_truncate_block_off - start_pos; > > > + zlen = round_up(ci->i_truncate_block_off, PAGE_SIZE) - ci->i_truncate_block_off; > > > + > > > + spin_unlock(&ci->i_ceph_lock); > > > + ceph_zero_page_vector_range(zoff, zlen, pages); > > > + spin_lock(&ci->i_ceph_lock); > > > + } > > > + spin_unlock(&ci->i_ceph_lock); > > > +} > > > #else > > > static inline void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len) > > > { > > > } > > > + > > > +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci, > > > + u64 start_pos, u64 end_pos, > > > + struct page **pages) > > > +{ > > > +} > > > #endif > > > > > > /* > > > @@ -1030,6 +1054,13 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to, > > > ret += zlen; > > > } > > > > > > + /* > > > + * If the inode is ENCRYPTED the read_off is aligned to PAGE_SIZE > > > + */ > > > + ceph_try_to_zero_truncate_block_off(ci, read_off, > > > + read_off + read_len, > > > + pages); > > > + > > > idx = 0; > > > left = ret > 0 ? ret : 0; > > > while (left > 0) { > > > @@ -1413,12 +1444,34 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter, > > > > > > size = i_size_read(inode); > > > if (!write) { > > > + struct iov_iter i; > > > + size_t boff; > > > + int zlen; > > > + > > > if (ret == -ENOENT) > > > ret = 0; > > > + > > > + /* Zero the truncate block off */ > > > + spin_lock(&ci->i_ceph_lock); > > > + boff = ci->i_truncate_block_off; > > > + if (IS_ENCRYPTED(inode) && ret > 0 && boff > 0 && > > > + boff >= (iocb->ki_pos & PAGE_MASK) && > > > + boff < round_up(ret, PAGE_SIZE)) { > > > + int advance = 0; > > > +rovide? > > > + zlen = round_up(boff, PAGE_SIZE) - boff; > > > + if (ci->i_truncate_block_off >= iocb->ki_pos) > > > + advance = boff - iocb->ki_pos; > > > + > > > + iov_iter_bvec(&i, READ, bvecs, num_pages, len); > > > + iov_iter_advance(&i, advance); > > > + iov_iter_zero(zlen, &i); > > > + } > > > + spin_unlock(&ci->i_ceph_lock); > > > + > > > if (ret >= 0 && ret < len && pos + ret < size) { > > > - struct iov_iter i; > > > - int zlen = min_t(size_t, len - ret, > > > - size - pos - ret); > > > + zlen = min_t(size_t, len - ret, > > > + size - pos - ret); > > > > > > iov_iter_bvec(&i, READ, bvecs, num_pages, len); > > > iov_iter_advance(&i, ret); > > > @@ -1967,6 +2020,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from) > > > struct ceph_fs_client *fsc = ceph_inode_to_client(inode); > > > struct ceph_osd_client *osdc = &fsc->client->osdc; > > > struct ceph_cap_flush *prealloc_cf; > > > + u64 start_pos = iocb->ki_pos; > > > ssize_t count, written = 0; > > > int err, want, got; > > > bool direct_lock = false; > > > @@ -2110,6 +2164,8 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from) > > > int dirty; > > > > > > spin_lock(&ci->i_ceph_lock); > > > + __ceph_reset_truncate_block_off(ci, start_pos, iocb->ki_pos); > > > + > > > ci->i_inline_version = CEPH_INLINE_NONE; > > > dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, > > > &prealloc_cf); > > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c > > > index 1a4c9bc485fc..c48c77c1bcf4 100644 > > > --- a/fs/ceph/inode.c > > > +++ b/fs/ceph/inode.c > > > @@ -625,6 +625,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb) > > > ci->fscrypt_auth = NULL; > > > ci->fscrypt_auth_len = 0; > > > #endif > > > + ci->i_truncate_block_off = 0; > > > > > > return &ci->vfs_inode; > > > } > > > @@ -1033,11 +1034,24 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page, > > > > > > pool_ns = old_ns; > > > > > > + /* the fscrypt_file is 0 means the file content truncate has been done */ > > > if (IS_ENCRYPTED(inode) && size && > > > - (iinfo->fscrypt_file_len == sizeof(__le64))) { > > > + iinfo->fscrypt_file_len == sizeof(__le64) && > > > + __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file) > 0) { > > > size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file); > > > if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE)) > > > pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size); > > > + > > > + /* > > > + * If the second truncate come just after the first > > > + * truncate, and if the second has a larger size there > > > + * is no need to update the i_truncate_block_off. > > > + * Only when the second one has a smaller size, that > > > + * means we need to truncate more. > > > + */ > > > + if (ci->i_truncate_block_off > 0 && > > > + size < ci->i_truncate_block_off) > > > + ci->i_truncate_block_off = size; > > > } > > > > > > queue_trunc = ceph_fill_file_size(inode, issued, > > > @@ -2390,8 +2404,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c > > > req->r_args.setattr.old_size = > > > cpu_to_le64(round_up(isize, > > > CEPH_FSCRYPT_BLOCK_SIZE)); > > > - req->r_fscrypt_file = attr->ia_size; > > > - /* FIXME: client must zero out any partial blocks! */ > > > + if (attr->ia_size < isize) { > > > + if(IS_ALIGNED(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE)) > > > + req->r_fscrypt_file = 0; > > > + else > > > + req->r_fscrypt_file = attr->ia_size; > > > + /* FIXME: client must zero out any partial blocks! */ > > > + } else if (attr->ia_size > isize) { > > > + req->r_fscrypt_file = attr->ia_size; > > > + } > > > } else { > > > req->r_args.setattr.size = cpu_to_le64(attr->ia_size); > > > req->r_args.setattr.old_size = cpu_to_le64(isize); > > > diff --git a/fs/ceph/super.h b/fs/ceph/super.h > > > index 7f3976b3319d..856caeb25fb6 100644 > > > --- a/fs/ceph/super.h > > > +++ b/fs/ceph/super.h > > > @@ -443,9 +443,9 @@ struct ceph_inode_info { > > > struct fscache_cookie *fscache; > > > #endif > > > u32 fscrypt_auth_len; > > > - u32 fscrypt_file_len; > > > u8 *fscrypt_auth; > > > - u8 *fscrypt_file; > > > + /* need to zero the last block when decrypting the content to pagecache */ > > > + size_t i_truncate_block_off; > > > > > Ugh, do we really need yet another field in the inode? This seems > > totally unnecessary, but maybe I'm missing some subtlety in the > > truncation handling that requires this extra tracking. > > > > What is this intended to represent anyway? > > Just a draft patch. And I remove the unused "u8 *fscrypt_file" member. > > We can remove this and just reuse the "u8 *fscrypt_file" here. > > After the inode is filled, this will keep the offset from which it needs > to do the zero stuff after reading to the local pagecache or buffer. > > > > errseq_t i_meta_err; > > > > > > @@ -1192,6 +1192,10 @@ extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had); > > > extern void ceph_put_cap_refs_async(struct ceph_inode_info *ci, int had); > > > extern void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci, > > > int had); > > > +extern void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci, > > > + u64 start_pos, u64 end_pos); > > > +extern void ceph_reset_truncate_block_off(struct ceph_inode_info *ci, > > > + u64 start_pos, u64 end_pos); > > > extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr, > > > struct ceph_snap_context *snapc); > > > extern void ceph_flush_snaps(struct ceph_inode_info *ci, > > > @@ -1282,6 +1286,10 @@ extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks, > > > extern void ceph_fs_debugfs_init(struct ceph_fs_client *client); > > > extern void ceph_fs_debugfs_cleanup(struct ceph_fs_client *client); > > > > > > +extern void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci, > > > + u64 start_pos, u64 end_pos, > > > + struct page **pages); > > > + > > > /* quota.c */ > > > static inline bool __ceph_has_any_quota(struct ceph_inode_info *ci) > > > { > -- Jeff Layton <jlayton@xxxxxxxxxx>