Jeff Layton <jlayton@xxxxxxxxxx> writes:

> On Thu, 2020-11-12 at 10:40 +0000, Luis Henriques wrote:
>> Jeff Layton <jlayton@xxxxxxxxxx> writes:
>>
>> > On Wed, 2020-11-11 at 18:28 +0000, Luis Henriques wrote:
>> > > Jeff Layton <jlayton@xxxxxxxxxx> writes:
>> > >
>> > > > On Wed, 2020-11-11 at 15:39 +0000, Luis Henriques wrote:
>> > > > > When doing a rename across quota realms, there's a corner case that isn't
>> > > > > handled correctly.  Here's a testcase:
>> > > > >
>> > > > >   mkdir files limit
>> > > > >   truncate files/file -s 10G
>> > > > >   setfattr limit -n ceph.quota.max_bytes -v 1000000
>> > > > >   mv files limit/
>> > > > >
>> > > > > The above will succeed because ftruncate(2) won't result in an immediate
>> > > > > notification of the MDSs with the new file size, and thus the quota realms
>> > > > > stats won't be updated.
>> > > > >
>> > > > > This patch forces a sync with the MDS every time there's an ATTR_SIZE that
>> > > > > sets a new i_size, even if we have Fx caps.
>> > > > >
>> > > > > Cc: stable@xxxxxxxxxxxxxxx
>> > > > > Fixes: dffdcd71458e ("ceph: allow rename operation under different quota realms")
>> > > > > URL: https://tracker.ceph.com/issues/36593
>> > > > > Signed-off-by: Luis Henriques <lhenriques@xxxxxxx>
>> > > > > ---
>> > > > >  fs/ceph/inode.c | 11 ++---------
>> > > > >  1 file changed, 2 insertions(+), 9 deletions(-)
>> > > > >
>> > > > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> > > > > index 526faf4778ce..30e3f240ac96 100644
>> > > > > --- a/fs/ceph/inode.c
>> > > > > +++ b/fs/ceph/inode.c
>> > > > > @@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr)
>> > > > >  	if (ia_valid & ATTR_SIZE) {
>> > > > >  		dout("setattr %p size %lld -> %lld\n", inode,
>> > > > >  		     inode->i_size, attr->ia_size);
>> > > > > -		if ((issued & CEPH_CAP_FILE_EXCL) &&
>> > > > > -		    attr->ia_size > inode->i_size) {
>> > > > > -			i_size_write(inode, attr->ia_size);
>> > > > > -			inode->i_blocks = calc_inode_blocks(attr->ia_size);
>> > > > > -			ci->i_reported_size = attr->ia_size;
>> > > > > -			dirtied |= CEPH_CAP_FILE_EXCL;
>> > > > > -			ia_valid |= ATTR_MTIME;
>> > > > > -		} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
>> > > > > -			   attr->ia_size != inode->i_size) {
>> > > > > +		if ((issued & (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
>> > > > > +		    (attr->ia_size != inode->i_size)) {
>> > > > >  			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>> > > > >  			req->r_args.setattr.old_size =
>> > > > >  				cpu_to_le64(inode->i_size);
>> > > >
>> > > > Hmm...this makes truncates more expensive when we have caps. I'd rather
>> > > > not do that if we can help it.
>> > >
>> > > Yeah, as I mentioned in the tracker, there's indeed a performance impact
>> > > with this fix.  That's what made me add the RFC in the subject ;-)
>> > >
>> > > > What about instead having the client mimic a fsync when there is a
>> > > > rename across quota realms? If we can't tell that reliably then we could
>> > > > also just do an effective fsync ahead of any cross-directory rename?
>> > >
>> > > Ok, thanks for the suggestion.  That may actually work, although it will
>> > > make the rename more expensive of course.  I'll test that tomorrow and
>> > > eventually follow-up with a patch.
>> > >
>> >
>> > Patrick pointed out to me on IRC that since you're moving the parent
>> > directory of the truncated file, flushing the caps on the directory
>> > won't really help. You'd need to walk the entire subtree and try to
>> > flush every dirty inode, or basically do a syncfs() prior to renaming
>> > the directory across quotarealms.
>> >
>> > I think we probably will need to revert the change to allow cross-
>> > quotarealm renames of directories and make those return EXDEV again.
>> > Anything else sounds like it's probably going to be too expensive.
>>
>> Hmm... that sounds a bit drastic and it would make the kernel client
>> behave differently from the fuse client -- from what I could understand
>> the fuse client does the sync ATTR_SIZE and thus doesn't have this issue.
>>
>
> True. I'll note that the fuse client is not exactly built for speed,
> however.
>
>> Obviously, I agree with you that the performance penalty is too high for
>> such a common operation.  But maybe renames across quotarealms aren't that
>> common and paying the penalty of doing a full ceph_flush_dirty_caps() is
>> acceptable for such cases?
>>
>
> I wouldn't even do that. If someone is renaming a directory across
> quotarealms, just return EXDEV. Saying "sorry, you have to copy/unlink"
> in this situation seems like it should be acceptable. Are you aware of
> any specific use-cases where people are renaming large directories
> across quotarealms?

No, no specific use cases I'm aware of.  The reasoning was simply an
issue[1] in the tracker created by Greg.  Basically, fuse client commit
b8954e5734b3 ("client: optimize rename operation under different quota
root"), which has its own issues[2][3] associated, triggered the
implementation of the same behaviour on the kernel client.

Anyway, I'm sending the revert of dffdcd71458e ("ceph: allow rename
operation under different quota realms") in a second.

[1] https://tracker.ceph.com/issues/44791
[2] https://tracker.ceph.com/issues/39715
[3] https://tracker.ceph.com/issues/16884

Cheers,
-- 
Luis
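
For reference, the "copy/unlink" fallback discussed above is what userspace already does when rename(2) fails with EXDEV. Below is a minimal Python sketch of that pattern. The helper name move_tree and its rename parameter (injectable so the EXDEV path can be exercised without a real cross-realm mount) are hypothetical, introduced only for illustration; this is not what the kernel client or mv(1) literally implements.

```python
import errno
import os
import shutil


def move_tree(src, dst, rename=os.rename):
    """Move src to dst; copy and unlink if rename fails with EXDEV.

    `rename` is a hypothetical injection point used for illustration
    and testing; it defaults to os.rename.
    """
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        # Only EXDEV means "rename not possible here, copy instead";
        # any other error is propagated unchanged.
        if exc.errno != errno.EXDEV:
            raise

    if os.path.isdir(src) and not os.path.islink(src):
        # Copy the whole subtree, then remove the original.
        shutil.copytree(src, dst, symlinks=True)
        shutil.rmtree(src)
    else:
        # Single file: copy data and metadata, then unlink the source.
        shutil.copy2(src, dst, follow_symlinks=False)
        os.unlink(src)
    return "copied"
```

With the testcase above, move_tree("files", "limit/files") would take the copy path if the rename returned EXDEV. The copy then writes the data into the destination realm, so the quota is enforced on the copied bytes rather than bypassed.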