Re: [RFC] what's going on with file->f_pos uses in orangefs_file_write_iter()?

martin@xxxxxxxxxxxx · Fri, 8 Dec 2017 11:39:21 -0500

On Thu, Dec 07, 2017 at 10:26:10PM +0000, Al Viro wrote:
> I'd missed that back then, but...
> 
>         if (file->f_pos > i_size_read(file->f_mapping->host))
>                 orangefs_i_size_write(file->f_mapping->host, file->f_pos);
> 
>         rc = generic_write_checks(iocb, iter);
> 
>         if (rc <= 0) {
>                 gossip_err("%s: generic_write_checks failed, rc:%zd:.\n",
>                            __func__, rc);
>                 goto out;
>         }
> 
>         /*
>          * if we are appending, generic_write_checks would have updated
>          * pos to the end of the file, so we will wait till now to set
>          * pos...
>          */
>         pos = *(&iocb->ki_pos);
> 
> looks suspicious as hell.  What's going on there?  Not to mention anything
> else file->f_pos might be completely unrelated to any IO going on -
> consider e.g. pwrite(2), where the position (in iocb->ki_pos) has nothing
> to do with file->f_pos.  Then there's the question of WTF is write()
> (or pwrite()) past the current EOF doing bumping the file size, before
> it even gets a chance to decide whether it'll be trying to do any IO at
> all.

It seems to be a poor attempt at doing what the code immediately above
it does.

http://dev.orangefs.org/trac/orangefs/changeset/4828

"put the i_size update at the beginning instead of the end. The issue is
that generic_write_checks sets the offset to i_size in the case of
append, and if i_pos has changed (so that its > i_size) outside of
write, than we need to update the i_size to reflect that value before
the offset is updated."

Oh lovely, he's just moving it.

http://dev.orangefs.org/trac/orangefs/changeset/4827

"fix for pwrite04 test that does a pwrite with open(O_APPEND).

"fix for f_frsize (fragment size) to be equal to f_bsize. This should fix
newer statfs tools that use frsize as the block size instead of bsize.
Hopefully, it won't break anything that depended on frsize being
hardcoded to 1024."

Which is just a rework of

http://dev.orangefs.org/trac/orangefs/changeset/4723

"(from branch): fix for append bug"

I'm not sure what branch he's talking about, and SVN seems to be
designed so as to make it impossible to find out.

Note that nothing updates i_size after a successful write.  So now we do
the getattr which he claims is ridiculous (it may be cached today though
it probably won't be) anywhere i_size is needed.  In do_readv_writev we
invalidate the cached attribute so as to force an update from the
server.

Well none of this does particular wonders for performance.  None of it
really solves the problem, either.  OrangeFS does not properly support
append writes, so we send a offset equal to the size of the file.  It
completely fails when the file is being written to from another machine
(or even the same machine but not through the kernel).  Without some
work on the OrangeFS server, solving that problem is impossible.

Given that limitation, the best we can hope for is consistency on a
single machine.  We could try to avoid the getattr by updating i_size
then not doing a getattr if within a cache timeout.  That would be more
in line with the getattr cache I've implemented.

I avoided it here because there was already an explicit getattr.  I am
not sure under what circumstances it was added.  I've looked through
Mike's pre-merge code; it's in there as early (3.15) as I can see.  It's
not in the out-of-tree module.  Perhaps Mike remembers why it was added.

https://github.com/hubcapsc/linux/commit/60eb8a4f7fc5931ccec0372483ed405ef9ca9110

> 
> _Then_ there's the deadlock on 32bit SMP in that code.  Look: several
> lines prior we'd done
>         inode_lock(file->f_mapping->host);
> and hadn't unlocked the sucker since then.  And
> static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
> {
> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>         inode_lock(inode);
> #endif
>         i_size_write(inode, i_size);
> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>         inode_unlock(inode);
> #endif
> }
> means that if we get around to calling it there in SMP/32bit case, we'll
> get as plain a deadlock as possible.  And AFAICS it had been that way
> since the initial merge.
> 
> What the hell is that code about and what is it trying to do?

Guess what our original commit message is: "Trac 16"

http://dev.orangefs.org/trac/orangefs/changeset/6709

Hardly seems to be any use trying to figure out what motivated it.  I
don't think it was a deadlock then except to the extent that all the
OrangeFS code was a snarl of deadlocks then.  But the obfuscation only
serves to hide it from us.

> 
> PS: While we are at it, what's the point of that *(&...) in there?

https://github.com/hubcapsc/linux/commit/cae140731030471ec4782d65efefe8e819d6a467

Not sure why it's obfuscated but there's another one in
orangefs_file_read_iter.

I intended to send a series that implements pagecache support in
OrangeFS at some point in the future.  It's not quite done, but I
suppose it would do just as well to send it for review sooner rather
than later.  All of this code has been significantly re-written there.

Martin

There is no xfstests regression with the following.

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 1668fd645c45..0d228cd087e6 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -452,7 +452,7 @@ ssize_t orangefs_inode_read(struct inode *inode,
 static ssize_t orangefs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
-	loff_t pos = *(&iocb->ki_pos);
+	loff_t pos = iocb->ki_pos;
 	ssize_t rc = 0;
 
 	BUG_ON(iocb->private);
@@ -492,9 +492,6 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
 		}
 	}
 
-	if (file->f_pos > i_size_read(file->f_mapping->host))
-		orangefs_i_size_write(file->f_mapping->host, file->f_pos);
-
 	rc = generic_write_checks(iocb, iter);
 
 	if (rc <= 0) {
@@ -508,7 +505,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
 	 * pos to the end of the file, so we will wait till now to set
 	 * pos...
 	 */
-	pos = *(&iocb->ki_pos);
+	pos = iocb->ki_pos;
 
 	rc = do_readv_writev(ORANGEFS_IO_WRITE,
 			     file,