Re: [RFC] what's going on with file->f_pos uses in orangefs_file_write_iter()?

Mike Marshall <hubcap@xxxxxxxxxxxx> · Tue, 12 Dec 2017 11:31:06 -0500

AL> What's going on there?

Around 3.18 (before we went upstream) I got rid of .aio_read and
.aio_write and "replaced" them with implementations of .read_iter
and .write_iter.

Then, in Linux 4.0, generic_write_checks got iterized too.

-inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count)
+inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)

In my 4.1 rebase I refactored a bunch of un-iterized stuff in file.c and
tried to retain functionality... some code (the "suspicious as hell"
code) I iterized and moved from do_readv_writev to orangefs_file_write_iter.

AL> _Then_ there's the deadlock on 32bit SMP in that code.

Yeah, that doesn't need to be there... I'm a little afraid to say
who put it there in commit 5ecfcb2... <g>

Hopefully Martin's suggested fix will do for now. We'll try to send
it up - I guess next merge window is the right time? In one of the
upcoming merge windows, I hope we have some page-cache related code
to send up that will get rid of all of that...

As far as the cumbersome C expression is concerned... I'm reminded
of this cartoon I drew...

https://sites.google.com/site/hubcapsite3/misc/tinFoilHat.jpg

-Mike

On Fri, Dec 8, 2017 at 11:39 AM,  <martin@xxxxxxxxxxxx> wrote:
> On Thu, Dec 07, 2017 at 10:26:10PM +0000, Al Viro wrote:
>> I'd missed that back then, but...
>>
>>         if (file->f_pos > i_size_read(file->f_mapping->host))
>>                 orangefs_i_size_write(file->f_mapping->host, file->f_pos);
>>
>>         rc = generic_write_checks(iocb, iter);
>>
>>         if (rc <= 0) {
>>                 gossip_err("%s: generic_write_checks failed, rc:%zd:.\n",
>>                            __func__, rc);
>>                 goto out;
>>         }
>>
>>         /*
>>          * if we are appending, generic_write_checks would have updated
>>          * pos to the end of the file, so we will wait till now to set
>>          * pos...
>>          */
>>         pos = *(&iocb->ki_pos);
>>
>> looks suspicious as hell.  What's going on there?  Not to mention anything
>> else file->f_pos might be completely unrelated to any IO going on -
>> consider e.g. pwrite(2), where the position (in iocb->ki_pos) has nothing
>> to do with file->f_pos.  Then there's the question of WTF is write()
>> (or pwrite()) past the current EOF doing bumping the file size, before
>> it even gets a chance to decide whether it'll be trying to do any IO at
>> all.
>
> It seems to be a poor attempt at doing what the code immediately above
> it does.
>
> http://dev.orangefs.org/trac/orangefs/changeset/4828
>
> "put the i_size update at the beginning instead of the end. The issue is
> that generic_write_checks sets the offset to i_size in the case of
> append, and if i_pos has changed (so that its > i_size) outside of
> write, than we need to update the i_size to reflect that value before
> the offset is updated."
>
> Oh lovely, he's just moving it.
>
> http://dev.orangefs.org/trac/orangefs/changeset/4827
>
> "fix for pwrite04 test that does a pwrite with open(O_APPEND).
>
> "fix for f_frsize (fragment size) to be equal to f_bsize. This should fix
> newer statfs tools that use frsize as the block size instead of bsize.
> Hopefully, it won't break anything that depended on frsize being
> hardcoded to 1024."
>
> Which is just a rework of
>
> http://dev.orangefs.org/trac/orangefs/changeset/4723
>
> "(from branch): fix for append bug"
>
> I'm not sure what branch he's talking about, and SVN seems to be
> designed so as to make it impossible to find out.
>
> Note that nothing updates i_size after a successful write.  So now we do
> the getattr which he claims is ridiculous (it may be cached today though
> it probably won't be) anywhere i_size is needed.  In do_readv_writev we
> invalidate the cached attribute so as to force an update from the
> server.
>
> Well none of this does particular wonders for performance.  None of it
> really solves the problem, either.  OrangeFS does not properly support
> append writes, so we send a offset equal to the size of the file.  It
> completely fails when the file is being written to from another machine
> (or even the same machine but not through the kernel).  Without some
> work on the OrangeFS server, solving that problem is impossible.
>
> Given that limitation, the best we can hope for is consistency on a
> single machine.  We could try to avoid the getattr by updating i_size
> then not doing a getattr if within a cache timeout.  That would be more
> in line with the getattr cache I've implemented.
>
> I avoided it here because there was already an explicit getattr.  I am
> not sure under what circumstances it was added.  I've looked through
> Mike's pre-merge code; it's in there as early (3.15) as I can see.  It's
> not in the out-of-tree module.  Perhaps Mike remembers why it was added.
>
> https://github.com/hubcapsc/linux/commit/60eb8a4f7fc5931ccec0372483ed405ef9ca9110
>
>>
>> _Then_ there's the deadlock on 32bit SMP in that code.  Look: several
>> lines prior we'd done
>>         inode_lock(file->f_mapping->host);
>> and hadn't unlocked the sucker since then.  And
>> static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
>> {
>> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>>         inode_lock(inode);
>> #endif
>>         i_size_write(inode, i_size);
>> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>>         inode_unlock(inode);
>> #endif
>> }
>> means that if we get around to calling it there in SMP/32bit case, we'll
>> get as plain a deadlock as possible.  And AFAICS it had been that way
>> since the initial merge.
>>
>> What the hell is that code about and what is it trying to do?
>
> Guess what our original commit message is: "Trac 16"
>
> http://dev.orangefs.org/trac/orangefs/changeset/6709
>
> Hardly seems to be any use trying to figure out what motivated it.  I
> don't think it was a deadlock then except to the extent that all the
> OrangeFS code was a snarl of deadlocks then.  But the obfuscation only
> serves to hide it from us.
>
>>
>> PS: While we are at it, what's the point of that *(&...) in there?
>
> https://github.com/hubcapsc/linux/commit/cae140731030471ec4782d65efefe8e819d6a467
>
> Not sure why it's obfuscated but there's another one in
> orangefs_file_read_iter.
>
> I intended to send a series that implements pagecache support in
> OrangeFS at some point in the future.  It's not quite done, but I
> suppose it would do just as well to send it for review sooner rather
> than later.  All of this code has been significantly re-written there.
>
> Martin
>
> There is no xfstests regression with the following.
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index 1668fd645c45..0d228cd087e6 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -452,7 +452,7 @@ ssize_t orangefs_inode_read(struct inode *inode,
>  static ssize_t orangefs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  {
>         struct file *file = iocb->ki_filp;
> -       loff_t pos = *(&iocb->ki_pos);
> +       loff_t pos = iocb->ki_pos;
>         ssize_t rc = 0;
>
>         BUG_ON(iocb->private);
> @@ -492,9 +492,6 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
>                 }
>         }
>
> -       if (file->f_pos > i_size_read(file->f_mapping->host))
> -               orangefs_i_size_write(file->f_mapping->host, file->f_pos);
> -
>         rc = generic_write_checks(iocb, iter);
>
>         if (rc <= 0) {
> @@ -508,7 +505,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
>          * pos to the end of the file, so we will wait till now to set
>          * pos...
>          */
> -       pos = *(&iocb->ki_pos);
> +       pos = iocb->ki_pos;
>
>         rc = do_readv_writev(ORANGEFS_IO_WRITE,
>                              file,