AL> What's going on there? Around 3.18 (before we went upstream) I got rid of .aio_read and .aio_write and "replaced" them with implementations of .read_iter and .write_iter. Then, in Linux 4.0, generic_write_checks got iterized too. -inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count) +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) In my 4.1 rebase I refactored a bunch of un-iterized stuff in file.c and tried to retain functionality... some code (the "suspicious as hell" code) I iterized and moved from do_readv_writev to orangefs_file_write_iter. AL> _Then_ there's the deadlock on 32bit SMP in that code. Yeah, that doesn't need to be there... I'm a little afraid to say who put it there in commit 5ecfcb2... <g> Hopefully Martin's suggested fix will do for now. We'll try to send it up - I guess next merge window is the right time? In one of the upcoming merge windows, I hope we have some page-cache related code to send up that will get rid of all of that... As far as the cumbersome C expression is concerned... I'm reminded of this cartoon I drew... https://sites.google.com/site/hubcapsite3/misc/tinFoilHat.jpg -Mike On Fri, Dec 8, 2017 at 11:39 AM, <martin@xxxxxxxxxxxx> wrote: > On Thu, Dec 07, 2017 at 10:26:10PM +0000, Al Viro wrote: >> I'd missed that back then, but... >> >> if (file->f_pos > i_size_read(file->f_mapping->host)) >> orangefs_i_size_write(file->f_mapping->host, file->f_pos); >> >> rc = generic_write_checks(iocb, iter); >> >> if (rc <= 0) { >> gossip_err("%s: generic_write_checks failed, rc:%zd:.\n", >> __func__, rc); >> goto out; >> } >> >> /* >> * if we are appending, generic_write_checks would have updated >> * pos to the end of the file, so we will wait till now to set >> * pos... >> */ >> pos = *(&iocb->ki_pos); >> >> looks suspicious as hell. What's going on there? Not to mention anything >> else file->f_pos might be completely unrelated to any IO going on - >> consider e.g. pwrite(2), where the position (in iocb->ki_pos) has nothing >> to do with file->f_pos. Then there's the question of WTF is write() >> (or pwrite()) past the current EOF doing bumping the file size, before >> it even gets a chance to decide whether it'll be trying to do any IO at >> all. > > It seems to be a poor attempt at doing what the code immediately above > it does. > > http://dev.orangefs.org/trac/orangefs/changeset/4828 > > "put the i_size update at the beginning instead of the end. The issue is > that generic_write_checks sets the offset to i_size in the case of > append, and if i_pos has changed (so that its > i_size) outside of > write, than we need to update the i_size to reflect that value before > the offset is updated." > > Oh lovely, he's just moving it. > > http://dev.orangefs.org/trac/orangefs/changeset/4827 > > "fix for pwrite04 test that does a pwrite with open(O_APPEND). > > "fix for f_frsize (fragment size) to be equal to f_bsize. This should fix > newer statfs tools that use frsize as the block size instead of bsize. > Hopefully, it won't break anything that depended on frsize being > hardcoded to 1024." > > Which is just a rework of > > http://dev.orangefs.org/trac/orangefs/changeset/4723 > > "(from branch): fix for append bug" > > I'm not sure what branch he's talking about, and SVN seems to be > designed so as to make it impossible to find out. > > Note that nothing updates i_size after a successful write. So now we do > the getattr which he claims is ridiculous (it may be cached today though > it probably won't be) anywhere i_size is needed. In do_readv_writev we > invalidate the cached attribute so as to force an update from the > server. > > Well none of this does particular wonders for performance. None of it > really solves the problem, either. OrangeFS does not properly support > append writes, so we send a offset equal to the size of the file. It > completely fails when the file is being written to from another machine > (or even the same machine but not through the kernel). Without some > work on the OrangeFS server, solving that problem is impossible. > > Given that limitation, the best we can hope for is consistency on a > single machine. We could try to avoid the getattr by updating i_size > then not doing a getattr if within a cache timeout. That would be more > in line with the getattr cache I've implemented. > > I avoided it here because there was already an explicit getattr. I am > not sure under what circumstances it was added. I've looked through > Mike's pre-merge code; it's in there as early (3.15) as I can see. It's > not in the out-of-tree module. Perhaps Mike remembers why it was added. > > https://github.com/hubcapsc/linux/commit/60eb8a4f7fc5931ccec0372483ed405ef9ca9110 > >> >> _Then_ there's the deadlock on 32bit SMP in that code. Look: several >> lines prior we'd done >> inode_lock(file->f_mapping->host); >> and hadn't unlocked the sucker since then. And >> static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size) >> { >> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP) >> inode_lock(inode); >> #endif >> i_size_write(inode, i_size); >> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP) >> inode_unlock(inode); >> #endif >> } >> means that if we get around to calling it there in SMP/32bit case, we'll >> get as plain a deadlock as possible. And AFAICS it had been that way >> since the initial merge. >> >> What the hell is that code about and what is it trying to do? > > Guess what our original commit message is: "Trac 16" > > http://dev.orangefs.org/trac/orangefs/changeset/6709 > > Hardly seems to be any use trying to figure out what motivated it. I > don't think it was a deadlock then except to the extent that all the > OrangeFS code was a snarl of deadlocks then. But the obfuscation only > serves to hide it from us. > >> >> PS: While we are at it, what's the point of that *(&...) in there? > > https://github.com/hubcapsc/linux/commit/cae140731030471ec4782d65efefe8e819d6a467 > > Not sure why it's obfuscated but there's another one in > orangefs_file_read_iter. > > I intended to send a series that implements pagecache support in > OrangeFS at some point in the future. It's not quite done, but I > suppose it would do just as well to send it for review sooner rather > than later. All of this code has been significantly re-written there. > > Martin > > There is no xfstests regression with the following. > > diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c > index 1668fd645c45..0d228cd087e6 100644 > --- a/fs/orangefs/file.c > +++ b/fs/orangefs/file.c > @@ -452,7 +452,7 @@ ssize_t orangefs_inode_read(struct inode *inode, > static ssize_t orangefs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter) > { > struct file *file = iocb->ki_filp; > - loff_t pos = *(&iocb->ki_pos); > + loff_t pos = iocb->ki_pos; > ssize_t rc = 0; > > BUG_ON(iocb->private); > @@ -492,9 +492,6 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite > } > } > > - if (file->f_pos > i_size_read(file->f_mapping->host)) > - orangefs_i_size_write(file->f_mapping->host, file->f_pos); > - > rc = generic_write_checks(iocb, iter); > > if (rc <= 0) { > @@ -508,7 +505,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite > * pos to the end of the file, so we will wait till now to set > * pos... > */ > - pos = *(&iocb->ki_pos); > + pos = iocb->ki_pos; > > rc = do_readv_writev(ORANGEFS_IO_WRITE, > file,