Elevated i_writecount doesn't guarantee that ->release() will be called

Jan Kara <jack@xxxxxxx> · Thu, 29 Jan 2015 13:46:30 +0100

  Changed subject and added linux-fsdevel to CC so that other developers
reading this don't fall into the same trap :).

On Wed 28-01-15 22:45:34, Al Viro wrote:
> On Wed, Jan 28, 2015 at 01:45:24PM -0800, akpm@xxxxxxxxxxxxxxxxxxxx wrote:
> > atomic_t i_opencnt was used to free allocation in case there were no more
> > opens.  This patch replaces affs_file_open by generic_file_open and uses
> > FMODE_WRITE/i_writecount==1 for the task like other FS.
> 
> 
> >  affs_file_release(struct inode *inode, struct file *filp)
> >  {
> > -	pr_debug("release(%lu, %d)\n",
> > -		 inode->i_ino, atomic_read(&AFFS_I(inode)->i_opencnt));
> > +	pr_debug("release(%lu)\n", inode->i_ino);
> >  
> > -	if (atomic_dec_and_test(&AFFS_I(inode)->i_opencnt)) {
> > +	if ((filp->f_mode & FMODE_WRITE) &&
> > +	    (atomic_read(&inode->i_writecount) == 1)) {
> 
> I'm not at all convinced that this is correct for affs.  Or for anything
> else, for that matter.  Look: suppose somebody else is trying to open
> that sucker with O_TRUNC at that moment and they'd already gotten past
> get_write_access() in handle_truncate(), only to fail on locks_verify_locked().
> _That_ open() won't get anywhere near opening the file, so there won't be
> ->release() for it.  And our ->release() will see ->i_writecount greater
> than 1, due to get_write_access() done in handle_truncate() and still not
> balanced by coming put_write_access() in there - we'll call it after the
> locks_verify_locked() reports failure, but that hasn't happened yet.
> 
> Similar scenarios can almost certainly be constructed for other calls of
> get_write_access() as well, but this one is enough to NAK this patch, _and_
> to make the similar logics in other filesystems very suspicious...

  Thanks for pointing this out. You made me look at where exactly
get_write_access() is called, and there are even places where we call it
without having a file descriptor at all (e.g. the truncate path). So ext3,
ext4, udf, and gfs2 are racy. If we race, the results aren't that bad (we
just keep preallocated blocks in the inode), but it would still be nice to fix.

Obviously we could maintain a private writecount in the ->open() method, but
it seems a bit sad to do that for this mostly theoretical issue. Maybe we
just check whether the preallocation has been truncated when evicting the
inode from memory and, if not, do it there. It's not perfect, but even with
the current racy solution no one has noticed a problem in practice.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR