Re: Mapping range locking and related stuff

Jan Kara <jack@xxxxxxx> · Mon, 30 Sep 2013 18:25:09 +0200

On Sat 28-09-13 09:32:02, Dave Chinner wrote:
> On Fri, Sep 27, 2013 at 10:42:14PM +0200, Jan Kara wrote:
> >   Hello,
> > 
> >   so recently I've spent some time rummaging in get_user_pages(), fault
> > code etc. The use of mmap_sem is really messy in some places (like V4L
> > drivers, infiniband,...). It is held over a deep & wide call chains and
> > it's not clear what's protected by it, just in the middle of that is a call
> > to get_user_pages(). Anyway that's mostly a side note.
> > 
> > The main issue I found is with the range locking itself. Consider someone
> > doing:
> >   fd = open("foo", O_RDWR);
> >   base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> >   write(fd, base, 4096);
> > 
> > The write() is an interesting way to do nothing but if the mapping range
> > lock will be acquired early (like in generic_file_aio_write()), then this
> > would deadlock because generic_perform_write() will try to fault in
> > destination buffer and that will try to get the range lock for the same
> > range again.
> 
> Quite frankly, I'd like to see EFAULT or EDEADLOCK returned to the
> caller doing something like this. It's a stupid thing to do, and
> while I beleive in giving people enough rope to hang themselves,
> the contortions we are going through here to provide that rope
> doesn't seem worthwhile at all.
  Well, what I wrote above was kind of minimal example. But if you would
like to solve all deadlocks of this kind (when more processes are
involved), you would have to forbid giving any file mapping (private or
shared) as a buffer for IO. And I'm afraid we cannot really afford that
because there may be reasonable uses for that.

Standards conformant solution would be to prefault a chunk of pages and
return short IO as soon as we hit unmapped page in the buffer. But I was
already fending off application developers complaining Linux returns short
write when you try to write more than 2 GB. So I'm *sure* we would break a
lot of userspace if we suddently start returning short writes for much
smaller requests. Crappy userspace ;)

I have a way to make things work but it doesn't make the locking simpler
from the current situation which is sad. The logic is as follows:

We will have range lock (basically a range-equivalent of current i_mutex).
Each range can also have a flag set saying that pages cannot be inserted in
a given range. 

Punch hole / truncate lock the range with the flag set.
Write & direct IO lock the range without the flag set.
Read & page fault work on per-page basis and create page for given index
only after waiting for held range locks with flag set covering that index.

This achieves:
1) Exclusion among punch hole, truncate, writes, direct IO.
2) Exclusion between punch hole, truncate and page creation in given range.
3) Exclusion among reads, page faults, writes is achieved by PageLock as
currently.

Direct IO write will have the same consistency problems with mmap as it
currently has.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html