Re: Weird writev() behaviour on EFAULT - also successfully modifying the file

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Thu, 11 Aug 2016 23:10:53 +0100

On Thu, Aug 11, 2016 at 09:46:51PM +0300, Tuomas Tynkkynen wrote:
> Greetings,
> 
> I've noticed a corner case with writev() both modifying the file and
> returning -EFAULT at the same time. This happens on filesystems using
> generic_perform_write() (i.e. ext4, vfat) on 4.6.3 kernel and below,
> down to 3.16. Here's the reproducer:

[fill 8Kb with 'A', then reopen the file and do writev() on two-element
iovec array consisting of {NULL,0},{NULL,4096}]

> Running that prints "writev(): Bad address" but also some NUL bytes have
> appeared at the beginning file, in addition to the 'A's by the dd.
> 
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001000  41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41  |AAAAAAAAAAAAAAAA|
> *
> 00002000
> 
> Is that intented behaviour?

There are two issues mixed here: one is that iovec array (NULL/0, NULL/4K)
we get
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
{
        if (!(i->type & (ITER_BVEC|ITER_KVEC))) {
                char __user *buf = i->iov->iov_base + i->iov_offset;
                bytes = min(bytes, i->iov->iov_len - i->iov_offset);
                return fault_in_pages_readable(buf, bytes);
        }
        return 0;
}
succeed.  It shouldn't.  However, even if it correctly failed, we still have
another one - __copy_from_user_inatomic() zeroes what it hadn't copied over,
so if we e.g. gave write(2) a buffer with unmapped page in the middle, we
would get a short write *and* a bunch of bytes past the amount reported
written replaced with zeroes.  And _that_ goes very far back.  2.1.late
definitely had already been that way.

Note, BTW, that fixing iov_iter_fault_in_readable() won't exclude a possibility
of write() returning -EFAULT and overwriting some bytes with zeroes; all you
need is munmap() racing with write() and hitting after the successful fault-in.

We could, in principle, provide a non-zeroing variants of copy_from_user and
copy_from_user_inatomic and switch ->write()/->write_iter() instances to
use of those; however, I'm not sure if it's really worth bothering with.

I do see one potential argument in favour of doing that; suppose we have
a buffer filled with 'A'.  It starts one byte before a page boundary and all
its pages are currently in swap.  There's a file also filled with 'A' and
mmapped by another process.  We say write() from our buffer to that file.
Having that other process observe transient zeroes appearing in its mmapped
area is Not Nice(tm); having them stick around is even worse, and that can
happen in case if the first process gets killed while it's faulting the next
page of buffer in.

Linus?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html