Re: [PATCH 1/1] ext4, dio: Remove overflow for size >2G in aio-dio code.

Jeff Moyer <jmoyer@xxxxxxxxxx> · Tue, 22 May 2012 16:41:02 -0400

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sun, May 20, 2012 at 8:28 PM, manish honap
> <manish_honap_vit@xxxxxxxxxxx> wrote:
>> Hello Linus,
>>
>> The overflow issue was seen during async dio path
>
> Christ. fs/aio.c doesn't do the proper rw_verify_area().
>
> As a result, it doesn't check file locks, and it doesn't seem to check
> offset overflows either.
>
> The vector versions kind of get the size limit by mistake (because
> they at least use rw_copy_check_uvector(), which does limit things to
> MAX_RW_COUNT), but they don't do the offset overflow check either.
>
> Does this patch work for you? What it *should* do is the same that the
> other read/write paths do (and the vector path for aio already do),
> namely truncate reads or writes to MAX_RW_COUNT (which is INT_MAX
> aligned down to a page).

OK, this took me a while to wrap my head around, mostly due to confusion
with ki_nbytes and ki_left.  When doing p{read|write}v, aio_nbytes is
used to store the number of vectors.  Thus, we need to set both ki_left
and ki_nbytes to the result of the rw_verify_area call, which means that
when ki_left goes to 0, we'll exit out of this loop:

	do {
		ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg],
			    iocb->ki_nr_segs - iocb->ki_cur_seg,
			    iocb->ki_pos);
		if (ret > 0)
			aio_advance_iovec(iocb, ret);

	/* retry all partial writes.  retry partial reads as long as its a
	 * regular file. */
	} while (ret > 0 && iocb->ki_left > 0 &&
		 (opcode == IOCB_CMD_PWRITEV ||
		  (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode))));

and here:

	/* This means we must have transferred all that we could */
	/* No need to retry anymore */
	if ((ret == 0) || (iocb->ki_left == 0))
		ret = iocb->ki_nbytes - iocb->ki_left;

ki_nbytes will remain at whatever rw_verify_area returned, and ki_left
should be zero.  Meaning, if we passed in 2G or more, we're capped at
the value Linus mentioned earlier.

In the case of a single vector, we leave ki_left alone:

> -	kiocb->ki_iovec->iov_len = kiocb->ki_left;
> +	kiocb->ki_iovec->iov_len = bytes;

Thus, we will exit the while loop due to ret == 0, but this time,
ki_nbytes will be the initial value, and ki_left will be non-zero.  The
end result is that the amount of data transferred is returned, so all is
well.

Of course, none of that matters for O_DIRECT I/O, since the return value
is -EIOCBQUEUED.  And, since buffered AIO isn't even asynchronous,
nobody /should/ be doing that anyway, so even if the above was broken,
nobody /should/ notice.  Nevertheless, I did some targeted testing of
both the O_DIRECT and the buffered cases, and it seems to be working
fine for the single and multiple vector cases.

I do have one question, though: why do we limit the total bytes in a
vectored operation to 2GB?  So long as each segment is below 2G, we
shouldn't trip up any code paths in the kernel, right?  So that just
leaves the "it would take a long time and burn kernel resources"
argument.  So I guess that's the reason?

Anyway, consider this a long-winded:

Reviewed-by: Jeff Moyer <jmoyer@xxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html