Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes: > On Sun, May 20, 2012 at 8:28 PM, manish honap > <manish_honap_vit@xxxxxxxxxxx> wrote: >> Hello Linus, >> >> The overflow issue was seen during async dio path > > Christ. fs/aio.c doesn't do the proper rw_verify_area(). > > As a result, it doesn't check file locks, and it doesn't seem to check > offset overflows either. > > The vector versions kind of get the size limit by mistake (because > they at least use rw_copy_check_uvector(), which does limit things to > MAX_RW_COUNT), but they don't do the offset overflow check either. > > Does this patch work for you? What it *should* do is the same that the > other read/write paths do (and the vector path for aio already do), > namely truncate reads or writes to MAX_RW_COUNT (which is INT_MAX > aligned down to a page). OK, this took me a while to wrap my head around, mostly due to confusion with ki_nbytes and ki_left. When doing p{read|write}v, aio_nbytes is used to store the number of vectors. Thus, we need to set both ki_left and ki_nbytes to the result of the rw_verify_area call, which means that when ki_left goes to 0, we'll exit out of this loop: do { ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg], iocb->ki_nr_segs - iocb->ki_cur_seg, iocb->ki_pos); if (ret > 0) aio_advance_iovec(iocb, ret); /* retry all partial writes. retry partial reads as long as its a * regular file. */ } while (ret > 0 && iocb->ki_left > 0 && (opcode == IOCB_CMD_PWRITEV || (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode)))); and here: /* This means we must have transferred all that we could */ /* No need to retry anymore */ if ((ret == 0) || (iocb->ki_left == 0)) ret = iocb->ki_nbytes - iocb->ki_left; ki_nbytes will remain at whatever rw_verify_area returned, and ki_left should be zero. Meaning, if we passed in 2G or more, we're capped at the value Linus mentioned earlier. In the case of a single vector, we leave ki_left alone: > - kiocb->ki_iovec->iov_len = kiocb->ki_left; > + kiocb->ki_iovec->iov_len = bytes; Thus, we will exit the while loop due to ret == 0, but this time, ki_nbytes will be the initial value, and ki_left will be non-zero. The end result is that the amount of data transferred is returned, so all is well. Of course, none of that matters for O_DIRECT I/O, since the return value is -EIOCBQUEUED. And, since buffered AIO isn't even asynchronous, nobody /should/ be doing that anyway, so even if the above was broken, nobody /should/ notice. Nevertheless, I did some targeted testing of both the O_DIRECT and the buffered cases, and it seems to be working fine for the single and multiple vector cases. I do have one question, though: why do we limit the total bytes in a vectored operation to 2GB? So long as each segment is below 2G, we shouldn't trip up any code paths in the kernel, right? So that just leaves the "it would take a long time and burn kernel resources" argument. So I guess that's the reason? Anyway, consider this a long-winded: Reviewed-by: Jeff Moyer <jmoyer@xxxxxxxxxx> -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html