On Sat, Jan 1, 2022 at 8:59 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > On Fri, Dec 31, 2021 at 11:15:55PM +0000, Al Viro wrote: > > On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote: > > > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@xxxxxx> wrote: > > > > > > > > This series adds support for getdents64 in liburing. The intent is to > > > > provide a more complete I/O interface for io_uring. > > > > > > Ack, this series looks much more natural to me now. > > > > > > I think some of the callers of "iterate_dir()" could probably be > > > cleaned up with the added argument, but for this series I prefer that > > > mindless approach of just doing "&(arg1)->f_pos" as the third argument > > > that is clearly a no-op. > > > > > > So the end result is perhaps not as beautiful as it could be, but I > > > think the patch series DTRT. > > > > It really does not. Think what happens if you pass e.g. an odd position > > to that on e.g. ext2/3/4. Or just use it on tmpfs, for that matter. > > [A bit of a braindump follows; my apologies for reminding of some > unpleasant parts of history] > > The real problem here is the userland ABI along the lines of pread for > directories. There's a good reason why we (as well as everybody else, > AFAIK) do not have pgetdents(2). > > Handling of IO positions for directories had been causing trouble > ever since the directory layouts had grown support for long names. > Originally a directory had been an array of fixed-sized entries; back > then ls(1) simply used read(2). No special logics for handling offsets, > other than "each entry is 16 bytes, so you want to read a multiple of > 16 starting from offset that is a multiple of 16". > > As soon as FFS had introduced support for names longer than 14 characters, > the things got more complicated - there's no predicting if given position > is an entry boundary. Worse, what used to have been an entry boundary > might very well come to point to the middle of a name - all it takes is > unlink()+creat() done since the time the position used to be OK. > > Details are filesystem-dependent; e.g. for original FFS all multiples of > 512 are valid offsets, and each entry has its length stored in bytes 4 > and 5, so one needs to read the relevant 512 bytes of contents and walk > the list of entries until they get to (or past) the position that needs > to be validated. For ext2 it's a bit more unpleasant, since the chunk > size is not 512 bytes - it's a block size, i.e. might easily by 4Kb. > For more complex layouts it gets even nastier. > > Having to do that kind of validation on each directory entry access would > be too costly. That's somewhat mitigated since the old readdir(2) is no > longer used, and we return multiple entries per syscall (getdents(2)). > With the exclusion between directory reads and modifications, that allows > to limit validation to the first entry returned on that syscall. > > It's still too costly, though. The next part of mitigation is to use > the fact that valid position will remain valid until somebody modifies a > directory, so we don't need to validate if directory hadn't been changed > since the last time getdents(2) had gotten to this position. Of course, > explicit change of position since that last getdents(2) means that we > can't skip validation. And for this validation caching to work properly, AFAIU you need to hold the file->f_pos_lock (or have exclusive access to the "struct file"), which happens in the normal getdents() path via fdget_pos(). This guarantees that the readdir handler has exclusive access to the file's ->f_version, which has to stay in sync with the position. By the way, this makes me wonder: If I open() an ext4 directory and then concurrently modify the directory, call readdir() on it, and issue IORING_OP_READV SQEs with ->off == -1, can that cause ext4 to think that filesystem corruption has occurred? io_uring has some dodgy code that seems to be reading and writing file->f_pos without holding the file->f_pos_lock. And even if the file doesn't have an f_op->read or f_op->read_iter handler, I think you might be able to read ->f_pos of an ext4 directory and write the value back later, unless I'm missing a check somewhere? io_prep_rw() grabs file->f_pos; then later, io_read() calls io_iter_do_read() (which will fail with -EINVAL), and then the error path goes through kiocb_done(), which writes the position back to req->file->f_pos. So I think the following race might work: First, open an FD referencing a directory on an ext4 filesystem. Read part of the directory's entries from the FD with getdents(). Then modify the directory such that the FD's ->f_version is now stale. Then do the race: task A: start a IORING_OP_READV op on the FD and let it grab the stale offset from file->f_pos task B: run getdents() on the FD. after this, ->f_version is up-to-date and ->f_pos points to the start of a dentry task A: continue handling the IORING_OP_READV: io_iter_do_read() fails, kiocb_done() overwrites the valid ->f_pos with the stale value while leaving ->f_version up-to-date Afterwards, the file might have an ->f_pos pointing in the middle of a dentry while ->f_version indicates that it is valid (meaning it's supposed to point to the start of a dentry). If you then do another getdents() call on the FD, I think you might get garbage back and/or make the kernel think that the filesystem is corrupted (depending on what's stored at that offset)?