Re: [PATCH v7 0/3] io_uring: add getdents64 support

Jann Horn <jannh@xxxxxxxxxx> · Mon, 3 Jan 2022 08:03:51 +0100

On Sat, Jan 1, 2022 at 8:59 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Dec 31, 2021 at 11:15:55PM +0000, Al Viro wrote:
> > On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote:
> > > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@xxxxxx> wrote:
> > > >
> > > > This series adds support for getdents64 in liburing. The intent is to
> > > > provide a more complete I/O interface for io_uring.
> > >
> > > Ack, this series looks much more natural to me now.
> > >
> > > I think some of the callers of "iterate_dir()" could probably be
> > > cleaned up with the added argument, but for this series I prefer that
> > > mindless approach of just doing "&(arg1)->f_pos" as the third argument
> > > that is clearly a no-op.
> > >
> > > So the end result is perhaps not as beautiful as it could be, but I
> > > think the patch series DTRT.
> >
> > It really does not.  Think what happens if you pass e.g. an odd position
> > to that on e.g. ext2/3/4.  Or just use it on tmpfs, for that matter.
>
> [A bit of a braindump follows; my apologies for reminding of some
> unpleasant parts of history]
>
> The real problem here is the userland ABI along the lines of pread for
> directories.  There's a good reason why we (as well as everybody else,
> AFAIK) do not have pgetdents(2).
>
> Handling of IO positions for directories had been causing trouble
> ever since the directory layouts had grown support for long names.
> Originally a directory had been an array of fixed-sized entries; back
> then ls(1) simply used read(2).  No special logics for handling offsets,
> other than "each entry is 16 bytes, so you want to read a multiple of
> 16 starting from offset that is a multiple of 16".
>
> As soon as FFS had introduced support for names longer than 14 characters,
> the things got more complicated - there's no predicting if given position
> is an entry boundary.  Worse, what used to have been an entry boundary
> might very well come to point to the middle of a name - all it takes is
> unlink()+creat() done since the time the position used to be OK.
>
> Details are filesystem-dependent; e.g. for original FFS all multiples of
> 512 are valid offsets, and each entry has its length stored in bytes 4
> and 5, so one needs to read the relevant 512 bytes of contents and walk
> the list of entries until they get to (or past) the position that needs
> to be validated.  For ext2 it's a bit more unpleasant, since the chunk
> size is not 512 bytes - it's a block size, i.e. might easily by 4Kb.
> For more complex layouts it gets even nastier.
>
> Having to do that kind of validation on each directory entry access would
> be too costly.  That's somewhat mitigated since the old readdir(2) is no
> longer used, and we return multiple entries per syscall (getdents(2)).
> With the exclusion between directory reads and modifications, that allows
> to limit validation to the first entry returned on that syscall.
>
> It's still too costly, though.  The next part of mitigation is to use
> the fact that valid position will remain valid until somebody modifies a
> directory, so we don't need to validate if directory hadn't been changed
> since the last time getdents(2) had gotten to this position.  Of course,
> explicit change of position since that last getdents(2) means that we
> can't skip validation.

And for this validation caching to work properly, AFAIU you need to
hold the file->f_pos_lock (or have exclusive access to the "struct
file"), which happens in the normal getdents() path via fdget_pos().
This guarantees that the readdir handler has exclusive access to the
file's ->f_version, which has to stay in sync with the position.

By the way, this makes me wonder: If I open() an ext4 directory and
then concurrently modify the directory, call readdir() on it, and
issue IORING_OP_READV SQEs with ->off == -1, can that cause ext4 to
think that filesystem corruption has occurred?

io_uring has some dodgy code that seems to be reading and writing
file->f_pos without holding the file->f_pos_lock. And even if the file
doesn't have an f_op->read or f_op->read_iter handler, I think you
might be able to read ->f_pos of an ext4 directory and write the value
back later, unless I'm missing a check somewhere?

io_prep_rw() grabs file->f_pos; then later, io_read() calls
io_iter_do_read() (which will fail with -EINVAL), and then the error
path goes through kiocb_done(), which writes the position back to
req->file->f_pos. So I think the following race might work:

First, open an FD referencing a directory on an ext4 filesystem. Read
part of the directory's entries from the FD with getdents(). Then
modify the directory such that the FD's ->f_version is now stale. Then
do the race:

task A: start a IORING_OP_READV op on the FD and let it grab the stale
offset from file->f_pos
task B: run getdents() on the FD. after this, ->f_version is
up-to-date and ->f_pos points to the start of a dentry
task A: continue handling the IORING_OP_READV: io_iter_do_read()
fails, kiocb_done() overwrites the valid ->f_pos with the stale value
while leaving ->f_version up-to-date

Afterwards, the file might have an ->f_pos pointing in the middle of a
dentry while ->f_version indicates that it is valid (meaning it's
supposed to point to the start of a dentry). If you then do another
getdents() call on the FD, I think you might get garbage back and/or
make the kernel think that the filesystem is corrupted (depending on
what's stored at that offset)?