Re: [PATCH v7 0/3] io_uring: add getdents64 support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Dec 31, 2021 at 11:15:55PM +0000, Al Viro wrote:
> On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote:
> > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@xxxxxx> wrote:
> > >
> > > This series adds support for getdents64 in liburing. The intent is to
> > > provide a more complete I/O interface for io_uring.
> > 
> > Ack, this series looks much more natural to me now.
> > 
> > I think some of the callers of "iterate_dir()" could probably be
> > cleaned up with the added argument, but for this series I prefer that
> > mindless approach of just doing "&(arg1)->f_pos" as the third argument
> > that is clearly a no-op.
> > 
> > So the end result is perhaps not as beautiful as it could be, but I
> > think the patch series DTRT.
> 
> It really does not.  Think what happens if you pass e.g. an odd position
> to that on e.g. ext2/3/4.  Or just use it on tmpfs, for that matter.

[A bit of a braindump follows; my apologies for reminding of some
unpleasant parts of history]

The real problem here is the userland ABI along the lines of pread for
directories.  There's a good reason why we (as well as everybody else,
AFAIK) do not have pgetdents(2).

Handling of IO positions for directories had been causing trouble
ever since the directory layouts had grown support for long names.
Originally a directory had been an array of fixed-sized entries; back
then ls(1) simply used read(2).  No special logics for handling offsets,
other than "each entry is 16 bytes, so you want to read a multiple of
16 starting from offset that is a multiple of 16".

As soon as FFS had introduced support for names longer than 14 characters,
the things got more complicated - there's no predicting if given position
is an entry boundary.  Worse, what used to have been an entry boundary
might very well come to point to the middle of a name - all it takes is
unlink()+creat() done since the time the position used to be OK.

Details are filesystem-dependent; e.g. for original FFS all multiples of
512 are valid offsets, and each entry has its length stored in bytes 4
and 5, so one needs to read the relevant 512 bytes of contents and walk
the list of entries until they get to (or past) the position that needs
to be validated.  For ext2 it's a bit more unpleasant, since the chunk
size is not 512 bytes - it's a block size, i.e. might easily by 4Kb.
For more complex layouts it gets even nastier.

Having to do that kind of validation on each directory entry access would
be too costly.  That's somewhat mitigated since the old readdir(2) is no
longer used, and we return multiple entries per syscall (getdents(2)).
With the exclusion between directory reads and modifications, that allows
to limit validation to the first entry returned on that syscall.

It's still too costly, though.  The next part of mitigation is to use
the fact that valid position will remain valid until somebody modifies a
directory, so we don't need to validate if directory hadn't been changed
since the last time getdents(2) had gotten to this position.  Of course,
explicit change of position since that last getdents(2) means that we
can't skip validation.

Another fun part is synthetic filesystems like tmpfs - there we don't
have any persistent directory contents we could use as source of offsets.
All we have is a cyclic list of dentries; memory address of dentry is
obviously not a candidate - trying to validate _that_ would be beyond
unpleasant.  So we use the position in the list.  To avoid the O(directory
size) cost of walking the list to the position we are asked to read from,
we insert a cursor into the list and have directory reads and seeks move
it as needed.  File position is not authoritative there - the cursor is.
They can get out of sync - if you read almost to the end of directory,
unlink the first file, use lseek(fd, SEEK_CUR, 0) to find position, then
lseek to 0 and back to the position reported, you'll end up one entry
further than you would without those lseeks.  Userland is generally OK
with that (and we are within the POSIX warranties).  However, if that
kind of "out of sync" can happen without any directory modifications
involved, we have trouble.

Directories are not well-suited for random access.  Not since mid-80s.
It's possible, but not cheap and there's a lot of non-obvious corner cases
where directory modifications are involved.  Some of those must work,
or the userland will break.  Telling which ones those are is not trivial.

pgetdents(2) would be a bad idea by itself; making it asynchronous is
the stuff of nightmares.  Please, reconsider that ABI - AFAICS, there's
no way to do it cheaply and it will take modifying each ->iterate_dir()
just to avoid breaking the existing uses of those.



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux