On Fri, Aug 28, 2020 at 08:24:57AM +1000, Dave Chinner wrote:
> On Thu, Aug 27, 2020 at 04:22:07PM +0100, Matthew Wilcox wrote:
> > On Mon, Aug 17, 2020 at 10:29:30AM +1000, Dave Chinner wrote:
> > > To implement ADS, we'd likely consider adding a new physical inode
> > > "ADS fork" which, internally, maps to a separate directory
> > > structure. This provides us with the ADS namespace for each inode
> > > and a mechanism that instantiates a physical inode per ADS. IOWs,
> > > each ADS can be referenced by the VFS natively and independently as
> > > an inode (native "file as a directory" semantics). Hence existing
> > > create/unlink APIs work for managing ADS, readdir() can list all
> > > your ADS, you can keep per ADS xattrs, etc....
> > >
> > > IOWs, with a filesystem inode fork implementation like this for ADS,
> > > all we really need is for the VFS to pass a magic command to
> > > ->lookup() to tell us to use the ADS namespace attached to the inode
> > > rather than use the primary inode type/state to perform the
> > > operation.
> > >
> > > Hence all the ADS support infrastructure is essentially dentry cache
> > > infrastructure allowing a dentry to be both a file and directory,
> > > and providing the pathname resolution that recognises an ADS
> > > redirection. Name that however you want - we've got to do an on-disk
> > > format change to support ADS, so we can tell the VFS we support ADS
> > > or not. And we have no cares about existing names in the filesystem
> > > conflicting with the ADS pathname identifier because it's a mkfs
> > > time decision. Given that special flags are needed for the openat()
> > > call to resolve an ADS (e.g. O_ALT), we know if we should parse the
> > > ADS identifier as an ADS the moment it is seen...
> >
> > I think this is equivalent to saying "Linux will never support ADS".
> > Al has some choice words on having the dentry cache support objects which
> > are both files and directories.
> > You run into some "fun" locking issues.
> > And there's lots of things you just don't want to permit, like mounting
> > a new filesystem on top of some ADS, or chrooting a process into an ADS,
> > or renaming an ADS into a different file.
>
> I know all this. My point is that the behaviour required by ADS
> objects is that of a seekable data file. That requires a struct file
> that points at a struct inode, page cache mapping, etc to all work
> as they currently do. It also means that how ADS are managed and
> presented to userspace is entirely a VFS construct. Indeed,
> everything you mention above is functionality controlled/implemented
> by the VFS via the dentry cache...

I agree with you that supporting named streams within a file requires
an independent inode for each stream. I disagree with you that this is
dentry cache infrastructure. I do not believe in giving each stream
its own dentry. Either they share the default stream's dentry, or they
have no dentry (mild preference for no dentry).

> > I think what would satisfy people is allowing actual "alternate data
> > streams" to exist in files. You always start out by opening a file,
> > then the presentation layer is a syscall that lets you enumerate the
> > data streams available for this file, and another syscall that returns
> > an fd for one of those streams.
>
> You could do this with a getdents_at() syscall that has an AT_ALT
> flag or something like that. i.e. iterate the streams on the inode
> (whether it be a regular file or a directory!) and report them as
> dirents to userspace. Userspace can then openat2(fd, name, O_ALT)
> and there is your user API.

Maybe. getdents is a little overkill; these things don't have inode
numbers (at least not ones which are meaningful to userspace), or d_type.
I might be tempted by just read() on an fd like v7 unix.
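To make the read()-on-an-fd idea concrete, here is a minimal sketch of
how userspace might pick names out of such a buffer. It assumes a
hypothetical format of fixed-size, NUL-padded name records carrying no
inode number and no d_type. The 14-byte record size is borrowed from
V7's DIRSIZ purely for illustration; STREAM_NAME_MAX and
stream_name_at() are made-up names, not an existing or proposed kernel
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record size; borrowed from V7's 14-byte DIRSIZ. */
#define STREAM_NAME_MAX 14

/*
 * Copy the idx-th stream name out of buf, where buf holds len bytes as
 * returned by read() on the (hypothetical) enumeration fd.
 * Returns 1 if a name was copied, 0 if idx is past the last complete
 * record or the slot is empty.
 */
static int stream_name_at(const char *buf, size_t len, size_t idx,
                          char out[STREAM_NAME_MAX + 1])
{
    assert(buf != NULL && out != NULL);

    /* Only complete records count; a trailing partial record is ignored. */
    if (idx >= len / STREAM_NAME_MAX)
        return 0;

    memcpy(out, buf + idx * STREAM_NAME_MAX, STREAM_NAME_MAX);
    out[STREAM_NAME_MAX] = '\0';  /* a name filling the slot has no NUL */
    return out[0] != '\0';
}
```

A caller would read() into a buffer and call stream_name_at() with
idx = 0, 1, 2, ... until it returns 0. Nothing in a record needs to be
meaningful to userspace beyond the name itself.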
> The VFS can deal with openat2(fd, stream_name, O_ALT) however it
> wants - it doesn't need the dentry cache pathwalk here - just vector
> straight to the filesystem's ->lookup mechanism on the inode
> attached to the "dirfd" passed in.
>
> AFAICT, the dentry cache only needs to be involved if we want to
> -cache- the ADS namespace. I don't think we need to cache the ADS
> namespace as long as the inode is cached by the filesystem - just
> let the fs do an inode cache lookup and instantiation for
> ADS inodes (eg as XFS already does for internal inode accesses
> during bulkstat, quotacheck, etc). We don't cache the xattr
> namespaces in the VFS - the filesystem is responsible for doing that
> if required - so I don't think this would be a problem for ADS
> access...
>
> The fact that ADS inodes would not be in the dentry cache and hence
> not visible to pathwalks at all then means that all of the issues
> such as mounting over them, chroot, etc don't exist in the first
> place...

Wait, you've now switched from "this is dentry cache infrastructure"
to "it should not be in the dentry cache". So I don't understand what
you're arguing for.

> > As long as nobody gets the bright idea to be able to link that fd into
> > the directory structure somewhere, this should avoid any problems with
> > unwanted things being done to an ADS. Chunks of your implementation
> > described above should be fine for this.
>
> I can see the need for rename and linkat linking O_TMPFILE fd's into
> ADS names, though. e.g. to be able to do safe overwrites of ADS
> data.

I don't have a problem with being able to create unnamed streams and
then atomically linking them into their containing file.

> From a fs management POV, we'll also want to be able to do things
> like defrag ADS inodes, which means we'll need to be able to do
> atomic inode operations (e.g. swap extents) between O_TMPFILE inodes
> and ADS inodes, etc.
> So in addition to the VFS interfaces, there's a
> bunch of filesystem admin stuff that will need to be made ADS aware,
> and it's likely there will be fs specific ioctls that need to be
> modified/added to manipulate ADS inodes directly...

Yes, probably.

> > For the benefit of shell scripts, I think an argument to 'cat' to open
> > an ADS and an lsads command should be enough.
> >
> > Oh, and I would think we might want i_blocks of the 'host' inode to
> > reflect the blocks allocated to all the data streams attached to the
> > inode. That should address at least parts of the data exfiltration
> > concern.
>
> I think that's a problem, because metadata blocks that are invisible
> to userspace are also accounted to the inode block count, so a user
> cannot know if the difference between the data file size and the
> block count stat() reports is block mapping metadata, xattrs,
> speculative delayed allocation reservations, etc. It's just not a
> useful signal because it's already so overloaded with invisible
> stuff....

My concern is that 'du' should not have to be made stream-aware to
continue to be accurate. Yes, all these other things also contribute to
the space being used by a file, so it's not a very reliable signal, but
if you see a vast discrepancy (several gigabytes being used by a file
which is notionally a few hundred bytes), it's suspicious.

> It also means that every block map modification to an ADS inode also
> has to lock and modify the host inode. That's going to mean adding a
> heap of complexity to the filesystem transaction models because now
> there are two independent inodes that have to be locked when doing a
> single inode operation instead of largely being a simple drop in...

It doesn't have to be reflected in the on-disk inode. As long as
calling stat() returns the number of blocks allocated to all streams
contained in the file, you can implement that any way you want.
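As a rough illustration of "any way you want", the sum can be computed
when stat() is called rather than stored in the host inode. The sketch
below is a toy userspace model only: struct stream_model, struct
host_model and reported_blocks() are hypothetical names invented for
this example, and it says nothing about how a real filesystem would
find or lock the stream inodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model; these are not kernel structures. */
struct stream_model {
    const char *name;
    uint64_t blocks;      /* 512-byte units, like st_blocks */
};

struct host_model {
    uint64_t own_blocks;  /* blocks charged to the default stream */
    const struct stream_model *streams;
    size_t nr_streams;
};

/*
 * Value an unmodified stat() caller would see in st_blocks: the host's
 * own blocks plus those of every attached stream, summed at call time
 * so the stored per-inode counts never have to change.
 */
static uint64_t reported_blocks(const struct host_model *host)
{
    assert(host != NULL);

    uint64_t total = host->own_blocks;
    for (size_t i = 0; i < host->nr_streams; i++)
        total += host->streams[i].blocks;
    return total;
}
```

With a host charged 8 blocks and one stream holding 4194304 blocks
(2 GiB), an unmodified du would report the large figure without knowing
that streams exist, which is the discrepancy signal described above.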
> IOWs, if ADS visibility is required (which I don't think anyone will
> argue against) I'd suggest that statx() has a flag added to indicate
> ADS exist on the inode. Then it's easy to discover through a
> standard interface....

The amount of space used has to be visible to unmodified utilities.
We could have an implementation where unmodified utilities walk all the
sub-streams at stat() time while statx() with the appropriate flag
reports disaggregated data (and is more efficient).

I think we have a group of people contributing to this thread who want
the plain "named streams" functionality that you and I are currently
discussing. And then another group who want something more complex
where the "alternate" contents of the file could be a directory tree
with files and subdirectories and permissions ... essentially mounting
the contents of a ZIP file on top of itself. And I think that's a level
of complexity we have to step away from.