On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote: > Hi all, > > This series creates a new FIEXCHANGE_RANGE system call to exchange > ranges of bytes between two files atomically. This new functionality > enables data storage programs to stage and commit file updates such that > reader programs will see either the old contents or the new contents in > their entirety, with no chance of torn writes. A successful call > completion guarantees that the new contents will be seen even if the > system fails. > > The ability to exchange file fork mappings between files in this manner > is critical to supporting online filesystem repair, which is built upon > the strategy of constructing a clean copy of a damaged structure and > committing the new structure into the metadata file atomically. > > User programs will be able to update files atomically by opening an > O_TMPFILE, reflinking the source file to it, making whatever updates > they want to make, and exchange the relevant ranges of the temp file > with the original file. It's probably worth noting that the "reflinking the source file" here is optional, right? IOW one can just: - open(O_TMPFILE) - write() - ioctl(FIEXCHANGE_RANGE) I suspect the "simpler" non-database cases (think e.g. editors operating on plain text files) are going to be operating on an in-memory copy; in theory of course we could identify common ranges and reflink, but it's not clear to me it's really worth it at the tiny scale most source files are. > The intent behind this new userspace functionality is to enable atomic > rewrites of arbitrary parts of individual files. For years, application > programmers wanting to ensure the atomicity of a file update had to > write the changes to a new file in the same directory More sophisticated tools already are using O_TMPFILE I would say, just with a final last step of materializing it with a name, and then rename() into place. So if this also obviates the need for https://lore.kernel.org/linux-fsdevel/364531.1579265357@xxxxxxxxxxxxxxxxxxxxxx/ that seems good. > Exchanges are atomic with regards to concurrent file opera‐ > tions, so no userspace-level locks need to be taken to obtain > consistent results. Implementations must guarantee that read‐ > ers see either the old contents or the new contents in their > entirety, even if the system fails. But given that we're reusing the same inode, I don't think that can *really* be true...at least, not without higher level serialization. A classic case today is dconf in GNOME is a basic memory-mapped database file that is atomically replaced by the "create new file, rename into place" model. Clients with mmap() view just see the old data until they reload explicitly. But with this, clients with mmap'd view *will* immediately see the new contents (because it's the same inode, right?) and that's just going to lead to possibly split reads and undefined behavior - without extra userspace serialization or locking (that more proper databases) are going to be doing. Arguably of course, dconf is too simple and more sophisticated tools like sqlite or LMDB could make use of this. (There's some special atomic write that got added to f2fs for sqlite last I saw...I'm curious if this could replace it) But still...it seems to me like there's going to be quite a lot of the "potentially concurrent reader, atomic replace desired" pattern and since this can't replace that, we should call that out explicitly in the man page. And also if so, then there's still a need for the linkat(AT_REPLACE) etc. > XFS_EXCHRANGE_TO_EOF I kept reading this as some sort of typo...would it really be too onerous to spell it out as XFS_EXCHANGE_RANGE_TO_EOF e.g.? Echoes of unix "creat" here =)