On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@xxxxxxx> wrote: > > On Tue 20-09-22 21:19:25, Amir Goldstein wrote: [...] > > Hi Jan, > > > > I wanted to give an update on the POC that I am working on. > > I decided to find a FUSE HSM and show how it may be converted > > to use fanotify HSM hooks. > > > > HTTPDirFS is a read-only FUSE filesystem that lazyly populates a local > > cache from a remote http on first access to every directory and file range. > > > > Normally, it would be run like this: > > ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/ > > > > Content is accessed via FUSE mount as /mnt/pub/ and FUSE implements > > passthrough calls to the local cache dir if cache is already populated. > > > > After my conversion patches [1], this download-only HSM can be run like > > this without mounting FUSE: > > > > sudo ./httpdirfs --fanotify --cache-location /vdf/cache > > https://cdn.kernel.org/pub/ - > > > > [1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content > > > > Browsing the cache directory at /vdf/cache, lazyly populates the local cache > > using FAN_ACCESS_PERM readdir hooks and lazyly downloads files content > > using FAN_ACCESS_PERM read hooks. > > > > Up to this point, the implementation did not require any kernel changes. > > However, this type of command does not populate the path components, > > because lookup does not generate FAN_ACCESS_PERM event: > > > > stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz > > > > To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM > > event [2] and used it to lazyly populate directories in the path ancestry. > > For now, I stuck with the XXX_PERM convention and did not require > > FAN_CLASS_PRE_CONTENT, although we probably should. > > > > [2] https://github.com/amir73il/linux/commits/fanotify_pre_content > > > > Streaming read of large files works as well, but only for sequential read > > patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events > > do not (yet) carry range info, so my naive implementation downloads > > one extra data chunk on each FAN_ACCESS_PERM until the cache file is full. > > > > This makes it possible to run commands like: > > > > tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz > > | less > > > > without having to wait for the entire 400MB file to download before > > seeing the first page. > > > > This streaming feature is extremely important for modern HSMs > > that are often used to archive large media files in the cloud. > > Thanks for update Amir! I've glanced through the series and so far it looks > pretty simple and I'd have only some style / readability nits (but let's > resolve those once we have something more complete). > > When thinking about HSM (and while following your discussion with Dave) I > wondered about one thing: When the notifications happen before we take > locks, then we are in principle prone to time-to-check-time-to-use races, > aren't we? How are these resolved? > > For example something like: > We have file with size 16k. > Reader: Writer > read 8k at offset 12k > -> notification sent > - HSM makes sure 12-16k is here and 16-20k is beyond eof so nothing to do > > expand file to 20k > - now the file contents must not get moved out until the reader is > done in order not to break it > Hi Jan, It's been a while since I updated this topic. I have been making progress on the wiki and POC, but it's not done yet. I would like to poke your brain about my proposed solutions for the TOCTOU race issues, because the solution is subtle and you may have better ideas to suggest. > > > Questions: > > - What do you think about the direction this POC has taken so far? > > - Is there anything specific that you would like to see in the POC > > to be convinced that this API will be useful? > > I think your POC is taking a good direction and your discussion with Dave > had made me more confident that this is all workable :). I liked your idea > of the wiki (or whatever form of documentation) that summarizes what we've > discussed in this thread. That would be actually pretty nice for future > reference. > The current state of POC is that "populate of access" of both files and directories is working and "race free evict of file content" is also implemented (safely AFAIK). The technique involving exclusive write lease is discussed at [1]. In a nutshell, populate and evict synchronize on atomic i_writecount and this technique can be implemented with upstream UAPIs. I did use persistent xattr marks for the POC, but this is not a must. Evictable inode marks would have worked just as well. Now I have started to work on persistent change tracking. For this, I have only kernel code, only lightly tested, but I did not prove yet that the technique is working. The idea that I started to sketch at [2] is to alternate between two groups. When a change is recorded, an evictable ignore mark will be added on the object. To start recording changes from a new point in time (checkpoint), a new group will be created (with no ignore marks) and the old group will be closed. The core of the algorithm is the "safe handover" between groups. This requires two infrastructure additions. The first is FAN_MARK_SYNC [3] as described in commit message: --- Synchronous add of mark or remove/flush of marks with ignore mask provides a method for safe handover of event handling between two groups: - First, group A subscribes to some events with FAN_MARK_SYNC - Then, group B unsubscribes from those events This method guarantees that any event that both groups subscribed to, will be delivered to either group or to both of them. Note that FAN_MARK_SYNC provides no synchronization to the object interest masks, which are checked outside srcu read side. Therefore, this method does not provide any guarantee regarding delivery of events which only one of the groups is subscribed to. For example, if only group B was subscribed to FAN_OPEN_EXEC and only group A is subscribing only to FAN_OPEN, an execution of a binary file may not deliver FAN_OPEN_EXEC to group B nor FAN_OPEN to group A. --- The second is to overlap fsnotify_mark_srcu read side with sb_start_write(), for pre modify permission events [4] as described in commit message: --- fsnotify: acquire sb write access inside pre modify permission event For pre modify permission events, acquire sb write access before leaving SRCU and return >0 to signal that sb write access was acquired. This can be used to implement safe "handover" of pre modify permission events between two fanotify groups: - First, group A subscribes to pre modify events with FAN_MARK_SYNC - Then, a freeze/thaw cycle is performed on the filesystem - Finally, group B unsubscribes from those events This method guarantees that a pre modify event that both groups subscribed to will be delivered to either group or to both of them. In case that the pre modify event is delivered only to group B, the freeze/thaw cycle guarantees that the filesystem modification that followed that pre modify event was also completed, before the handover is complete and group B can be closed. For pre rename permission event, acquire sb write access after the second of the event pair (i.e. rename to) was authorized. --- What do you think about this handover technique? Do you think that it is workable or do you see any major flaws in it? Would you use a different or an additional synchronization primitive instead of abusing fsnotify_mark_srcu? To clarify, the race that I am trying to avoid is: 1. group B got a pre modify event and recorded the change before time T 2. The actual modification is performed after time T 3. group A does not get a pre modify event, so does not record the change in the checkpoint since T Thanks, Amir. [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#invalidating-local-cache [2] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#tracking-local-modifications [3] https://github.com/amir73il/linux/commits/fan_mark_sync [4] https://github.com/amir73il/linux/commits/fan_modify_perm