On Wed, Aug 18, 2021 at 11:58:18AM +0200, Jan Kara wrote: > On Wed 18-08-21 06:24:26, Amir Goldstein wrote: > > [...] > > > > > > Just keep in mind that the current scheme pre-allocates the single event slot > > > > on fanotify_mark() time and (I think) we agreed to pre-allocate > > > > sizeof(fsnotify_error_event) + MAX_HDNALE_SZ. > > > > If filesystems would want to store some variable length fs specific info, > > > > a future implementation will have to take that into account. > > > > > > <nod> I /think/ for the fs and AG metadata we could preallocate these, > > > so long as fsnotify doesn't free them out from under us. > > > > fs won't get notified when the event is freed, so fsnotify must > > take ownership on the data structure. > > I was thinking more along the lines of limiting maximum size for fs > > specific info and pre-allocating that size for the event. > > Agreed. If there's a sensible upperbound than preallocating this inside > fsnotify is likely the least problematic solution. > > > > For inodes... > > > there are many more of those, so they'd have to be allocated > > > dynamically. > > > > The current scheme is that the size of the queue for error events > > is one and the single slot is pre-allocated. > > The reason for pre-allocate is that the assumption is that fsnotify_error() > > could be called from contexts where memory allocation would be > > inconvenient. > > Therefore, we can store the encoded file handle of the first erroneous > > inode, but we do not store any more events until user read this > > one event. > > Right. OTOH I can imagine allowing GFP_NOFS allocations in the error > context. At least for ext4 it would be workable (after all ext4 manages to > lock & modify superblock in its error handlers, GFP_NOFS allocation isn't > harder). But then if events are dynamically allocated there's still the > inconvenient question what are you going to do if you need to report fs > error and you hit ENOMEM. Just not sending the notification may have nasty > consequences and in the world of containerization and virtualization > tightly packed machines where ENOMEM happens aren't that unlikely. It is > just difficult to make assumptions about filesystems overall so we decided > to be better safe and preallocate the event. > > Or, we could leave the allocation troubles for the filesystem and > fsnotify_sb_error() would be passed already allocated event (this way > attaching of fs-specific blobs to the event is handled as well) which it > would just queue. Plus we'd need to provide some helper to fill in generic > part of the event... > > The disadvantage is that if there are filesystems / callsites needing > preallocated events, it would be painful for them. OTOH current two users - > ext4 & xfs - can handle allocation in the error path AFAIU. > > Thinking about this some more, maybe we could have event preallocated (like > a "rescue event"). Normally we would dynamically allocate (or get passed > from fs) the event and only if the allocation fails, we would queue the > rescue event to indicate to listeners that something bad happened, there > was error but we could not fully report it. Yes. > But then, even if we'd go for dynamic event allocation by default, we need > to efficiently merge events since some fs failures (e.g. resulting in > journal abort in ext4) lead to basically all operations with the filesystem > to fail and that could easily swamp the notification system with useless > events. Hm. Going out on a limb, I would guess that the majority of fs error flood events happen if the storage fails catastrophically. Assuming that a catastrophic failure will quickly take the filesystem offline, I would say that for XFS we should probably send one last "and then we died" event and stop reporting after that. > Current system with preallocated event nicely handles this > situation, it is questionable how to extend it for online fsck usecase > where we need to queue more than one event (but even there probably needs > to be some sensible upper-bound). I'll think about it... At least for XFS, I was figuring that xfs_scrub errors wouldn't be reported via fsnotify since the repair tool is already running anyway. > > > Hmm. For handling accumulated errors, can we still access the > > > fanotify_event_info_* object once we've handed it to fanotify? If the > > > user hasn't picked up the event yet, it might be acceptable to set more > > > bits in the type mask and bump the error count. In other words, every > > > time userspace actually reads the event, it'll get the latest error > > > state. I /think/ that's where the design of this patchset is going, > > > right? > > > > Sort of. > > fsnotify does have a concept of "merging" new event with an event > > already in queue. > > > > With most fsnotify events, merge only happens if the info related > > to the new event (e.g. sb,inode) is the same as that off the queued > > event and the "merge" is only in the event mask > > (e.g. FS_OPEN|FS_CLOSE). > > > > However, the current scheme for "merge" of an FS_ERROR event is only > > bumping err_count, even if the new reported error or inode do not > > match the error/inode in the queued event. > > > > If we define error event subtypes (e.g. FS_ERROR_WRITEBACK, > > FS_ERROR_METADATA), then the error event could contain > > a field for subtype mask and user could read the subtype mask > > along with the accumulated error count, but this cannot be > > done by providing the filesystem access to modify an internal > > fsnotify event, so those have to be generic UAPI defined subtypes. > > > > If you think that would be useful, then we may want to consider > > reserving the subtype mask field in fanotify_event_info_error in > > advance. > > It depends on what exactly Darrick has in mind but I suspect we'd need a > fs-specific merge helper that would look at fs-specific blobs in the event > and decide whether events can be merged or not, possibly also handling the > merge by updating the blob. Yes. If the filesystem itself were allowed to manage the lifespan of the fsnotify error event object then this would be trivial -- we'll own the object, keep it updated as needed, and fsnotify can copy the contents to userspace whenever convenient. (This might be a naïve view of fsnotify...) > From the POV of fsnotify that would probably > mean merge callback in the event itself. But I guess this needs more > details from Darrick and maybe we don't need to decide this at this moment > since nobody is close to the point of having code needing to pass fs-blobs > with events. <nod> We ... probably don't need to decide this now. --D > > Honza > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR