On Wed 16-11-22 18:24:06, Amir Goldstein wrote:
> > > Why then give up on the POST_WRITE events idea?
> > > Don't you think it could work?
> >
> > So as we are discussing, the POST_WRITE event is not useful when we want to
> > handle crash safety. And if we have some other mechanism (like SRCU) which
> > is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> > I'm not against POST_WRITE, I just don't see much value in it if we have
> > another mechanism to deal with events straddling checkpoint.
> >
>
> Not sure I follow.
>
> I think that crash safety can be achieved also with PRE/POST_WRITE:
> - PRE_WRITE records an intent to write in persistent snapshot T
>   and add to in-memory map of in-progress writes of period T
> - When "checkpoint T" starts, new PRE_WRITES are recorded in both
>   T and T+1 persistent snapshots, but event is added only to
>   in-memory map of in-progress writes of period T+1
> - "checkpoint T" ends when all in-progress writes of T are completed

So maybe I'm missing something, but suppose the situation I was mentioning
a few emails earlier:

PRE_WRITE for F          -> F recorded as modified in T
modify F
POST_WRITE for F
PRE_WRITE for F          -> ignored because F is already marked as
                            modified
                         -> checkpoint T requested, modified files
                            reported, process modified files
modify F
--------- crash

Now unless a filesystem freeze or SRCU is part of the checkpoint, we will
never notify about the last modification to F. So I don't see how PRE +
POST_WRITE alone can achieve crash safety...

And if we use a filesystem freeze or SRCU as part of the checkpoint, then
processing of POST_WRITE events does not give us anything new. E.g.
synchronize_srcu() during the checkpoint, before handing out the list of
modified files, makes sure all modifications to files for which PRE_MODIFY
events were generated (and thus are listed as modified in checkpoint T) are
visible to userspace.

So am I missing some case where POST_WRITE would be more useful than SRCU?
Because at this point I'd rather implement SRCU than POST_WRITE.

> The trick with alternating snapshots "handover" is this
> (perhaps I never explained it and I need to elaborate on the wiki [1]):
>
> [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
>
> The changed files query results need to include recorded changes in both
> "finalizing" snapshot T and the new snapshot T+1 that was started in
> the beginning of the query.
>
> Snapshot T MUST NOT be discarded until checkpoint/handover
> is complete AND the query results that contain changes recorded
> in T and T+1 snapshots have been consumed.
>
> When the consumer ACKs that the query results have been safely stored
> or acted upon (I called this operation "bless" snapshot T+1) then and
> only then can snapshot T be discarded.
>
> After snapshot T is discarded a new query will start snapshot T+2.
> A changed files query result includes the id of the last blessed snapshot.
>
> I think this is more or less equivalent to the SRCU that you suggested,
> but all the work is done in userspace at application level.
>
> If you see any problem with this scheme or don't understand it
> please let me know and I will try to explain better.

So until now I was imagining that query results would be returned like one
big memcpy, i.e. a one-off event where the "persistent log daemon" hands
over the whole contents of checkpoint T to the client. Whatever happens
with the returned data is the business of the client, whatever happens with
the checkpoint T records in the daemon is the daemon's business. The model
you seem to speak about here is somewhat different - more like a readdir()
kind of approach where the client asks for access to checkpoint T data, the
daemon provides the data record by record (probably serving the data from
its files on disk), and when the client is done and "closes" checkpoint T,
the daemon's records about checkpoint T can be erased. Am I getting it
right?
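
For concreteness, the handover scheme quoted above could be modelled roughly
as below. This is only a toy in-memory sketch in Python under one reading of
the description, not the actual daemon: the class and method names
(SnapshotLog, pre_write, query, bless) are hypothetical, persistence is
replaced by a dict, tracking of in-progress writes is left out, and
returning the same T / T+1 pair for a repeated query before the bless is an
assumption.

```python
class SnapshotLog:
    """Toy model of the alternating-snapshot handover (hypothetical names)."""

    def __init__(self):
        self.snapshots = {1: set()}  # snapshot id -> paths recorded on PRE_WRITE
        self.current = 1             # snapshot that new PRE_WRITE events go to
        self.finalizing = None       # snapshot T while a handover is pending
        self.last_blessed = 0        # id of the last blessed snapshot

    def pre_write(self, path):
        # Record the intent to modify 'path' in the currently open snapshot.
        self.snapshots[self.current].add(path)

    def query(self):
        # A query starts snapshot T+1 unless a handover is already pending
        # (assumption: a repeated query reuses the pending T / T+1 pair).
        if self.finalizing is None:
            self.finalizing = self.current
            self.current += 1
            self.snapshots[self.current] = set()
        # Results cover both the finalizing snapshot T and the new T+1.
        changed = self.snapshots[self.finalizing] | self.snapshots[self.current]
        return self.last_blessed, self.current, sorted(changed)

    def bless(self, snapshot_id):
        # The consumer ACKs the results; only now may snapshot T be discarded.
        if self.finalizing is None or snapshot_id != self.current:
            raise ValueError("no pending handover for this snapshot")
        del self.snapshots[self.finalizing]
        self.finalizing = None
        self.last_blessed = snapshot_id


log = SnapshotLog()
log.pre_write("a")
first = log.query()        # starts snapshot 2 and reports "a" from snapshot 1
log.pre_write("b")         # lands in snapshot 2 while the handover is pending
retry = log.query()        # consumer lost the first result and asks again
log.bless(retry[1])        # results stored safely, snapshot 1 is discarded
log.pre_write("c")
second = log.query()       # starts snapshot 3
```

In this model, if the consumer loses the first result before blessing, the
retried query still reports "a" because snapshot 1 has not been discarded.
The price is over-reporting: "b" shows up again in the next query, which is
harmless for a "possibly modified" list.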

This however seems somewhat orthogonal to the SRCU idea. SRCU essentially
serves a single purpose - to make sure that modifications to all files for
which we have received a PRE_WRITE event are visible in the respective
files.

> > > > The technical problem I see is how to deal with AIO / io_uring because
> > > > SRCU needs to be released in the same context as it is acquired - that
> > > > would need to be consulted with Paul McKenney if we can make it work. And
> > > > another problem I see is that it might not be great to have this
> > > > system-wide as e.g. on networking filesystems or pipes writes can block for
> > > > really long.
> > > >
> > > > Final question is how to expose this to userspace because this
> > > > functionality would seem useful outside of filesystem notification space so
> > > > probably do not need to tie it to that.
> > > >
> > > > Or we could simplify our life somewhat and acquire SRCU when generating
> > > > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
> > > > within fsnotify and would mitigate the problems coming from system-wide
> > > > SRCU. OTOH it will create problems when PRE_WRITE gets generated and
> > > > POST_WRITE would not for some reason. Just brainstorming here, I've not
> > > > really decided what's better...
>
> Seems there are several non trivial challenges to surmount with this
> "userspace modification SRCU" idea.
>
> For now, I will stay in my comfort zone and try to make the POC
> with PRE/POST_WRITE work and write the proof of correctness.
>
> I will have no objection at all if you figure out how to solve those
> issues and guide me to a path for implementing sb_write_srcu.
> It will make the userspace implementation much simpler, getting rid
> of the in-progress writes in-memory tracking.

It seems you have progressed on this front yourself so let's continue
there :).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR