On Tue, 12 Jan 2021, Taylor Blau wrote:
> > > ++
> > > +*NOTE*: this operation can race with concurrent modification to the
> > > +source repository, similar to running `cp -r src dst` while modifying
> > > +`src`.

> > Couldn't `gc` be triggered by git in seemingly read-only operations,
> > thus possibly ruining the analogy with `cp` while doing `rm` (explicit
> > intent to modify)?

> > Moreover, situation is also a bit different since a sane user script
> > would not place `rm` into background to keep operating on original
> > source right before doing `cp` -- and that is what is happening here:

> If you're suggesting that something is missing from the above patch, I'm
> not sure I quite understand what you would like added.

Slept on it. I think your patch (doc disclaimer) is factually correct
and probably as good as it can get. I am not yet sure whether it is
worth explicitly mentioning `gc` or `repack` as examples of such
concurrent operations.

> All of these (background gc, explicit rm-ing) fall under the category of
> "concurrent modification": they are changing the source directory in
> some way while a read operation is taking place.

Yes. My comment was more about how such modifications are triggered:
via explicit actions (e.g. `rm`) intended to modify, vs. as
housekeeping running in the background, which is the case for gc, in
particular when it is triggered by seemingly read-only operations.

> > `git` operation is presumably complete (but leaves `gc` running in the
> > background) and script advances to the next step only to run into a race
> > condition with that preceding `git` command which apparently triggered
> > `gc`. Should then any script which operates on local `git` repositories
> > not to forget to add -c gc.autodetach=0 for every git
> > invocation which might be potentially effected?

> If your workflow is that you are frequently cloning via the local
> transport and there is no other synchronization going on between
> whatever work is happening in the source repository, then yes. (But note
> of course that you can set gc.autodetach=0 via the source repository's
> .git/config rather than typing it each time).

IMHO it affects efficiency, becomes cumbersome (for git users), and
thus might be error-prone: gc.autodetach=0 is needed only to guard
against a possible subsequent `clone` invocation operating locally.
Higher-level constructs sitting on top of `git` (like datalad in our
case) would not know which command the user script runs next, so they
cannot tell when to set this config variable for their own
invocations. Adding gc.autodetach=0 to every single `git` invocation
would hurt our efficiency. Users might not be made aware of this
requirement for running `git clone` on local repositories: only after
their scripts are deployed and start hitting the race condition at
random points in time would they go into "google" and RTFM mode to
figure out what is going on.

That is why I am more in line with your initial comment in
https://lore.kernel.org/git/X%2FipCPFyW3gAWrHo@nand.local/ :

> Perhaps Git could take some sort of lock when writing to the object
> store, but an flock wouldn't work since we'd want to allow multiple
> readers to acquire the lock simultaneously, so long as there is no
> writer.

I think it would be nice to have `clone_local()` first check that no
modifications are in progress before proceeding, wait some reasonable
amount of time (up to ?0 sec?) if they are, and then fail
"informatively" if it still cannot clone. Even though this would not
prevent the race condition in full (`clone_local` might check and
start, and then some process could begin altering the source while
`clone_local` is still running), it would mitigate the scripted cases
where a local `git clone` follows some heavy manipulation of the
original repository that triggers a background gc.
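
To make the wait-then-fail idea a bit more concrete, here is a rough
sketch in Python of what a wrapper script could do today. It is not a
patch to `clone_local()`. It assumes that a running `git gc` keeps a
`gc.pid` file in the source repository's git directory and treats the
presence of that file as "modification in progress". The names
`wait_for_gc` and `clone_when_quiet` are made up for illustration, and
the timeout is deliberately left to the caller. It only notices gc, not
other writers, and it has the same check-then-act window as described
above.

```python
import os
import subprocess
import time


def wait_for_gc(git_dir, timeout, poll=0.5):
    """Poll until git_dir/gc.pid is gone.

    Returns True if no gc.pid is present before `timeout` seconds pass,
    False otherwise. Assumption: a running `git gc` holds gc.pid in the
    git directory; a stale file left by a crashed gc would also make
    this return False.
    """
    pid_file = os.path.join(git_dir, "gc.pid")
    deadline = time.monotonic() + timeout
    while os.path.exists(pid_file):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


def clone_when_quiet(src, dst, timeout):
    """Clone src to dst locally, but only once no gc appears to be running."""
    # Non-bare repositories keep their git directory in src/.git;
    # for a bare repository the git directory is src itself.
    dot_git = os.path.join(src, ".git")
    git_dir = dot_git if os.path.isdir(dot_git) else src
    if not wait_for_gc(git_dir, timeout):
        raise RuntimeError(
            "%s still appears to be under gc after %s sec; not cloning"
            % (src, timeout)
        )
    subprocess.run(["git", "clone", src, dst], check=True)
```

A script would call `clone_when_quiet(src, dst, timeout=...)` in place
of a bare `git clone src dst`. Doing the equivalent inside git itself
would of course be preferable, since git knows about all of its own
writers and not just gc.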

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik