On Fri, Apr 6, 2012 at 10:23, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> On Sat, Apr 7, 2012 at 12:13 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>> On Fri, Apr 6, 2012 at 08:44, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
>>> On Fri, Apr 6, 2012 at 10:24 PM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
>>>> But even so: do we make any promises that (say) git-add is atomic in the
>>>> sense that a reader always gets the before-update results or the
>>>> after-update results? Non-builtins (e.g. git add -p) may make small
>>>> incremental updates to the index, so they wouldn't be atomic anyway.
>>>
>>> Take git-checkout. I'm ok with it writing to worktree all old entries,
>>> or all new ones, but please not a mix.
>>
>> Why, what is the big deal? git-checkout has already written the file
>> to the local working tree. Its now just reflecting the updated stat
>> information in the index. If it does that after each file was touched,
>> and it aborts, you still have a partially updated working tree and the
>> index will show some updated files as stat clean, but staged relative
>> to HEAD. I don't think that is any better or worse than the current
>> situation where the working tree is shown as locally dirty but the
>> index has no staged files. Either way you have an aborted checkout to
>> recover from by retrying, or git reset --hard HEAD.
>>
>> In the retry case, checkout actually has less to do because the files
>> it already cleanly updated match where its going, and thus it doesn't
>> have to touch them again.
>
> OK, what about git-commit? If I read your description correctly, you
> can update entry sha-1 in place too.

Yes.

> Running cache-tree on half old
> half new index definitely creates a broken commit.

How is that possible? Each tree also has its own SHA-1 field.
A process trying to update a tree's SHA-1 has to snapshot the tree's
contents from the index by copying the data into its own memory buffer,
so it can compute the canonical tree data, write the object to the
repository, and get the tree's SHA-1. It then writes that SHA-1 back to
the index as of that snapshot.

If there are concurrent updates while git commit is running, it's the
same race condition that already exists. You don't know exactly where
in its execution `git commit` opens the index file and takes the
snapshot it uses to make the commit. Allowing in-place updates means
the snapshot window within git commit expands to cover a larger portion
of its running time.

Basically I would argue it is already not safe to modify the index
while git commit is running. You don't know if git commit has already
opened the index file, or will open it after the edit. The only way to
be sure right now is to make your own copy of the index and use the
GIT_INDEX_FILE environment variable to make sure git commit uses the
exact index you want.

> A command can also read (which does not require lock), update its
> internal index, then lock and write. At that time, it may accidentally
> overwrite whatever another command wrote while it was still preparing
> the index in memory.

This hypothetical command already has the bug you mention. It should be
fixed no matter what we do with regards to the index format. The *only*
safe way to update the index without losing modifications made by
another process is to lock the index *then* read it, update, and write
it back out. If you read before you take the write lock, you can
discard edits made by another process.

This is precisely the reason why the JGit library always opens, reads,
then closes the index anytime the process wants to access an entry. We
need to make sure we are viewing the correct current version.
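
For illustration only, here is a minimal Python sketch of the
lock-then-read ordering described above. It assumes a sibling ".lock"
file that is created exclusively and renamed over the target when done,
and it treats the index as an opaque blob. The names `update_locked`
and `edit` are hypothetical, not taken from git or JGit.

```python
import os
from typing import Callable


def update_locked(path: str, edit: Callable[[bytes], bytes]) -> None:
    """Take the lock first, then read, edit, and write back (hypothetical helper)."""
    lock_path = path + ".lock"
    # O_EXCL makes creation fail if another process already holds the lock.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o666)
    try:
        # Read only *after* the lock is held, so no stale cached copy is reused.
        with open(path, "rb") as f:
            current = f.read()
        new_data = edit(current)
        with os.fdopen(fd, "wb") as out:
            fd = -1  # the file object now owns the descriptor
            out.write(new_data)
        # Rename publishes the new contents and releases the lock in one step.
        os.replace(lock_path, path)
    except BaseException:
        if fd != -1:
            os.close(fd)
        if os.path.exists(lock_path):
            os.unlink(lock_path)
        raise
```

A caller that read the file before taking the lock would pass stale
bytes to `edit` and could silently drop another writer's changes, which
is the bug discussed above.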
Re-reading is even more critical when the process wants to update the
index: it *must* discard any in-memory cached data it has and re-read
the index *after* the write lock has been successfully acquired.

IMHO the risks of the update-in-place approach come down to a few
things, but none of them really are a problem:

* Readers *must* use the retry algorithm when looking at each record
  anytime the CRC-32 on an individual entry doesn't match. Retry
  requires using some form of backoff, because the concurrent writer
  needs to be given time to finish the writes to the storage file. If a
  reader doesn't correctly implement a retry, it could see corruption.

* Readers *must* check the CRC-32 of any entry. In fact the best way to
  read an entry is to memcpy() the entry's stat/SHA-1/CRC-32 from the
  index into another memory buffer, compute the checksum there, and
  compare. This way the reader can be certain the entry isn't mutated
  after it checked the CRC-32 but before it examined a particular stat
  field. Again, a buggy implementation reading from the index might not
  implement this strategy and complain about corruption, or silently
  process corrupted data.

* A partial write will leave a corrupted index. E.g. a process writing
  a record is killed before it has a chance to fully write out the
  record's data. Nobody can read that record until it is repaired.
  Repair should be possible with a combination of git reset --soft to
  copy the SHA-1 from HEAD and recomputing the working tree's SHA-1 to
  see if the file is really clean or not. It probably isn't, and the
  stat data will reflect it as dirty after the repair. We may have to
  put this sort of repair logic into `git status` and `git diff` as
  part of the normal "fix clean stat" pass.

* Appending conflicting stage information to the end of the file during
  a merge can be risky. The append might be partial. The user can fix
  this with `git reset --hard HEAD` to abort the merge.
  A partial append is probably only likely when git merge aborted
  anyway and hasn't even really left you with a sane state in which to
  try and resolve conflicts.

* Truncating away the conflicting stage information on the end of the
  file can be risky, if the file system doesn't truncate back
  correctly. But I think we can detect this and repair. If every record
  has its "conflict" bit set to 0, all records' CRC-32s are valid, and
  we hold the write lock, we know any conflict data on the end is bogus
  and should be truncated away, so we truncate again. If truncation
  isn't working correctly on this filesystem, we rewrite the entire
  index file.
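
The reader-side rules in the list above (copy the entry, verify its
CRC-32 on the private copy, retry with backoff on mismatch) could be
sketched roughly as follows. This is only a sketch under stated
assumptions: the record layout (payload followed by a 4-byte big-endian
CRC-32), the names `pack_record`, `read_record` and `fetch`, and the
retry parameters are all hypothetical, not part of any proposed index
format.

```python
import struct
import time
import zlib
from typing import Callable, Optional

CRC_SIZE = 4  # assumption: each record ends with a big-endian CRC-32 of its payload


def pack_record(payload: bytes) -> bytes:
    """Append the CRC-32 of the payload (hypothetical record layout)."""
    return payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)


def read_record(fetch: Callable[[], bytes], retries: int = 5,
                delay: float = 0.001) -> Optional[bytes]:
    """Return a verified payload, or None if every attempt saw a bad CRC-32.

    fetch() stands in for copying the record's bytes out of the shared file.
    """
    for attempt in range(retries):
        # Work on a private copy so the data cannot change between the
        # checksum comparison and later use of individual fields.
        snapshot = bytes(fetch())
        if len(snapshot) >= CRC_SIZE:
            payload, stored = snapshot[:-CRC_SIZE], snapshot[-CRC_SIZE:]
            if struct.unpack(">I", stored)[0] == zlib.crc32(payload) & 0xFFFFFFFF:
                return payload
        if attempt + 1 < retries:
            # Exponential backoff gives a concurrent writer time to finish.
            time.sleep(delay * (2 ** attempt))
    return None
```

A `None` result here corresponds to the "partial write" case: the
record stays unreadable until some repair pass rewrites it.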