On 01/25, Junio C Hamano wrote: > Thomas Gummerer <t.gummerer@xxxxxxxxx> writes: > > > On 01/24, Junio C Hamano wrote: > >> To put it differently, if a blob stored at path has CRLF line > >> endings and .gitattributes is changed after the fact to say that it > >> must have LF line endings, we should treat it as a broken transitory > >> state. > > > > Right, I wasn't considering this as a broken state, because t0025 uses > > just this to transition between the states. > > > >> There may have to be a way to "fix" an already "wrong" blob > >> in the index that is milder than "rm --cached && add .", but I do > >> not think write_entry(), which is shared by all the normal codepaths > >> that writes out to the working tree, is the best place to do so, if > >> doing so forces the contents of the paths to be always re-checked, > >> just in case the user is in such a broken transitory state. > > > > Maybe I'm misunderstanding something, but the contents of the paths > > are only re-checked if we are in such a broken transition state, and > > What I do not understand here is how the added check ensures that > "only if in such a broken transition state". would_convert_to_git() > does not take the contents but is called only with the pathname to > key into the attributes, so in a typical cross platform project > where all the source files are "text" and the repository can set > core.eol to adjust the end of line convention for its working tree, > the added check has no way to differentiate the paths that are > recorded with CRLF line endings in the object database by mistake, > i.e. the ones in the broken transitory state, and all the other > paths that are following the "text" and storing their blobs with LF > line endings. The added check would trigger "is it racy" check to > re-reads the contents that we have written out from the working tree > for the paths with "wrong" blobs, but how would it avoid doing so > for the paths whose blobs are already stored correctly? > > The new code affects not just "reset --hard", but everybody who > writes out from the object database to the working tree and records > that these paths are checked out in the index. How does the new > code avoid destroying the performance for all paths? I misunderstood the way would_convert_to_git() works, I should have actually read the code, instead of just relying on my test, which was even wrong. Sorry about the confusion, my patch does indeed hurt the performance. > > the file stored in git has crlf line endings, and thus would be > > normalized. In this case we currently re-check the contents of that > > file anyway, at least when the file and the index have the same mtime, > > and we actually show the correct output. > > The right way to at that "correct output", I think, is that it > happens to be shown that way by accident. It is questionable that > it is correct to report that such a path is modified. Immediately > after you check out a path to the working tree out of the index, via > "reset --hard" and "checkout $path", by definition it ought to be > the same between the working tree file and the index. > > Unless it is in this transititory broken state, that is. > > The "by accident" happens only because racy-git avoidance is being > implemented in one particular way. Is about protecting people from > making changes to the working tree files immediately after their > previous contents are registered to the index (and the index files > written to the disk), and immediately after we write things out of > the index and by definition the result and the indexed blob ought to > match there is no reason to re-read and recheck unless the working > tree files are further edited. > > The current way the racy-git avoidance works by re-reading and > re-checking the contents when the timestamps match is merely one > possible implementation, and that is the only thing that produces > your "correct" output most of the time in this test. We could have > waited after writing the index time for a second before giving the > control back to the user instead, which would have also allowed us > to implement the racy-git avoidance, and in that alternate world, > all these transitory broken paths would have been correctly reported > as unmodified. > > IOW, I would think the test in question is insisting a single > outcome for an operation whose result is undefined, and it is > harmful to twist the system by pessimizing the common cases just > to cater to this transititory broken state. > > I am not saying that we shouldn't have support for users to fix > their repository and get out of this transititory broken state. A > recent work by Torsten Bögershausen to have ls-files report the end > of line convention used in the blob in the index and the settings > that affect conversion for each path (among other things) is a step > in the right direction. With a support like that, those who noticed > that they by mistake added CRLF files to the index as-is when they > wanted their project to be cross platform can recover from it by > setting necessary attributes (i.e. mark them as "text") and then > find paths that are broken out of "ls-files --eol" output to see > which ones are not using lf end-of-line in the index. > > I do not think there is a canned command to help dealing with these > broken paths right now. You would have to check them out of the > index (you would get a CRLF file in the working tree in the example > we are discussing), fix the line endings (you would run dos2unix on > it in this example, as you would want "text=true" attribute) and > "git add" them to recover manually, but I can imagine that Torsten's > work can be extended to do all of these, without molesting the > working tree files, with minimum work by the end user. That is: > > * Reuse Torsten's "ls-files --eol" code to find paths that record > the blob in the index that does not follow the eol convention > specified for the path; > > * For each of these index entries, run convert_to_working_tree() on > the indexed contents, and then on the result of it, run > convert_to_git(). The result is the blob that the index ought to > have had, if it were to be consistent with the attribute > settings. So add that to the index. > > * Write the index out. > > * Tell the user to commit or commit it automatically with a canned > log message "fix broken encoding" or something. Thanks for the thorough explanation, and the patch in the next email, I'll have a look at that tomorrow. -- Thomas -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html