On Tue, 30 Jan 2007, Mike Coleman wrote:
>
> 1. As of today, is there any real safety concern with either tool's
> repo format? Is either tool significantly better in this regard?
> (Keith Packard's post hints at a problem here, but doesn't really make
> the case.)

I think Keith was nervous about hg, because hg

 (a) has changed repo formats a few times, and was talking about
     changing it again (but since I don't follow hg very closely, I
     don't know if that has happened, will happen, or was shelved)

 (b) modifies data in-place.

Git doesn't really do either. Git has extended the repository format a
few times (notably pack-files), but apart from a *really* early change
at the very beginning of development, the git repo format is identical
today to what it was originally, and you can read old repositories
without any conversion whatsoever.

Also, the git repository format is (and has always been) "stable" in
another sense: we never *ever* re-write any old data. Even when we
re-pack, we write a totally new copy, and while you'd often then get
rid of the duplicates afterwards, the operation is fundamentally safer
that way.

> 2. Does the git packed object format solve the performance problem
> alluded to in posts from a year or two ago?

If you mean the original discussions in the first few months of git
development, then yes. People used to worry that git's unpacked format
was not only slow, but would also chew up disk like mad. Both were
true, and yes, both were solved by the packed format (to the point
where I think git uses the *least* amount of disk space of any SCM
ever made ;)

HOWEVER.

Git definitely has a different "performance profile" than many other
SCM's do, and it's something worth keeping in mind. That has less to
do with the pack-files than with very fundamental git design. In
particular, *every* other SCM I am aware of does history on a per-file
basis. Git very fundamentally does not.

This means that while git outperforms just about anything else, if you
expect "individual file history" to be any faster than "whole
repository history", you're simply going to be in for a surprise. It
very fundamentally isn't.

We had this particular performance "anomaly" discussed just the other
week. People seem to be so used to the "file ID" mentality that has
its roots in RCS etc., that they expect "git log <filename>" to
somehow be faster than "git log". In git, that's simply not true.
History is *always* seen as a "full repository history". There simply
isn't anything else.

I personally don't see this as a "problem", but it definitely is
*different*. And it causes a different performance profile for various
operations than you'd see with other SCM's.

[ The reason I don't think this is a problem is that it's partly what
  makes whole-repository operations like "merge" so fast. But it's
  also the thing that causes git to very naturally not care about
  single files, and anything you can do with a single file you can
  basically do with an arbitrary set of files or directories. Which is
  *very* powerful, and as far as I know, no other SCM can effectively
  do that at all.

  As a top-level maintainer of a project with tens of thousands of
  files, I end up almost never looking at individual files: I look at
  collections of files. And that's where git shines, and almost
  everybody else falls flat on their face. But if you have the
  "single-file" mentality, you will find operations that you think git
  does badly. ]
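To make the "pathname limiter" point concrete, here's a minimal sketch
of what that usage looks like in practice (the pathnames below are
just made-up examples):

    # Whole-repository history - the "native" git operation:
    git log

    # Exactly the same machinery, just limited by a pathname filter.
    # This is not a per-file lookup, so don't expect it to be faster:
    git log Makefile

    # And since it really is only a filter over whole-tree history,
    # an arbitrary set of files and directories works just as well
    # as a single file:
    git log drivers/usb/ net/ Documentation/CodingStyle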
> 3. Someone mentioned that git bisect can work between any two
> commits, not necessarily just one that happens to be an ancestor of
> the other. This sounds really cool. Can hg's bisect do this, too?

I suspect it can - as far as I know, the whole "bisect" thing
originated with git, and hg picked up the idea from there. You'd have
to be really stupid (and/or have a horrible repo format) to not be
able to handle two unrelated commits.

HOWEVER! One thing that may make it less useful in hg is that last I
heard, hg didn't do multiple independent branches in the same
repository. So some of the more useful usage scenarios may simply not
be viable in hg at all (i.e., you'd have to merge in order to bring
the two unrelated commits into the same hg repository, and merging may
not always be possible).

So with git, you can say "that branch is good, this branch is bad,
what caused the regression?" by using "git bisect". In hg, I'm not
sure that works, simply because of the weakness of branches. But you'd
have to ask the hg lists. They do have *some* concept of branches
within a repo, so it may well be that it all works out.
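For the git side, here is a minimal sketch of that "two branches"
usage (the branch name and the tag below are made up for the example):

    # Neither endpoint needs to be an ancestor of the other; git
    # bisects over the commits that are in "bad" but not in "good":
    git bisect start
    git bisect bad experimental     # tip of one branch
    git bisect good v2.6.18         # a known-good point elsewhere

    # git checks out a commit roughly in the middle; build it, test
    # it, then tell git the verdict and repeat:
    git bisect good                 # (or "git bisect bad")

    # ... until git prints the first bad commit. Then clean up:
    git bisect reset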
> 4. What is git's index good for? I find that I like the idea of it,
> but I'm not sure I could justify its presence to someone else, as
> opposed to having it hidden in the way that hg's dircache (?) is.
> Can anyone think of a good scenario where it's a pretty obvious
> benefit?

It's a huge deal during merging with conflicts. During merging, the
index is the part that shows you what the conflicts are, and also
where you mark any conflict resolution while the working tree is still
not fully resolved. However, it's kind of hard to show the "obvious
benefit" without actually showing an example of a real (and complex)
merge conflict, and I'm way too lazy for that.

It has advantages in many other situations too, but they are more
subtle. One of the things _I_ consider to be an advantage (but which
confuses some people, because it's also another thing that makes git
different from many other SCM systems) is that the index is also where
you "prepare" your work for committing, and this is especially obvious
when adding new files.

Every single SCM has *some* kind of an index, even if it's as simple
as just the CVS "list of files I know about". So in CVS, the "index"
is really just the "CVS/Entries" list. You really can think of the git
index as just a "CVS/Entries" kind of thing, done right.

So what does "done right" mean? It means that the git index not only
lists the filenames, it lists their *contents* and status too. That
means that when you do a "git add", you don't just add a filename to
the list of files you know about, you literally add the *content*.

The reason this is important is that this is fundamentally how git
works: git doesn't actually ever really work with filenames at any
stage at all. Git works with "content" (which obviously includes the
notion of a filename, but also the mode of the file and the contents
of the file itself), and in addition it has a notion of a "pathname
limiter", which basically works on a repository "tree" level and
limits an operation to just a subset of the whole tree.

So the "index" is very much part of this - it's just another portion
of the fact that git always tracks *contents* and never tracks "file
ID's". So in CVS (or SVN), when you do a "cvs add", you really don't
add any content to the repository, you really just add a new "file ID"
to the list of files that CVS/SVN tracks. In git, when you do "git
add", you are really adding content, but that also means that the
index - the "CVS/Entries" replacement - has to be able to track things
differently.

Anyway, if you come from CVS, and have worked with it intimately
enough that you know how things like CVS/Entries work, it should
actually be fairly easy to pick up on the git index. You just need to
mentally realize "oh, it contains the contents, file mode and merge
conflict state too!"
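To make the "adding content, not filenames" point concrete, here is a
tiny sketch (the filename is made up; the SHA1 shown should be the
hash of the blob containing "hello" plus a newline):

    # "git add" stores the actual file content in the object database
    # and records it in the index - it is not just name registration:
    echo "hello" > greeting.txt
    git add greeting.txt

    # The index entry carries the mode, the content's SHA1, and the
    # name:
    git ls-files --stage greeting.txt
    # 100644 ce013625030ba8dba906f756967f9e9ca394464a 0	greeting.txt

    # Change the file again, and "git diff" compares the working tree
    # against the *indexed* content, not against any commit:
    echo "changed" > greeting.txt
    git diff              # working tree vs index
    git diff --cached     # index vs HEAD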
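And since the merge-with-conflicts case was called the really big deal
above (but "too lazy" to demonstrate), here is at least a minimal
sketch of what the index does there - the branch and file names are
made up, and the hashes are elided:

    # A conflicted merge leaves up to three "stages" of each unmerged
    # file in the index: 1 = common ancestor, 2 = ours, 3 = theirs:
    git merge somebranch          # reports a conflict in foo.c
    git ls-files --unmerged
    # 100644 <ancestor sha1> 1	foo.c
    # 100644 <our sha1>      2	foo.c
    # 100644 <their sha1>    3	foo.c

    # Fix up foo.c in the working tree, then mark the resolution,
    # which collapses the stages back into one normal index entry:
    git add foo.c
    git commit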
> 5. I think I read that there'd been just one incompatible change
> over time in the git repo format. What was it?

The original git object naming was to first compress the object, and
then calculate the SHA1 of the compressed end result. That was stupid,
and I admit it. I switched it around, so that the SHA1 is calculated
over the uncompressed contents.

However, to get some notion of how early this was: the first git
release was done on April 7, 2005. The change-over to switch the
compression and SHA1 hashing around was done on April 20, 2005. There
was an additional fix to do the date handling more sanely on April 23,
2005. The format has been stable since.

So yes, there has been one real format change, and it happened two
weeks into development, long before git was really usable by mere
mortals at all. After that, we have added capabilities to the database
(notably the packed files, and a new simplified loose object format),
but as far as I know, current git will happily read any git archive
written after April 23rd, 2005. With no data conversion necessary.

Going the other way is obviously not always possible. If you take a
git from May of 2005 and try to use it on an archive that uses
pack-files, it obviously will *not* work. But even there we've been
very careful, and unless you set some specific options in your config
file or do things like explicitly pack your branch head/tag
references, fairly old versions of git will happily read even new
archives.

> 6. Does either tool use hard links? This matters to me because I do
> development on a connected machine and a disconnected machine, using
> a usb drive to rsync between. (Perhaps there'll be some way to
> transfer changes using git or hg instead of rsync, but I haven't
> figured that out yet.)

I don't know about hg (but will assume not). Git generally does not,
but doesn't mind them either if you have them in your working tree.
And yes, there are ways to transfer changes using git natively, and
they tend to be a lot more useful and safe than rsync.

> 7. I'm a fan of Python, and I'm really a fan of using high-level
> languages with performance-critical parts in a lower-level language,
> so in that regard, I really like hg's implementation. If someone
> wanted to do it, is a Python clone of git conceivable? Is there
> something about it that just requires C?

It doesn't "require" C in the sense that the object format is actually
fairly simple, and you could do things natively in python if you
*really* wanted to. That said, the whole approach of git has always
been to write the core in C, and just make the thing very scriptable.

Some things simply are not sensible to do in a slow interpreted
language. Things like generating diffs (another name for "comparing
two trees") are fundamentally much too performance-sensitive for
anything but a serious systems language. You need a compiled language
with a good compiler; no "byte code pre-compilers" need apply. The
same goes for the "view the repository through a filename filter"
thing.

We used to have our standard "merge" function written in python, but
mainly because it was our *only* python dependency, it actually got
rewritten in C (also, people - including me - really expect to merge
two branches with 20+ _thousand_ files in them in less than a second,
so that may explain another reason why the merge got rewritten).

> 8. It feels like hg is not really comfortable with parallel
> development over time on different heads within a single repo.
> Rather, it seems that multiple repos are supposed to be used for
> this. Does this lead to any problems? For example, is it harder or
> different to merge two heads if they're in different repos than if
> they're in the same repo?

That is my understanding too, but I've not followed hg actively.

The git branching model really is superior. It might take a while to
get used to it (it took _me_ a while to get used to it ;), but once
you do, everybody else so *obviously* does it so horribly badly that
it's not even funny. So the whole "multiple branches in the same repo"
thing really shines in git. SCM's like SVN *claim* that they do
multiple branches, but they really don't. They are just confused.

		Linus