On Wed, Oct 21, 2009 at 3:49 PM, Bernie Innocenti <bernie@xxxxxxxxxxx> wrote:
> And here's the catch: the history of individual files is not
> directly represented in a git repository. It is typically scattered
> across thousands of commit objects, with no direct links to help find
> them. If you want to retrieve the log of a file that was changed only 6
> times in the entire history of the Linux kernel, you'd have to dig
> through *all* of the 170K revisions in the "master" branch.
>
> And it takes some time even if git is blazingly fast:
>
> bernie@giskard:~/src/kernel/linux-2.6$ time git log --pretty=oneline REPORTING-BUGS | wc -l
> 6
>
> real    0m1.668s
> user    0m1.416s
> sys     0m0.210s
>
> (my laptop has a low-power CPU. A fast server would be 8-10x faster).
>
> Now, the English Wikipedia seems to have slightly more than 3M articles,
> with--how many? tens of millions of revisions for sure. Going through
> them *every time* one needs to consult the history of a file would be
> 100x slower. Tens of seconds. Not acceptable, huh?

I think this slowness could be overcome using a simple cache of
filename -> commitid list, right?

That is, you run some variant of "git log --name-only" and, for each
file changed by each commit, add an element to the commit list for that
file. When committing in the future, use a hook that updates the cache.
When you want to view the history of a particular file, you simply
retrieve the list of commits in that file's commit list and skip all
the others. (Rough sketches at the end of this message.)

It sounds like such a cache could be implemented quite easily outside
of git itself. Would that help?

That said, I'll bet you'll find other performance glitches when you
import millions of files and tens or hundreds of millions of commits.
But we probably won't know what those problems are until someone
imports them :)

Have fun,

Avery
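
P.S. To make the idea concrete, here's a minimal sketch of the initial
cache build. All the specifics are invented for illustration: the cache
is a directory of one file per tracked path (with '/' flattened to '_'),
each containing one commit id per line, newest first.

  #!/bin/sh
  # Build a filename -> commit-list cache from scratch.
  # Layout (hypothetical): one file per path under .git/file-history-cache.
  # Assumes no tracked path starts with '@' (we use it to tag commit lines
  # so they can't be mistaken for filenames).
  CACHE=.git/file-history-cache
  mkdir -p "$CACHE"

  git log --name-only --pretty=format:'@%H' |
  while IFS= read -r line; do
      case "$line" in
      @*) commit=${line#@} ;;              # start of a new commit
      '') ;;                               # blank separator line, ignore
      *)  printf '%s\n' "$commit" \
              >>"$CACHE/$(printf '%s' "$line" | tr / _)" ;;
      esac
  done

That's a one-time walk over all 170K commits; after that, the history of
REPORTING-BUGS is a 6-line file read.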
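
Keeping the cache current is the same parsing restricted to the new
commit, e.g. from a post-commit hook (again, the cache path and layout
are only illustrative):

  #!/bin/sh
  # .git/hooks/post-commit (sketch): append HEAD to the cache entry of
  # every file it touched.
  CACHE=$(git rev-parse --git-dir)/file-history-cache
  commit=$(git rev-parse HEAD)
  git show --name-only --pretty=format: HEAD |
  while IFS= read -r file; do
      [ -n "$file" ] || continue           # skip blank separator lines
      printf '%s\n' "$commit" \
          >>"$CACHE/$(printf '%s' "$file" | tr / _)"
  done

And to show the log of one file, you feed its cached commit list
straight back to git without walking history at all:

  git log --no-walk --pretty=oneline \
      $(cat .git/file-history-cache/REPORTING-BUGS)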