On Wed, 21 Oct 2009, Bernie Innocenti wrote:

> And here's the catch: the history of individual files is not directly
> represented in a git repository. It is typically scattered across
> thousands of commit objects, with no direct links to help find them.
> If you want to retrieve the log of a file that was changed only 6
> times in the entire history of the Linux kernel, you have to dig
> through *all* of the 170K revisions in the "master" branch.
>
> And it takes some time, even though git is blazingly fast:
>
> bernie@giskard:~/src/kernel/linux-2.6$ time git log --pretty=oneline REPORTING-BUGS | wc -l
> 6
>
> real    0m1.668s
> user    0m1.416s
> sys     0m0.210s
>
> (my laptop has a low-power CPU. A fast server would be 8-10x faster.)
>
> Now, the English Wikipedia seems to have slightly more than 3M
> articles, with how many revisions? Tens of millions, for sure. Going
> through them *every time* someone needs to consult the history of a
> file would be 100x slower. Tens of seconds. Not acceptable, huh?
>
> It seems to me that the typical usage pattern of an encyclopedia is
> to change each article individually. Perhaps I'm underestimating the
> role of bots here. Anyway, there's no consistency *requirement* for
> mass changes to be applied atomically across the entire encyclopedia,
> right?

You certainly don't need to put all files in the same tree, then.
Splitting the whole thing into sections that are unlikely to overlap
would be the way to go. You could then give each section its own
branch containing no other files, or even rely on git submodules.
The partitioning doesn't have to be either of the two extremes: one
branch per file à la CVS, or all files in a single branch/tree as git
does by default.
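For illustration, here are two minimal sketches (the section names and
repository URLs below are made up for the example). Per-section
branches sharing one repository could be created as empty, unrelated
branches:

  # with a reasonably recent git: start a branch with no history
  git checkout --orphan science
  git rm -rf .    # drop the old tree; this branch will hold only
                  # this section's articles

And with submodules, each section becomes its own small repository
tracked from a superproject:

  git init wiki && cd wiki
  git submodule add git://example.org/sections/science.git science
  git submodule add git://example.org/sections/history.git history
  git commit -m "track each section as its own submodule"

Either way, "git log <article>" only has to walk the commits of one
section rather than the whole encyclopedia's history.

Nicolas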