Re: diff'ing files ...

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jun 06, 2011 at 06:49:04PM +0000, Albretch Mueller wrote:

>  I could not find specific information on what diff'ing algorithm does
> git use. Any white papers about git internals, mostly about its
> diff'ing and synching algorithm? (and I am not scared at all of
> "Directed Acyclic Graph" or any kind of Math terminology; in fact, I
> love it ;-))

It depends on what level you want to talk about. Each commit has a tree
state represented by a single "tree object". Each tree object is a list
of paths mapping to more objects, each of which may be a binary stream
of data ("blob") or another tree object (i.e., a subtree).

Diffing commits means diffing their trees. Diffing trees means looking
for blobs at the same path in each commit and diffing them[1]. How we
diff the content in the blobs depends on:

  1. any external diff configuration (e.g., GIT_EXTERNAL_DIFF in the
     environment, or a per-file diff driver; see "git help attributes").
     In that case, we just show whatever output the external diff
     produces.

  2. We use xdiff by default, which does a line-oriented diff using the
     Myers algorithm (just like most "diff" implementations). See
     xdiff/xdiffi.c in the git source, or Myers "An O(ND) Difference
     Algorith and its Variations".

  3. If you give the --patience option, we use the patience algorithm
     instead:

       http://bramcohen.livejournal.com/73318.html

     That's the inventor's original description, but you can find other
     discussion by googling for "patience diff".

  4. If you use --word-diff (or --color-words), we find differences by
     line, and then break lines down by configurable delimiters into
     words, and then do a Myers diff on the resulting list of words.

  5. If files are not text, we usually just say "Binary files differ"
     (like most diff implementations). We can also generate binary
     diffs, though I don't know offhand the details of the algorithm.

I think that more or less covers 2-way diffing. There's also a 3-way
comparison when merging, both at the tree level and at the blob level.
Syncing actually has very little to do with our regular diffing. We
generate compact binary deltas (just like in (5) above) between objects,
regardless of their path, and send the other side a set of needed
objects. This is the same format used for compact "packfile" storage.
If the last bit of that paragraph didn't make any sense, then read up on
git's object model in something like "Pro Git":

  http://progit.org/book/ch9-0.html

-Peff

[1] Actually, tree comparisons are a little more complex than I made
them out to be. For instance, a path might exist on only one side (which
we usually show as a comparison to an empty blob), or might exist as a
tree on one side and a blob on the other. We also look for renames of
paths at the blob level by looking for similar blobs.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]