Notes on diffcore API

Junio C Hamano <junkio@xxxxxxx> · Tue, 27 Jun 2006 00:28:03 -0700

I really should have done this six months ago, but I guess being
late is much better than never...

Somebody might want to do proper asciidoc and throw it in
Documentation/technical/.

-- >8 --
Notes on diffcore API
=====================

The diff generation mechanism in git is designed to be
self-contained and you should be able to call it as set of
library functions, unlike other parts of the system where "we
run once and let exit() to clean up afterwards" mentality is
dominant.  This is mainly because from early on "diff-tree
--stdin" needed to be able to process hundreds of parent-child
tree pairs without leaking.

NOTE: this note does not describe how combined diff works.  It
is quite a different animal.

The diffcore machinery works in 4 phases:

 1. setting up the machinery, including command line parsing.
 2. feeding input pairs to the machinery.
 3. letting it munge input pairs.
 4. flushing the output.

The first phase sets up the operational parameters (e.g. use of
rename detector, output format).  The second phase feeds pairs
of 'old' and 'new' files to the machinery as the front-end finds
them (e.g. diff-files compares each path found in the index and
in the working tree, and feeds the information from the index as
'old' and the information from the working tree as 'new';
diff-tree compares entries in two trees or entries in the tree
of the first parent commit and the tree of the commit).  In the
third phase, the pairs collected in the previous phase are
split, matched up, and filtered to form a different set of
pairs.  The last phase formats the resulting set of pairs for
the output.

Setting it up
-------------

The diffcore machinery takes one structure, `struct diff_options`,
to record the set of options that affects its behaviour.  These
options affect different parts of its operation, but can roughly
be classified into three groups: the ones that affects how the
input set of pairs are transformed in phase 3, and the ones that
affects how the resulting set of pairs are formatted in phase 4.

First call diff_setup() to initialize the diff_options
structure.  This gives the minimum default set of options:

 - output format defaults to --raw format;
 - output lines are LF terminated;
 - no diffcore transformation (phase 3) is used;

If you are writing a top-level diff command, you can then call
diff_opt_parse() to parse the common diff options and fill the
information in diff_options structure, but if your usage does
not require end-user customizability, you can set up the fields in
diff_options yourself without calling this function.

Then call diff_setup_done() -- this makes sure the set of
options are consistent and derives a reasonable default
(e.g. --find-copies-harder without -C does not make sense, patch
output is always recursive).

Feeding Input
-------------

Your main program feeds 'file pairs' to the diffcore machinery
by using these three functions: diff_addremove(), diff_change()
and diff_unmerge().  The first one records a path appears not in
'old' tree but in 'new' tree (or vice versa), the second one
records a path is different between 'old' and 'new', and the
third one says the comparison is meaningless for the path
because it is unmerged (this is only used by diff-index and
diff-files).

When you want to do something diff-tree does, which is quite
common, you can give two tree object names to diff_tree_sha1()
function and let it walk the trees and call these functions for
you.

To signal the end of input, call `diffcore_std()`.  This
starts the diffcore transformation described next.

Diffcore Transformation
-----------------------

The input file pairs recorded in the previous phase are
collected in diff_queued_diff (a global variable -- which means
that you cannot have two diffs running in parallel with the
current setup).  This is an expandable array of pointers to
`struct diff_filepair` structure.

The `struct diff_filepair` structure has (as the name suggests)
two pointers to `struct diff_filespec` to record the 'old' and
the 'new' file in this pair (the old one is called 'one', and
the new one 'two'), along with some information used by various
diffcore transformation.  `struct diff_filespec` records the
blob object name, pathname, size and mode among other things.
Two things to watch out for are:

 - a non-existent path is denoted by mode=0 (e.g. in a filepair
   for a deleted file, one->mode != 0 and two->mode == 0).

 - 0{40} SHA-1 is used when the filespec talks about the file in
   the working tree.

Documentation/diffcore.txt should be consulted for the details
of what each transformation does.  A short version:

 - diffcore-break breaks a filepair that modifies 'one' to 'two'
   into two filepairs that deletes 'one' and creates 'two' if
   'one' and 'two' are sufficiently dissimilar.

 - diffcore-rename matches up a filepair that deletes 'one' and
   another filepair that creates 'two' and makes them into one
   filepair.

 - diffcore-merge-broken picks up two filepairs that were
   originally one but broken by diffcore-break but did not get
   matched up by diffcore-rename.

 - diffcore-pickaxe filters out filepairs whose 'one' and 'two'
   have the same number of occurrences of the specified string.

 - diffcore-order reorders the resulting filepairs according to
   the given input.

 - after all of the above, --diff-filter is applied to remove
   the uninteresting classes of output (e.g. --diff-filter=A
   shows only additions).

Flushing output
---------------

After diffcore transformation runs, the result is still in the
same diff_queued_diff variable.  Before calling the standard
output routines, you can inspect each file pair in the queue to
see its status (e.g. what renames to what).

Especially interesting is the 'status' field of the filepair
structure.  At this point in the processing chain, each file
pair is marked with `M` (modified), `C` (copied), `R` (renamed),
etc.

Once you are done, calling diff_flush() to perform the output
and free the data structure.  If you ran the diff primarily
because you wanted to read the diff_queued_diff and you do not
want any output from this phase, set the putput_format to
DIFF_FORMAT_NO_OUTPUT before calling diff_flush() -- otherwise
you would leak memory the big way.

The raw, patch-text, diffstat and summary output all happens in
this final phase; recent work to make --patch, --stat, --raw
etc.  independent flags by Timo are primarily about phase 2 and
this phase.

-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html