Re: Ascertaining amount of "original" code across files/repo

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 23 Oct 2017 11:04:49 +0900

Thomas Adam <thomas@xxxxxxxxxx> writes:

> What I did was first of all ascertain the number of original lines in each of
> the files I was interested in:
>
> 	for i in *.[ch]
> 	do
> 		c="$(git --no-pager blame "$i" | grep -c '^\^')"
> 		[ $c -gt 0 ] && echo "$i:$c"
> 	done | sort -t':' -k2 -nr

Another approach, which I used when I was curious how many of the 1244
lines Linus originally wrote for Git in 2005 remain in today's
codebase, goes the other way [*1*].

The "reverse" approach makes use of the -S option of blame to
fabricate a hypothetical history where the very initial version of
Git is today's version, and then there is another version that was
built on it (eh, rather reduced out of it) which is Linus's
original.

	$ git tag initial e83c5163316f89
	$ cat >fake-history <<EOF
	$(git rev-parse initial) $(git rev-parse master)
	$(git rev-parse master)
	EOF

The list of files that Linus had in his original can be found out
with:

	$ git ls-tree -r --name-only initial

and you can iterate over them with a command like this:

	$ git blame -Sfake-history -s -b initial -- cache.h

A brief commentary on the options:

 * "-Sfake-history" option points at a fake-history file, which uses
   the same format as the "graft" file, to establish the fake
   ancestry.  The first line claims that Linus's 'initial'
   version has only one parent, which is our current version
   'master' (in reality, Linus's 'initial' version did not have any
   parent, of course).  The second line claims that our current
   version 'master' is a root commit without any parent.

 * "-s" squelches all metainformation other than commit object name
   from the prefix of each line; "-b" further blanks out the commit
   object name of the "root" commit---note that in this fake
   history, our current state in 'master' is what is blanked out.
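
For illustration, the two-line layout described above can be written
as a small helper.  This is only a sketch: the function name
write_fake_history is hypothetical, and it takes already-resolved
object names as arguments so that it does not depend on any
particular repository.

```shell
# Hypothetical helper: print a graft-style fake history in which
# commit $1 has $2 as its sole parent and $2 is a root commit.
# Each line is "<commit> <parent>..."; a line that holds only a
# commit name declares that commit to have no parents.
write_fake_history () {
	printf '%s %s\n' "$1" "$2"
	printf '%s\n' "$2"
}
```

It could be used as

	$ write_fake_history "$(git rev-parse initial)" "$(git rev-parse master)" >fake-history

which should produce the same file as the here-document shown
earlier.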

The output may start like so:

                     1) #ifndef CACHE_H
                     2) #define CACHE_H
                     3) 
        e83c5163316  4) #include <stdio.h>
        e83c5163316  5) #include <sys/stat.h>
        e83c5163316  6) #include <fcntl.h>
        e83c5163316  7) #include <stddef.h>

The idea is that a line that is blamed to the "root" commit
(i.e. blank prefix) is what survived since Linus's version down to
our current version.  In the fake world, Linus started from
today's version and ended up with the same result in his version for
these lines.  A line that is blamed to e83c516 is something we do
not have in today's version that is "added" by Linus in this
fake world---that in reality is what we "lost" from Linus's original
over time.
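
Counting the survivors then amounts to counting the blank-prefixed
lines in that output.  A minimal sketch, assuming blame.showRoot is
not set (so that the root commit is treated as a boundary and "-b"
really blanks its name); the function names count_surviving and
survivors_per_file are hypothetical:

```shell
# Read "git blame -s -b" output on stdin and print how many lines
# have a blanked commit name, i.e. start with a space.
count_surviving () {
	# grep -c exits non-zero when the count is 0; not an error here.
	grep -c '^ ' || true
}

# Print "<path>:<surviving lines>" for every file in 'initial',
# most survivors first.  Assumes the 'initial' tag and the
# fake-history file exist in the current directory's repository.
# Any arguments are passed to "git blame" as extra options.
survivors_per_file () {
	git ls-tree -r --name-only initial |
	while IFS= read -r path
	do
		n=$(git blame -Sfake-history -s -b "$@" initial -- "$path" |
			count_surviving)
		echo "$path:$n"
	done | sort -t: -k2 -nr
}
```

Running survivors_per_file with no arguments in the repository
should give one "path:count" line per file, in the same shape as the
output of the forward loop quoted at the top.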

By adding -M and -C on the "git blame" command line, you'll find more
lines that survived over time from Linus's original by getting moved
around inside the same file and across file boundaries.  By adding -w,
whitespace-only changes (e.g. re-indentation) would also be ignored.

I am not judging whether it is more correct to go in the forward
direction, as your approach does, or in the reverse, as I
haven't thought about it deeply enough.

[Reference]

*1* https://docs.google.com/file/d/0Bw3FApcOlPDhMFR3UldGSHFGcjQ/view

    Slide #11 was created using the above method.