A generalization of git blame

Hi,

I have been developing a git tool (built on the git internal API) that
finds all the commits that have changed a given line, in order to
attribute authorship more accurately.

The motivation is my research on binary code authorship, where I use
machine learning to classify who wrote a piece of code. To produce
training data, I start with a source code repository that has well-known
author labels for each line and then compile the project into a binary.
That way I know the authorship of the binary code and can apply machine
learning techniques to it.

To get the ground truth of authorship for each line, I started with
git-blame. But I later found that this is not sufficient: the last commit
may only add a comment or change a small part of the line, in which case
I shouldn't attribute the line of code to its last author. Of course,
who should count as the representative author of a line is debatable.
So what I would like to do is find all the commits that have ever
changed a line, and then try different approaches to summarizing these
commits into a final authorship label (or even a tuple of labels).
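
To make the summarization step concrete, here is a rough Python sketch of
one possible approach. It is not the tool itself. It assumes the history
of a single line is already available as a list of (author, line text)
pairs from oldest to newest (for example from the tool above, or from
something like `git log -L`), and it credits each non-whitespace
character of the final line to the author who introduced it. The function
name and the input format are made up for illustration.

```python
import difflib
from collections import defaultdict


def summarize_line_authorship(history):
    """Estimate per-author shares of the final version of one line.

    `history` is a hypothetical input format: a list of
    (author, line_text) pairs ordered from oldest to newest commit.
    Returns a dict mapping author -> fraction of the non-whitespace
    characters in the final line credited to that author.
    """
    if not history:
        return {}

    # owners[i] is the author credited with character i of `text`.
    first_author, text = history[0]
    owners = [first_author] * len(text)

    for author, new_text in history[1:]:
        matcher = difflib.SequenceMatcher(None, text, new_text, autojunk=False)
        # Start by crediting the whole new line to this commit's author,
        # then hand unchanged characters back to their earlier owners.
        new_owners = [author] * len(new_text)
        for old_pos, new_pos, size in matcher.get_matching_blocks():
            new_owners[new_pos:new_pos + size] = owners[old_pos:old_pos + size]
        text, owners = new_text, new_owners

    counts = defaultdict(int)
    for char, owner in zip(text, owners):
        if not char.isspace():
            counts[owner] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    return {author: n / total for author, n in counts.items()}


if __name__ == "__main__":
    example = [
        ("alice", "x = compute(a, b)"),
        ("bob", "x = compute(a, b)  # sum"),
    ]
    print(summarize_line_authorship(example))
```

With a history like the one in the example, git-blame would report the
author of the last commit, while this sketch still credits most of the
line to the author who wrote the code. A character count is only one of
many possible weightings; ignoring comment-only changes or counting
tokens instead would be other options.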

I was wondering whether there have been similar debates over accurate
authorship in this community before and whether there may be other people
interested in this work.

Thanks

--Xiaozhu
