From: Rutger Nijlunsing <rutger@xxxxxxxxxx>
To: Linus Torvalds <torvalds@xxxxxxxx>
Cc: Junio C Hamano <junkio@xxxxxxx>, git@xxxxxxxxxxxxxxx
Bcc:
Subject: Re: Fix up diffcore-rename scoring
Reply-To: git@xxxxxxxxxxxxxxxxx
In-Reply-To: <Pine.LNX.4.64.0603130727350.3618@xxxxxxxxxxx>
Organization: M38c
On Mon, Mar 13, 2006 at 07:38:53AM -0800, Linus Torvalds wrote:
>
>
> On Mon, 13 Mar 2006, Junio C Hamano wrote:
> >
> > By the way, the reason the diffcore-delta code in "next" does
> > not do every-eight-bytes hash on the source material is to
> > somewhat alleviate the problem that comes from not detecting
> > copying of consecutive byte ranges.
>
> Yes. However, there are better ways to do that in practice.
>
> The most effective way that is generally used is to not use a fixed
> chunk-size, but use a terminating character, together with a
> minimum/maximum chunksize.
>
> There's a pretty natural terminating character that works well for
> sources: '\n'.
>
> So the natural way to do similarity detection when most of the code is
> line-based is to do the hashing on chunks that follow the rule "minimum of
> <n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".
>
> So if you don't see any '\n' at all (or the only such one is less than <n>
> bytes into your current window), do the hash over a <2n>-byte chunk (this
> takes care of binaries and/or long lines).
>
> This - for source code - allows you to ignore trivial byte offset things,
> because you have a character that is used for synchronization. So you
> don't need to do hashing at every byte in both files - you end up doing
> the hashing only at line boundaries in practice. And it still _works_ for
> binary files, although you effectively need bigger identical chunk-sizes
> to find similarities (for text-files, it finds similarities of size <n>,
> for binaries the similarities need to effectively be of size 3*n, because
> you chunk it up at ~2*n, and only generate the hash at certain offsets in
> the source binary).
This looks like something I did last year as an experiment in the
pre-git times. The idea was to generate a patch-with-renames from two
(large) source trees.
Algorithm:
- determine md5sum for each file (same idea as git's SHA1 sum)
if changed since last run
- only look at md5sums which do not match
- pool files into types, which might depend on extension and/or MIME type.
This is an optimisation.
- Only compare filepair _within_ one pool.
- The filepair order in one pool is determined by filename-similarity.
So pair [include/asm-ppc/ioctl.h, include/asm-powerpc/ioctl.h]
is inspected before pair
[include/asm-ppc/ioctl.h, arch/arm/plat-omap/clock.h] .
- For each file, create a hash from String line -> Integer occuranced .
Similarities are calculated by comparing two hashes.
- Keep as a rename-match all files which:
- have at most 50% new lines;
- have at most 25% lines deleted from them.
I ran the code against v2.6.12 and v2.6.14 to be able to compare it
with the current contenders. Hopefully some ideas are harvestable...
Algorithm differences:
- '\n' is used as boundary, independant on line length.
This is bad for binary files, and maybe even bad for text files.
So don't harvest :)
- don't look at the intersection percentage, but look at two values:
- percentage of lines added (default: max. 50%)
- percentage of lines removed (default: max. 25%)
This assumes files get bigger during development (at most 50%), and
not too much code is deleted (at most 25%).
Disadvantages:
- Two magic numbers instead of one.
- It's non-symmetrical. Diff A->B will find different renames from
diff B->A. This scares me, actually.
- to speed up the detection:
- don't start comparing files at random. Start comparing files which
have the same 'names' in it. So when v2.6.12 has a files called
arch/arm/mach-omap/clock.c, start comparing with files which have
most words the same. Currently, '-', '.', '_' and '/' are used
as word separators.
Advantage: don't match on the first match just above the
match-threshold.
(next heuristics are all optional:)
- only compare files with the same extension. This splits up all files
into groups, which makes it much faster.
In general, there's no reason to compare a .h with a .c file.
- only compare files with the same MIME type. Same as above, but also
works for files without extensions (so don't compare README with
Makefile)
Ok, the result:
$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | wc -l
104 <-- That's bad. We're missing some renames here.
$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | sort -k 1.10
+ 0% -23% arch/arm/configs/omnimeter_defconfig -> arch/arm/configs/collie_defconfig
+ 5% - 9% arch/arm/mach-omap/board-generic.c -> arch/arm/mach-omap1/board-generic.c
+ 0% - 8% arch/arm/mach-omap/board-h2.c -> arch/arm/mach-omap1/board-h2.c
+ 0% - 5% arch/arm/mach-omap/board-h3.c -> arch/arm/mach-omap1/board-h3.c
+ 0% - 3% arch/arm/mach-omap/board-innovator.c -> arch/arm/mach-omap1/board-innovator.c
+ 0% - 9% arch/arm/mach-omap/board-netstar.c -> arch/arm/mach-omap1/board-netstar.c
+ 9% -10% arch/arm/mach-omap/board-osk.c -> arch/arm/mach-omap1/board-osk.c
+ 0% - 6% arch/arm/mach-omap/board-perseus2.c -> arch/arm/mach-omap1/board-perseus2.c
+ 3% - 8% arch/arm/mach-omap/board-voiceblue.c -> arch/arm/mach-omap1/board-voiceblue.c
+ 7% - 4% arch/arm/mach-omap/clock.c -> arch/arm/plat-omap/clock.c
+ 0% - 0% arch/arm/mach-omap/clock.h -> arch/arm/plat-omap/clock.h
+ 0% - 5% arch/arm/mach-omap/common.h -> include/asm-arm/arch-omap/common.h
+ 2% - 1% arch/arm/mach-omap/dma.c -> arch/arm/plat-omap/dma.c
+ 0% - 1% arch/arm/mach-omap/fpga.c -> arch/arm/mach-omap1/fpga.c
+11% -11% arch/arm/mach-omap/gpio.c -> arch/arm/plat-omap/gpio.c
+ 2% - 2% arch/arm/mach-omap/irq.c -> arch/arm/mach-omap1/irq.c
+ 0% - 4% arch/arm/mach-omap/leds.c -> arch/arm/mach-omap1/leds.c
+ 0% - 0% arch/arm/mach-omap/leds-h2p2-debug.c -> arch/arm/mach-omap1/leds-h2p2-debug.c
+ 0% - 0% arch/arm/mach-omap/leds-innovator.c -> arch/arm/mach-omap1/leds-innovator.c
+ 0% - 4% arch/arm/mach-omap/leds-osk.c -> arch/arm/mach-omap1/leds-osk.c
+ 0% -25% arch/arm/mach-omap/Makefile.boot -> arch/arm/mach-omap1/Makefile.boot
+ 1% - 2% arch/arm/mach-omap/mcbsp.c -> arch/arm/plat-omap/mcbsp.c
+ 0% - 6% arch/arm/mach-omap/mux.c -> arch/arm/plat-omap/mux.c
+ 0% - 0% arch/arm/mach-omap/ocpi.c -> arch/arm/plat-omap/ocpi.c
+ 1% -18% arch/arm/mach-omap/pm.c -> arch/arm/plat-omap/pm.c
+ 0% -11% arch/arm/mach-omap/sleep.S -> arch/arm/plat-omap/sleep.S
+ 6% - 4% arch/arm/mach-omap/time.c -> arch/arm/mach-omap1/time.c
+ 0% - 1% arch/arm/mach-omap/usb.c -> arch/arm/plat-omap/usb.c
+ 2% - 1% arch/ia64/sn/include/pci/pcibr_provider.h -> include/asm-ia64/sn/pcibr_provider.h
+ 0% - 2% arch/ia64/sn/include/pci/pic.h -> include/asm-ia64/sn/pic.h
+ 0% - 0% arch/ia64/sn/include/pci/tiocp.h -> include/asm-ia64/sn/tiocp.h
+ 3% -23% arch/m68knommu/platform/68VZ328/de2/config.c -> arch/m68knommu/platform/68VZ328/config.c
+ 1% -18% arch/mips/configs/osprey_defconfig -> arch/mips/configs/qemu_defconfig
+ 0% -12% arch/mips/vr41xx/zao-capcella/setup.c -> arch/mips/vr41xx/common/type.c
+ 0% - 0% arch/ppc64/oprofile/op_impl.h -> include/asm-ppc64/oprofile_impl.h
+ 3% -23% arch/ppc/configs/ash_defconfig -> arch/ppc64/configs/bpa_defconfig
+ 2% -21% arch/ppc/configs/beech_defconfig -> arch/ppc/configs/ev64360_defconfig
+ 5% -20% arch/ppc/configs/cedar_defconfig -> arch/ppc/configs/mpc8548_cds_defconfig
+ 9% -17% arch/ppc/configs/k2_defconfig -> arch/ppc/configs/bamboo_defconfig
+ 3% -25% arch/ppc/configs/mcpn765_defconfig -> arch/xtensa/configs/common_defconfig
+ 2% -23% arch/ppc/configs/oak_defconfig -> arch/frv/defconfig
+ 3% -16% arch/ppc/configs/SM850_defconfig -> arch/ppc/configs/mpc86x_ads_defconfig
+ 3% -13% arch/ppc/configs/SPD823TS_defconfig -> arch/ppc/configs/mpc885ads_defconfig
+19% -15% arch/um/kernel/tempfile.c -> arch/um/os-Linux/mem.c
+ 0% - 5% arch/x86_64/kernel/semaphore.c -> lib/semaphore-sleepers.c
+ 0% - 6% drivers/i2c/chips/adm1021.c -> drivers/hwmon/adm1021.c
+ 0% - 4% drivers/i2c/chips/adm1025.c -> drivers/hwmon/adm1025.c
+ 0% -17% drivers/i2c/chips/adm1026.c -> drivers/hwmon/adm1026.c
+ 0% - 3% drivers/i2c/chips/adm1031.c -> drivers/hwmon/adm1031.c
+ 0% - 4% drivers/i2c/chips/asb100.c -> drivers/hwmon/asb100.c
+ 1% - 4% drivers/i2c/chips/ds1621.c -> drivers/hwmon/ds1621.c
+ 0% - 1% drivers/i2c/chips/fscher.c -> drivers/hwmon/fscher.c
+ 0% - 2% drivers/i2c/chips/fscpos.c -> drivers/hwmon/fscpos.c
+ 0% - 2% drivers/i2c/chips/gl518sm.c -> drivers/hwmon/gl518sm.c
+ 0% - 2% drivers/i2c/chips/gl520sm.c -> drivers/hwmon/gl520sm.c
+ 3% -19% drivers/i2c/chips/it87.c -> drivers/hwmon/it87.c
+ 4% -22% drivers/i2c/chips/lm63.c -> drivers/hwmon/lm63.c
+ 0% - 6% drivers/i2c/chips/lm75.c -> drivers/hwmon/lm75.c
+ 0% - 2% drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h
+ 0% - 3% drivers/i2c/chips/lm77.c -> drivers/hwmon/lm77.c
+ 2% - 5% drivers/i2c/chips/lm78.c -> drivers/hwmon/lm78.c
+ 0% - 3% drivers/i2c/chips/lm80.c -> drivers/hwmon/lm80.c
+ 2% -21% drivers/i2c/chips/lm83.c -> drivers/hwmon/lm83.c
+ 0% - 3% drivers/i2c/chips/lm85.c -> drivers/hwmon/lm85.c
+ 0% - 4% drivers/i2c/chips/lm87.c -> drivers/hwmon/lm87.c
+ 4% -20% drivers/i2c/chips/lm90.c -> drivers/hwmon/lm90.c
+ 0% - 3% drivers/i2c/chips/lm92.c -> drivers/hwmon/lm92.c
+ 0% - 3% drivers/i2c/chips/max1619.c -> drivers/hwmon/max1619.c
+ 0% - 7% drivers/i2c/chips/sis5595.c -> drivers/hwmon/sis5595.c
+ 0% -11% drivers/i2c/chips/smsc47b397.c -> drivers/hwmon/smsc47b397.c
+ 0% - 9% drivers/i2c/chips/smsc47m1.c -> drivers/hwmon/smsc47m1.c
+ 0% -23% drivers/i2c/chips/via686a.c -> drivers/hwmon/via686a.c
+ 0% - 4% drivers/i2c/chips/w83627hf.c -> drivers/hwmon/w83627hf.c
+ 1% - 5% drivers/i2c/chips/w83781d.c -> drivers/hwmon/w83781d.c
+ 1% - 3% drivers/i2c/chips/w83l785ts.c -> drivers/hwmon/w83l785ts.c
+14% -17% drivers/i2c/i2c-sensor-vid.c -> drivers/hwmon/hwmon-vid.c
+ 0% - 0% drivers/infiniband/include/ib_cache.h -> include/rdma/ib_cache.h
+ 0% - 3% drivers/infiniband/include/ib_fmr_pool.h -> include/rdma/ib_fmr_pool.h
+ 9% - 7% drivers/infiniband/include/ib_mad.h -> include/rdma/ib_mad.h
+ 0% - 0% drivers/infiniband/include/ib_pack.h -> include/rdma/ib_pack.h
+ 1% - 6% drivers/infiniband/include/ib_sa.h -> include/rdma/ib_sa.h
+ 0% -11% drivers/infiniband/include/ib_smi.h -> include/rdma/ib_smi.h
+ 3% - 6% drivers/infiniband/include/ib_user_mad.h -> include/rdma/ib_user_mad.h
+ 4% - 2% drivers/infiniband/include/ib_verbs.h -> include/rdma/ib_verbs.h
+ 0% -16% include/asm-ppc64/ioctl.h -> include/asm-powerpc/ioctl.h
+ 0% - 9% include/asm-ppc64/ioctls.h -> include/asm-powerpc/ioctls.h
+ 5% - 9% include/asm-ppc64/mc146818rtc.h -> include/asm-powerpc/mc146818rtc.h
+ 0% - 5% include/asm-ppc64/mman.h -> include/asm-powerpc/mman.h
+ 2% -25% include/asm-ppc64/sembuf.h -> include/asm-powerpc/sembuf.h
+ 3% -13% include/asm-ppc64/shmbuf.h -> include/asm-powerpc/shmbuf.h
+ 0% -15% include/asm-ppc64/sockios.h -> include/asm-powerpc/sockios.h
+ 1% - 5% include/asm-ppc64/topology.h -> include/asm-powerpc/topology.h
+ 0% -15% include/asm-ppc64/user.h -> include/asm-powerpc/user.h
+ 0% -21% include/asm-ppc/agp.h -> include/asm-powerpc/agp.h
+12% -16% include/asm-ppc/msgbuf.h -> include/asm-xtensa/msgbuf.h
+ 5% -25% include/asm-ppc/namei.h -> include/asm-powerpc/namei.h
+ 4% -18% include/asm-ppc/param.h -> include/asm-powerpc/param.h
+ 0% -13% include/asm-ppc/poll.h -> include/asm-powerpc/poll.h
+ 0% -24% include/asm-ppc/shmbuf.h -> include/asm-xtensa/shmbuf.h
+ 1% -17% include/asm-ppc/socket.h -> include/asm-powerpc/socket.h
+ 0% - 9% include/asm-ppc/string.h -> include/asm-powerpc/string.h
+ 1% -10% include/asm-ppc/termbits.h -> include/asm-powerpc/termbits.h
+ 0% - 3% include/asm-ppc/termios.h -> include/asm-powerpc/termios.h
+ 5% -22% include/asm-ppc/unaligned.h -> include/asm-powerpc/unaligned.h
Regards,
Rutger.
--
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------
#!/usr/bin/env ruby
# Usage: shpatch.rb --help
require 'md5'
require 'ostruct'
require 'optparse'
$config = OpenStruct.new
$config.command = :PATCH
$config.same_base = false
$config.same_ext = true
$config.same_mime = false
$config.changed_content = true
$config.max_removed = 25 # 0 .. 100
$config.max_added = 50
$config.verbose = false
# Default dirglobs to ignore
ignore_globs = [
"BitKeeper", "PENDING", "SCCS", "CVS", "*.state", "*.o", "*.a", "*.so",
"*~", "#*#", "*.orig", "*.dll"
]
# Option parsing
$opts = OptionParser.new
$opts.banner = %Q{\
Generate a shellpatch file, or perform the patch in a shellpatch file.
A shellpatch file is a patch file which contains shell-commands
including 'mv' and 'patch'.
Determining the renames uses a lot of heuristics and a brute-force
approach; your milage may vary. All trivial file renames are handled
by comparing the complete contents. All remaining files (the list of
added and removed files) in then searched through to find matching
pairs: this is quite costly
A cache of md5 sums is kept at the root of the repositories to make
finding differences fast.
(c)2005 R. Nijlunsing <shpatch@xxxxxxxxxxxxx>
License: GPLv2
Usage: shpatch [options]
Defaults options are within [brackets].
}
$opts.separator("Diff options")
$opts.on("-d", "--diff PATH1,PATH2", Array,
"Generate a shellpatch of the diff", "between two directories") {
|paths|
if paths.size != 2
raise Exception.new("Need two directories for --diff")
end
$config.command = :DIFF
$config.paths = paths
}
$opts.separator("Diff options for heuristics to finding renames with changed content")
$opts.on("--[no-]changed-content",
"Find renames with changed content [#{$config.changed_content}]" ) { |cc|
$config.changed_content = cc
}
$opts.on("--[no-]same-base",
"Rename only to files with same basename [#{$config.same_base}]") { |sb|
$config.same_base = sb
}
$opts.on("--[no-]same-ext",
"Rename only to same extention [#{$config.same_ext}]") { |se|
$config.same_ext = se
}
$opts.on("--[no-]same-mime",
"Rename only to same mimetype [#{$config.same_mime}]") { |sm|
$config.same_mime = sm
}
$opts.on("--max-removed PERC", String,
"Max. percentage of source file which may",
"be removed while still being considered",
"a rename [#{$config.max_removed}]"
) { |perc| $config.max_removed = perc.to_i }
$opts.on("--max-added PERC", String,
"Max. percentage of destination file which may",
"be added while still being considered",
"a rename [#{$config.max_added}]"
) { |perc| $config.max_added = perc.to_i }
$opts.separator("Options to add to current patch")
$opts.on("--mv SOURCE DEST", String, String,
"Adds a rename to the current patch", "and perform the rename") {
|path1, path2|
$config.command = :MV
$config.paths = [path1, path2]
}
$opts.separator("General options")
$opts.on("--[no-]verbose", "-v", "Be more verbose") { |v| $config.verbose = v }
$opts.on("--help", "-h", "This usage") { puts $opts; exit 1 }
%Q{
Examples:
shpatch.rb --diff linux-2.6.8,linux-2.6.9 --max-removed 10
Generate a shellpatch with renames from directories
linux-2.6.8 to linux-2.6.9 . At most 10% of a file may be removed
between versions, otherwise they are considered different.
}.split("\n").each { |line| $opts.separator(line) }
begin
$opts.parse!(ARGV)
rescue Exception
puts "#{$opts}\n!!! #{$!}"
exit 1
end
module Shell
# Escape string string so that it is parsed to the string itself
# E.g. Shell.escapeString("what's in a name") = "what\'s\ in\ a\ name"
# Compare to Regexp.escape
def Shell.escape(string)
string.gsub(%r{([^-._0-9a-zA-Z/])}i, '\\\\\1')
end
end
# One hunk in the patch
class RenameHunk
attr_accessor :from, :to # Strings: pathname from and to
def initialize(from, to)
# puts "# Found a rename: #{Shell.escape(from)} -> #{Shell.escape(to)}"
@from = from; @to = to
end
def command; "mv"; end
def to_s; "#{command} #{Shell.escape(@from)} #{Shell.escape(@to)}"; end
def execute(repo)
File.rename("#{repo.root}/#@from", "#{repo.root}/#@to")
end
end
class DeleteHunk
attr_accessor :pathname
def initialize(pathname); @pathname = pathname; end
def command; "rm"; end
def to_s; "#{command} #{Shell.escape(@pathname)}"; end
def execute(repo); File.delete("#{repo.root}/#@pathname"); end
end
class PatchHunk
attr_accessor :from, :to, :contents
def initialize(repo1, from, repo2, to)
@from = from; @to = to
end
def command; "patch"; end
def to_s
long_from = Shell.escape((from[0] == ?/ ? "" : repo1.root + "/") + from)
long_to = Shell.escape((to[0] == ?/ ? "" : repo2.root + "/") + to)
puts "# Diffing #{long_from} -> #{long_to}" if $config.verbose
@contents = File.popen("diff --unified #{long_from} #{long_to}") { |io|
io.read
}
mark = "_SHPATCHMARK_"
# Make mark unique
mark += rand(10).to_s while @contents.index(mark)
"#{command} <<#{mark}\n#{@contents}#{mark}"
end
end
# A filesystem as backing store
class FileSystem
SHPATCHSTATE_FILE = ".shpatch.state"
SHPATCHSTATE_VERSION_STRING = "shpatch.rb state version 20050418-2"
attr_accessor :root
attr_accessor :cache_file # String: filename with signatures
attr_accessor :signature_cache # From Fixnum inode to Array [mtime, sig]
attr_accessor :signature_cache_changed # Boolean
# Reads the cache. When not readable in current directory, go
# up a level ('..')
def read_signatures
@signature_cache = {}
@signature_cache_changed = false
@cache_file = File.expand_path("#@root/#{SHPATCHSTATE_FILE}")
cache_file = @cache_file
loop {
if FileTest.readable?(cache_file)
File.open(cache_file, "rb") do |file|
version_string = file.readline.chomp
if version_string == SHPATCHSTATE_VERSION_STRING
begin
@signature_cache = Marshal.load(file)
puts "# Read signature cache with #{@signature_cache.size} signatures from #{cache_file.inspect}" if $config.verbose
@cache_file = cache_file
break
rescue ArgumentError, EOFError
puts "# (error reading state file: rebuilding file...)" if $config.verbose
end
end
end
end
parent_cache_file = File.expand_path(
File.dirname(cache_file) + "/../" + File.basename(cache_file)
)
break if parent_cache_file == cache_file
cache_file = parent_cache_file
}
end
def initialize(root)
raise "#{root.inspect} does not exist" if not File.exists?(root)
@root = root
read_signatures
end
def save_signatures
# Save all unsaved signature cache
return if !@signature_cache_changed
puts "# Saving #{@signature_cache.size} signatures..." if $config.verbose
pf = @cache_file
File.open("#{pf}.new", "wb+") do |file|
file.puts SHPATCHSTATE_VERSION_STRING
Marshal.dump(@signature_cache, file)
File.rename("#{pf}.new", pf)
end
end
# Returns array of [mtime, one-line signature-string]
def signature(stat, filename)
signature = nil
key = [stat.dev, stat.ino]
cache = @signature_cache[key]
if cache and (cache[0] == stat.mtime)
signature = cache[1]
else
if $config.verbose
why = (cache ? "#{(stat.mtime - cache[0]).to_i}s out of date" : "not indexed")
puts "# Creating signature for #{filename.inspect} (#{why})"
end
signature = MD5.new(File.read(filename)).digest
@signature_cache[key] = [stat.mtime, signature]
@signature_cache_changed = true
end
signature
end
def signature_from(prefix, res, from, ignoreRe)
Dir.new("#{prefix}#{from}").entries.each { |elem|
next if (elem == ".") or (elem == "..")
fullname = "#{prefix}#{from}/#{elem}"
if not fullname =~ ignoreRe
stat = File.stat(fullname)
if stat.directory?
signature_from(prefix, res, "#{from}/#{elem}", ignoreRe)
else
rel_filename = "#{from}/#{elem}"[1..-1]
res[rel_filename] = signature(stat, fullname)
end
end
}
end
# Returns all filenames within this filesystem with all signatures
def signatures(ignoreRe)
res = {}
prefix = File.expand_path(@root)
signature_from(prefix, res, "", ignoreRe)
save_signatures
res
end
def mime_type(filename)
path = @root + "/" + filename
($mime_cache ||= {})[path] ||=
File.popen("file --mime #{Shell.escape(path)}") { |io| io.read }.
gsub(%r{^.*:}, "").strip
end
# Read the contents of a file
def read(filename); File.read(@root + "/" + filename); end
end
patch = []
dir1, dir2 = $config.paths
repo1 = FileSystem.new(dir1)
repo2 = FileSystem.new(dir2)
def re_from_globs(globs)
Regexp.new(
"(\\A|/)(" + globs.collect { |glob|
Regexp.escape(glob).gsub("\\*", "[^/]*")
}.join("|") + ")$"
)
end
ignore_globs += ["BitKeeper/etc/ignore", ".cvsignore"].collect { |a|
["#{dir1}/#{a}", "#{dir2}/#{a}"]
}.flatten.find_all { |f| File.exists?(f) }.collect { |f|
File.readlines(f).collect { |line| line.chomp }
}.flatten
ignore_globs = ignore_globs.uniq.sort
ignoreRe = re_from_globs(ignore_globs)
puts "# Retrieving signatures of #{dir1.inspect}" if $config.verbose
file2sig1 = repo1.signatures(ignoreRe)
puts "# Retrieving signatures of #{dir2.inspect}" if $config.verbose
file2sig2 = repo2.signatures(ignoreRe)
files1 = file2sig1.keys.sort
files2 = file2sig2.keys.sort
common_files = files1 - (files1 - files2)
# Different hash, same filename: patch
common_files.each { |fname|
if file2sig1[fname] != file2sig2[fname]
patch << PatchHunk.new(repo1, fname, repo2, fname)
end
file2sig1.delete(fname)
file2sig2.delete(fname)
}
# Same hash, different filename: rename
sig2file1 = file2sig1.invert
sig2file2 = file2sig2.invert
sigs1 = sig2file1.keys
sigs2 = sig2file2.keys
common_sigs = sigs1 - (sigs1 - sigs2)
common_sigs.each { |sig|
from = sig2file1[sig]
to = sig2file2[sig]
patch << RenameHunk.new(from, to)
sig2file1.delete(sig)
sig2file2.delete(sig)
file2sig1.delete(from)
file2sig2.delete(to)
}
# statistics of contents of a file. Used for quick-compare
class FileContentStats
attr_accessor :size # Size of file in lines
attr_accessor :lines # Hash from String to Fixnum
# Counter number of lines removed and added as a percentage
# of the total file length. These are a measure for the degree
# of matching between the files.
def diff_match(other)
added = 0
removed = 0
@lines.each_pair { |line, count|
delta = other.lines[line] - count
if delta > 0
added += delta
else
removed += -delta
end
}
other.lines.each_pair { |line, count|
added += count if not @lines[line]
}
[added * 100 / other.size, removed * 100 / self.size]
end
def initialize(repo, path)
@lines = Hash.new(0)
size = 0
repo.read(path).delete("\0").each_line { |line|
@lines[line.intern] += 1
size += 1
}
@size = size
end
def self.cached(repo, path)
@@cache ||= {}
@@cache[[repo, path]] ||= self.new(repo, path)
end
end
# Categorize a file based on filename and/or contents
def pool_type(repo, path)
res = []
res << File.basename(path) if $config.same_base
res << File.extname(path) if $config.same_ext
res << repo.mime_type(path) if $config.same_mime
res
end
# Determine how much a filename looks like another filename
# by splitting the filenames into words. Then count the
# words which are the same.
def path_correlation(path1, path2)
comp1 = path1.split(%r{[-._/]})
comp2 = path2.split(%r{[-._/]})
(comp1 - (comp1 - comp2)).size
end
class Array
# The inverse of an array is an hash from contents to index number.
def inverse; res = {}; each_with_index { |e, idx| res[e] = idx }; res; end
end
if $config.changed_content
files1 = file2sig1.keys.sort
files2 = file2sig2.keys.sort
all_added_files = files2 - files1
all_removed_files = files1 - files2
pools = {} # Group files into 'pools'
all_removed_files.each { |removed_file|
(pools[pool_type(repo1, removed_file)] ||= [[], []])[0] << removed_file
}
all_added_files.each { |added_file|
(pools[pool_type(repo2, added_file)] ||= [[], []])[1] << added_file
}
pools.each_pair { |key, pool|
removed_files, added_files = *pool
if $config.verbose and not removed_files.empty? and not added_files.empty?
puts "# Comparing pool type #{key.inspect} with #{pool[0].size}x#{pool[1].size} filepairs"
end
# Determine how 'special' or 'specific' a word is. We start with
# filenames containing special words.
words = {} # Group files by 'words'
removed_files.each { |removed_file|
removed_file.split(%r{[-._/]+}).uniq.each { |word|
words[word] ||= [[], []]
words[word][0] << removed_file
}
}
added_files.each { |added_file|
added_file.split(%r{[-._/]+}).uniq.each { |word|
words[word] ||= [[], []]
words[word][1] << added_file
}
}
word_importance = words.keys.find_all { |word|
(words[word][0].size * words[word][1].size) > 0
}.sort_by { |word|
words[word][0].size * words[word][1].size
}.reverse
# p word_importance
word_importance = word_importance.inverse
word_importance.default = 0
removed_files.sort_by { |removed_file|
removed_file.split(%r{[-._/]+}).uniq.inject(0) { |s, e|
[s, word_importance[e]].max
}
}.reverse.each { |removed_file|
# puts removed_file
removed_file_stats = FileContentStats.new(repo1, removed_file)
added_files.sort_by { |f| -path_correlation(removed_file, f) }.
each { |added_file|
added_file_stats = FileContentStats.cached(repo2, added_file)
removed_size = removed_file_stats.size
added_size = added_file_stats.size
min_added = (added_size - removed_size) * 100 / added_size
next if min_added > $config.max_added
min_removed = (removed_size - added_size) * 100 / removed_size
next if min_removed > $config.max_removed
# Calculate added & removed percentages
added, removed = removed_file_stats.diff_match(added_file_stats)
if (added <= $config.max_added) && (removed <= $config.max_removed)
# We found a rename-match!
puts "+%2i%% -%2i%% #{removed_file} -> #{added_file}" % [added, removed] #if $config.verbose
patch << RenameHunk.new(removed_file, added_file)
# Don't match again against this added file:
added_files -= [added_file]
all_added_files -= [added_file]
all_removed_files -= [removed_file]
patch << PatchHunk.new(repo1, removed_file, repo2, added_file)
break
end
}
}
}
end
all_added_files.each { |added_file|
patch << PatchHunk.new(repo1, "/dev/null", repo2, added_file)
}
all_removed_files.each { |removed_file|
patch << PatchHunk.new(repo1, removed_file, repo2, "/dev/null")
}
#patch.each { |hunk| puts hunk.to_s }