Re: [PATCH 1/2] patience diff: remove unnecessary string comparisons

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Dscho

On 05/05/2021 15:58, Johannes Schindelin wrote:
Hi Junio,

On Wed, 5 May 2021, Junio C Hamano wrote:

"Phillip Wood via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

From: Phillip Wood <phillip.wood@xxxxxxxxxxxxx>

xdl_prepare_env() calls xdl_classify_record() which arranges for the
hashes of non-matching lines to be different so lines can be tested
for equality by comparing just their hashes.

Hmph, that is a bit different from what I read from the comment in
the post context of the first hunk, though.

	/*
	 * After xdl_prepare_env() (or more precisely, due to
	 * xdl_classify_record()), the "ha" member of the records (AKA lines)
	 * is _not_ the hash anymore, but a linearized version of it.  In
	 * other words, the "ha" member is guaranteed to start with 0 and
	 * the second record's ha can only be 0 or 1, etc.
	 *
	 * So we multiply ha by 2 in the hope that the hashing was
	 * "unique enough".
	 */

The words "home" and "enough" hints to me that the "ha" member is
not hash, but "lineralized version of it" (whatever it means) does
not guarantee that two records with the same "ha" are identical, or
does it?

Well, I should just go read xdl_classify_record() to see what it
really does, but if it eliminates collisions, then the patch is a
clear and obvious improvement.

Right. I had the same concern. But it does look as if
`xdl_classify_record()` replaced the possibly non-unique hash values to
unique sequential identifiers.

I have to admit that the code is unnecessarily hard to read for me:
https://github.com/git/git/blob/v2.31.1/xdiff/xprepare.c#L110-L157

But I do gather that the loop at
https://github.com/git/git/blob/v2.31.1/xdiff/xprepare.c#L119-L123
is called for every line, that it does compare it to every seen line with
the same hash, and that it exits the loop early if the contents disagree:

	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
		if (rcrec->ha == rec->ha &&
				xdl_recmatch(rcrec->line, rcrec->size,
					rec->ptr, rec->size, cf->flags))
			break;

Since naming is hard (and you can easily err on saving space at the
expense of costing readers' time, as libxdiff proves), and since I am
running out of review time, I'll have to assume that
https://github.com/git/git/blob/v2.31.1/xdiff/xprepare.c#L150-L154 means
that indeed, the `ha` field is set to a counter that uniquely identifies
the line contents:

	rec->ha = (unsigned long) rcrec->idx;


	hi = (long) XDL_HASHLONG(rec->ha, hbits);
	rec->next = rhash[hi];
	rhash[hi] = rec;

So I am fairly confident that the patch is good, and the performance win
is nice.

Thanks for taking the time to review this patch, I agree with your analysis. The output of `git log -p --diff-algorithm=patience origin/master` for the whole history of git.git is unchanged by this patch.

Best Wishes

Phillip


Thanks!
Dscho


Thanks.

diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 20699a6f6054..db2d53e89cb0 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -90,7 +90,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
  {
  	xrecord_t **records = pass == 1 ?
  		map->env->xdf1.recs : map->env->xdf2.recs;
-	xrecord_t *record = records[line - 1], *other;
+	xrecord_t *record = records[line - 1];
  	/*
  	 * After xdl_prepare_env() (or more precisely, due to
  	 * xdl_classify_record()), the "ha" member of the records (AKA lines)
@@ -104,11 +104,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
  	int index = (int)((record->ha << 1) % map->alloc);

  	while (map->entries[index].line1) {
-		other = map->env->xdf1.recs[map->entries[index].line1 - 1];
-		if (map->entries[index].hash != record->ha ||
-				!xdl_recmatch(record->ptr, record->size,
-					other->ptr, other->size,
-					map->xpp->flags)) {
+		if (map->entries[index].hash != record->ha) {
  			if (++index >= map->alloc)
  				index = 0;
  			continue;
@@ -253,8 +249,7 @@ static int match(struct hashmap *map, int line1, int line2)
  {
  	xrecord_t *record1 = map->env->xdf1.recs[line1 - 1];
  	xrecord_t *record2 = map->env->xdf2.recs[line2 - 1];
-	return xdl_recmatch(record1->ptr, record1->size,
-		record2->ptr, record2->size, map->xpp->flags);
+	return record1->ha == record2->ha;
  }

  static int patience_diff(mmfile_t *file1, mmfile_t *file2,





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux