Hi Junio, On Wed, 5 May 2021, Junio C Hamano wrote: > "Phillip Wood via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > > > From: Phillip Wood <phillip.wood@xxxxxxxxxxxxx> > > > > xdl_prepare_env() calls xdl_classify_record() which arranges for the > > hashes of non-matching lines to be different so lines can be tested > > for equality by comparing just their hashes. > > Hmph, that is a bit different from what I read from the comment in > the post context of the first hunk, though. > > /* > * After xdl_prepare_env() (or more precisely, due to > * xdl_classify_record()), the "ha" member of the records (AKA lines) > * is _not_ the hash anymore, but a linearized version of it. In > * other words, the "ha" member is guaranteed to start with 0 and > * the second record's ha can only be 0 or 1, etc. > * > * So we multiply ha by 2 in the hope that the hashing was > * "unique enough". > */ > > The words "home" and "enough" hints to me that the "ha" member is > not hash, but "lineralized version of it" (whatever it means) does > not guarantee that two records with the same "ha" are identical, or > does it? > > Well, I should just go read xdl_classify_record() to see what it > really does, but if it eliminates collisions, then the patch is a > clear and obvious improvement. Right. I had the same concern. But it does look as if `xdl_classify_record()` replaced the possibly non-unique hash values to unique sequential identifiers. I have to admit that the code is unnecessarily hard to read for me: https://github.com/git/git/blob/v2.31.1/xdiff/xprepare.c#L110-L157 But I do gather that the loop at https://github.com/git/git/blob/v2.31.1/xdiff/xprepare.c#L119-L123 is called for every line, that it does compare it to every seen line with the same hash, and that it exits the loop early if the contents disagree: for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next) if (rcrec->ha == rec->ha && xdl_recmatch(rcrec->line, rcrec->size, rec->ptr, rec->size, cf->flags)) break; Since naming is hard (and you can easily err on saving space at the expense of costing readers' time, as libxdiff proves), and since I am running out of review time, I'll have to assume that https://github.com/git/git/blob/v2.31.1/xdiff/xprepare.c#L150-L154 means that indeed, the `ha` field is set to a counter that uniquely identifies the line contents: rec->ha = (unsigned long) rcrec->idx; hi = (long) XDL_HASHLONG(rec->ha, hbits); rec->next = rhash[hi]; rhash[hi] = rec; So I am fairly confident that the patch is good, and the performance win is nice. Thanks! Dscho > > Thanks. > > > diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c > > index 20699a6f6054..db2d53e89cb0 100644 > > --- a/xdiff/xpatience.c > > +++ b/xdiff/xpatience.c > > @@ -90,7 +90,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map, > > { > > xrecord_t **records = pass == 1 ? > > map->env->xdf1.recs : map->env->xdf2.recs; > > - xrecord_t *record = records[line - 1], *other; > > + xrecord_t *record = records[line - 1]; > > /* > > * After xdl_prepare_env() (or more precisely, due to > > * xdl_classify_record()), the "ha" member of the records (AKA lines) > > @@ -104,11 +104,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map, > > int index = (int)((record->ha << 1) % map->alloc); > > > > while (map->entries[index].line1) { > > - other = map->env->xdf1.recs[map->entries[index].line1 - 1]; > > - if (map->entries[index].hash != record->ha || > > - !xdl_recmatch(record->ptr, record->size, > > - other->ptr, other->size, > > - map->xpp->flags)) { > > + if (map->entries[index].hash != record->ha) { > > if (++index >= map->alloc) > > index = 0; > > continue; > > @@ -253,8 +249,7 @@ static int match(struct hashmap *map, int line1, int line2) > > { > > xrecord_t *record1 = map->env->xdf1.recs[line1 - 1]; > > xrecord_t *record2 = map->env->xdf2.recs[line2 - 1]; > > - return xdl_recmatch(record1->ptr, record1->size, > > - record2->ptr, record2->size, map->xpp->flags); > > + return record1->ha == record2->ha; > > } > > > > static int patience_diff(mmfile_t *file1, mmfile_t *file2, >