Re: [PATCH 01/10] list_lookup: create case and length search

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Junio C Hamano <gitster@xxxxxxxxx> writes:

> Antoine Pelisse <apelisse@xxxxxxxxx> writes:
>
>> Create a new function to look-up a string in a string_list, but:
>>  - add a new parameter to ignore case differences
>>  - add a length parameter to search for a substring
>>
>> The idea is to avoid several copies (lowering a string before searching
>> it when we just want to ignore case), or copying a substring of a bigger
>> string to search it in the string_list
>>
>> Signed-off-by: Antoine Pelisse <apelisse@xxxxxxxxx>
>> ---
>
> I did not read beyond the log message and the function signature of
> the updated get_entry_index(), but it strikes me insane to build a
> sorted list using case sensitive full string comarison and then to
> look for an element in that list using a different comparison
> criteria (e.g. case insensitive comparison) and to expect the
> bisection search to still work.  Shouldn't the codepath that builds
> the string-list be sorting the list in case insensitive way from the
> beginning if this were to work correctly?
>
> It seems to suggest to me that this "are the keys case sensitive?"
> bit belongs to the entire struct string_list structure as its
> property (similar to the way "do the keys need to be strdup'ed?"
> bit), not something you would flip per invocation basis of the
> lookup function.
>
> Also isn't size_t an unsigned type?  What does it even mean to pass
> "-1" to it, and propagate it down to strncmp()?
>
> If you have a list sorted by a key, and if you want to query it with
> a partial prefix of a possibly valid key, I think you shouldn't have
> to add the "length search" at all. The existing look up function
> would return the location in the list that short key would have been
> inserted, which would come before the first entry that your short
> key is a prefix of, so the caller can iterate the list from there to
> find all entries.  In other words, if existing list has "aaa",
> "bba", and "bbc", and you want to grab entries that begin with "bb",
> you can ask for "bb" to the loop up function, which would say "the
> key does not exist in the list, but it would be inserted before this
> 'bba' entry".  Then you can discover that "bba" and "bbc" both
> matches the shortened key you have, no?

In the above, I mixed up which string is a prefix of which, but the
overall comment still stands, I think.

Your mailmap string list has "Ee@eee" (email part only) stored, while
the record from author/committer line has "NNNN <EE@EEE> TTTTTT sZZZZ",
to hold name, timestamp and zone.  You have a pointer email pointing
at the first "E" in that record, with maillen set to 6, and see if
the mailmap contains a string that match "EE@EEE".  So I mixed up
the prefix-ness of these strings.

Let's address the case sensitivety first.  The string_list_lookup()
and your string_list_lookup_extended() functions both work on a
string list that is built by add_mapping() function that asks
string_list_find_insert_index() to find an insertion point in the
sorted string list.  If this is not done using the same case
insensitive manner for a string list that is meant to be queried
case insensitively, the resulting list is still sorted case
sensitively, and look-up functions that work by binary search won't
find what you are looking for when they try to find the string using
case-insensitive comparison.  The change to implement it would
probably look like the attached patch.

Assuming that we can leave the case sensitivity behind us that way,
we still would need a new helper function

    string_list_lookup_prefix(struct string_list *, const char*, size_t)

so that you can find an existing "Ee@eee" entry in the string list
by email that points at "EE@EEE> TTTTTT sZZZZ" with maillen of 6.

I am wondering if we can do that without adding the length-limited
sort in the low-level get_entry_index() to do so, though.

You could first ask string_list_find_insert_index() to locate the
insertion point for "EE@EEE> TTTTTT sZZZZ".  That has to come after
"Ee@eee" if it exists (or it could be "De@eee" in which case you
know there is no match).

 1. Ask string_list_find_insert_index() to locate "EE@EEE> TTTTTT..."

    a. If it says there is an exact match, then that did not match
       "EE@EEE" list (you found "EE@EEE> TTTTTT sZZZZ").  You might
       still find "EE@EEE" before that record, though.  Also your
       key may have been "EE@EEE" exactly, in which case you already
       found what you were looking for.

    b. If it says there is not an exact match, the returned value is
       pointing at a record that comes after "EE@EEE> TTTTTT sZZZZ" if
       you were to insert it, which definitely is after "EE@EEE".

    So in either case, you know what you are looking for must come
    before string_list_find_insert_index() gave you.  Let's call
    that position "i".

    You still need to be careful that there could be an existing
    entry whose key is "EE@EEE" followed by a byte that sorts before
    ">" or " " (we do not want to limit this new generic string-list
    function to operate only on e-mail addresses), though.  So the
    above "must come before" is not "must come immediately before".

 2. Then compare map->items[i].string and email using strncasecmp()
    for maillen bytes, and make sure map->items[].string is exactly
    maillen bytes long.  If it matches, you found what you were
    looking for.  Otherwise, decrement "i" (e.g. the overlong key
    string you had was "EE@EEE> TTTTTT..." and the record at one
    before the insertion point was "EE@EEE\t" which sorts before it,
    but the string-list may have "EE@EEE" before that entry), see if
    the first maillen bytes still match, and continue.  Once you
    decrement i enough, the first maillen bytes won't match
    (e.g. you hit "De@EEE"), or "i" goes negative, and you can stop.

or something like that.  I'll cook up a patch and see how ugly it
ends up with.

-- >8 --
Subject: [PATCH] string-list: allow case-insensitive string list

Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
---

 * At the very beginning of mailmap.c::read_mailmap(), you would say
   something like:

	int read_mailmap(struct string_list *map, char **repo_ident)
        {
        	map->strdup_strings = 1;
       +        map->cmp = strcasecmp;
                /* each failure returns 1, so >1 means both calls ...

   to make the mailmap string list a case insensitive one.

 string-list.c | 17 +++++++++++++----
 string-list.h |  4 ++++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/string-list.c b/string-list.c
index 397e6cf..6f3d8cf 100644
--- a/string-list.c
+++ b/string-list.c
@@ -7,10 +7,11 @@ static int get_entry_index(const struct string_list *list, const char *string,
 		int *exact_match)
 {
 	int left = -1, right = list->nr;
+	compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
 
 	while (left + 1 < right) {
 		int middle = (left + right) / 2;
-		int compare = strcmp(string, list->items[middle].string);
+		int compare = cmp(string, list->items[middle].string);
 		if (compare < 0)
 			right = middle;
 		else if (compare > 0)
@@ -96,8 +97,9 @@ void string_list_remove_duplicates(struct string_list *list, int free_util)
 {
 	if (list->nr > 1) {
 		int src, dst;
+		compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
 		for (src = dst = 1; src < list->nr; src++) {
-			if (!strcmp(list->items[dst - 1].string, list->items[src].string)) {
+			if (!cmp(list->items[dst - 1].string, list->items[src].string)) {
 				if (list->strdup_strings)
 					free(list->items[src].string);
 				if (free_util)
@@ -230,15 +232,20 @@ struct string_list_item *string_list_append(struct string_list *list,
 			list->strdup_strings ? xstrdup(string) : (char *)string);
 }
 
+/* Yuck */
+static compare_strings_fn compare_for_qsort;
+
+/* Only call this from inside sort_string_list! */
 static int cmp_items(const void *a, const void *b)
 {
 	const struct string_list_item *one = a;
 	const struct string_list_item *two = b;
-	return strcmp(one->string, two->string);
+	return compare_for_qsort(one->string, two->string);
 }
 
 void sort_string_list(struct string_list *list)
 {
+	compare_for_qsort = list->cmp ? list->cmp : strcmp;
 	qsort(list->items, list->nr, sizeof(*list->items), cmp_items);
 }
 
@@ -246,8 +253,10 @@ struct string_list_item *unsorted_string_list_lookup(struct string_list *list,
 						     const char *string)
 {
 	int i;
+	compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
+
 	for (i = 0; i < list->nr; i++)
-		if (!strcmp(string, list->items[i].string))
+		if (!cmp(string, list->items[i].string))
 			return list->items + i;
 	return NULL;
 }
diff --git a/string-list.h b/string-list.h
index c50b0d0..446e79e 100644
--- a/string-list.h
+++ b/string-list.h
@@ -5,10 +5,14 @@ struct string_list_item {
 	char *string;
 	void *util;
 };
+
+typedef int (*compare_strings_fn)(const char *, const char *);
+
 struct string_list {
 	struct string_list_item *items;
 	unsigned int nr, alloc;
 	unsigned int strdup_strings:1;
+	compare_strings_fn cmp; /* NULL uses strcmp() */
 };
 
 #define STRING_LIST_INIT_NODUP { NULL, 0, 0, 0 }
-- 
1.8.1.304.gf036638




--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]