Re: [PATCH 4/4] submodule: port submodule subcommand 'summary' from shell to C

Shourya Shukla <shouryashukla.oo@xxxxxxxxx> · Mon, 6 Jul 2020 16:45:59 +0530

On 06/07 02:46, Kaartic Sivaraam wrote:
> Note: I've added some comment but I've not been able to address all the
> parts due to lack of time. I'm sending it sooner hoping it would be
> useful.

Sure no problem

> On 05-07-2020 23:04, Shourya Shukla wrote:
> [...]
> >> [exchanging Stefan Beller's dysfunct @google address for their private
> >> one; I encourage you to do the same in the next iteration, probably
> >> by editing the `Mentored-by:` line.]
> > 
> > I think you missed to mention it.
> >
> 
> If you're looking for the private e-mail. The exachange was already done
> and it was right there in the Cc list of the mail sent by Dscho. I've
> added it again as you seem to have removed it.

Oh I did not catch that actually. Thanks!

> >> On Fri, 3 Jul 2020, Shourya Shukla wrote:
> >>
> >>> From: Prathamesh Chavan <pc44800@xxxxxxxxx>
> >>>
> >>> The submodule subcommand 'summary' is ported in the process of
> >>> making git-submodule a builtin. The function cmd_summary() from
> >>> git-submodule.sh is ported to functions module_summary(),
> >>> compute_summary_module_list(), prepare_submodule_summary() and
> >>> generate_submodule_summary(), print_submodule_summary().
> >>>
> >>> The first function module_summary() parses the options of submodule
> >>> subcommand and also acts as the front-end of this subcommand.
> >>> After parsing them, it calls the compute_summary_module_list()
> >>
> >> Missing full-stop, and probably the sentence also wanted to say "function"
> >> at the end.
> > 
> > I will correct. Thanks for pointing out!
> > 
> >>> The functions compute_summary_module_list() runs the diff_cmd,
> >>> and generates the modules list, as required by the subcommand.
> >>> The generation of this module list is done by the using the
> >>
> >> s/the using/using/
> > 
> > Will amend!
> > 
> >>> callback function submodule_summary_callback(), and stored in the
> >>> structure module_cb.
> >>
> >> This explains nicely what the patch does. But the commit message should
> >> not really repeat what can be readily deduced from the patch; It should
> >> focus on the motivation and on interesting background information that is
> >> _not_ readily deduced from the patch.
> > 
> > I understand. I will follow your suggestions regarding my patch.
> > 
> >> For example, I see that `$diff_cmd` is called twice in the shell script
> >> version, once to "get modified modules cared by user" and then _again_,
> >> with that list of modified modules.
> >>
> >> I would have liked to see a reasoning in the commit message that explains
> >> why this has to be so in the C version. I get why it is complicated in a
> >> shell script (which lacks proper objects, after all), but I would have
> >> expected the C version to be able to accumulate the information with a
> >> single pass.
> >>
> >> (Before writing the following paragraph, I actually reviewed the patch
> >> from bottom to top, in the caller->callee direction.)
> >>
> >> Ah. I see that this indeed is the case: there is only one pass in the C
> >> version. That's a useful piece of metadata for the commit message, I
> >> think, much more useful than describing the call tree of the functions.
> > 
> > Yup that it worth mentioning.
> > 
> >> Another thing worth mentioning in the commit message is that we use the
> >> combination of setting a child_process' working directory to the submodule
> >> path and then calling `prepare_submodule_repo_env()` which also sets
> >> `GIT_DIR` to `.git`, so that we can be certain that those spawned
> >> processes will not access the superproject's ODB by mistake.
> >>
> >> When reading my suggestions, please keep in mind that I reviewed the
> >> functions in caller->callee order, i.e. I started at the end of the patch
> >> and then worked my way up.
> >>
> >> All in all, I like the function structure, but I think there is still a
> >> bit room for improvement in a v2.
> > 
> >>> +static int verify_submodule_object_name(const char *sm_path,
> >>> +					  const char *sha1)
> >>> +{
> >>> +	struct child_process cp_rev_parse = CHILD_PROCESS_INIT;
> >>> + > > +	cp_rev_parse.git_cmd = 1;
> >>> +	cp_rev_parse.no_stdout = 1;
> >>> +	cp_rev_parse.dir = sm_path;
> >>
> >> So here we specify `sm_path` as current working directory.
> >>
> >>> +	prepare_submodule_repo_env(&cp_rev_parse.env_array);
> >>
> >> And this implicitly sets `GIT_DIR=.git`. Good.
> >>
> >>> +	argv_array_pushl(&cp_rev_parse.args, "rev-parse", "-q",
> >>> +			 "--verify", NULL);
> >>> +	argv_array_pushf(&cp_rev_parse.args, "%s^0", sha1);
> >>
> >> After this, we should also append `--` to make sure that we're not parsing
> >> this as a file name.
> > 
> > Will do!
> > 
> >> Two comments about naming: `sha1` is pretty misleading here, as we do not
> >> require it to be a SHA-1 (especially in the future in which we switch to
> >> SHA-256). Besides, what we're really asking for (via that `^0` suffix) is
> >> a committish. Therefore, I would propose to use `committish` both in the
> >> parameter name as well as the function name.
> > 
> > I am not aware of this change. I will take this suggestion into account.
> > 
> >>> +
> >>> +	if (run_command(&cp_rev_parse))
> >>> +		return 1;
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static void print_submodule_summary(struct summary_cb *info, int errmsg,
> >>> +				      int total_commits, int missing_src,
> >>> +				      int missing_dst, const char *displaypath,
> >>> +				      int is_sm_git_dir, struct module_cb *p)
> >>> +{
> >>> +	if (p->status == 'T') {
> >>> +		if (S_ISGITLINK(p->mod_dst))
> >>> +			printf(_("* %s %s(blob)->%s(submodule)"),
> >>> +				 displaypath, find_unique_abbrev(&p->oid_src, 7),
> >>
> >> The shell script version does this:
> >>
> >>                 sha1_abbr_src=$(GIT_DIR="$name/.git" git rev-parse --short $sha1_src 2>/dev/null ||
> >>                         echo $sha1_src | cut -c1-7)
> >>
> >> That is not quite the same, as it looks for the abbreviation _in the
> >> submodule_, not in the current project. So I think `find_unique_abbrev()`
> >> is not correct here.
> >>
> >> The funny thing is that we _already_ will have called `git rev-parse
> >> --verify` for both `p->oid_src` and `p->oid_dst` in the submodule, in the
> >> caller of this function! And while we throw away the result, and while we
> >> do not pass `--short`, there is no reason why we shouldn't be able to do
> >> precisely that.
> > 
> > Okay so you are saying that there is no need of a 'find_unique_abbrev()'
> > since we would be easily able to obtain these values from the caller of
> > 'print_submodule_summary()' right?
> 
> Yes. 'generate_submodule_summary' already does a rev-parse on
> p->oid_src and p->oid_dst via 'verify_submodule_object_name'.
> We should be able to get the short version of them by passing '--short'
> to rev-parse there and make it return the short SHA1. We can then use it
> like how Dscho mentions below.

Okay. That seems good. Yes, it will be cleaner to pass them as
arguments.

> > Maybe we can pass 'oid_src' or 'oid_dst' as an argument?
> > 
> >>> +				 find_unique_abbrev(&p->oid_dst, 7));
> >>> +		else
> >>> +			printf(_("* %s %s(submodule)->%s(blob)"),
> >>> +				 displaypath, find_unique_abbrev(&p->oid_src, 7),
> >>> +				 find_unique_abbrev(&p->oid_dst, 7));
> >>> +	} else {
> >>> +		printf("* %s %s...%s",
> >>> +			displaypath, find_unique_abbrev(&p->oid_src, 7),
> >>> +			find_unique_abbrev(&p->oid_dst, 7));
> >>> +	}
> >>> +
> >>> +	if (total_commits < 0)
> >>> +		printf(":\n");
> >>> +	else
> >>> +		printf(" (%d):\n", total_commits);
> >>> +
> >>> +	if (errmsg) {
> >>> +		/*
> >>> +		 * Don't give error msg for modification whose dst is not
> >>> +		 * submodule, i.e. deleted or changed to blob
> >>> +		 */
> >>> +		if (S_ISGITLINK(p->mod_src)) {
> >>> +			if (missing_src && missing_dst) {
> >>> +				printf(_("  Warn: %s doesn't contain commits %s and %s\n"),
> >>> +				       displaypath, oid_to_hex(&p->oid_src),
> >>> +				       oid_to_hex(&p->oid_dst));
> >>> +			} else if (missing_src) {
> >>> +				printf(_("  Warn: %s doesn't contain commit %s\n"),
> >>> +				       displaypath, oid_to_hex(&p->oid_src));
> >>> +			} else {
> >>> +				printf(_("  Warn: %s doesn't contain commit %s\n"),
> >>> +				       displaypath, oid_to_hex(&p->oid_dst));
> >>> +			}
> >>> +		}
> >>> +	} else if (is_sm_git_dir) {
> >>> +		struct child_process cp_log = CHILD_PROCESS_INIT;
> >>> +
> >>> +		cp_log.git_cmd = 1;
> >>> +		cp_log.dir = p->sm_path;
> >>> +		prepare_submodule_repo_env(&cp_log.env_array);
> >>
> >> Since the working directory is set to the top-level directory of the
> >> submodule, and since `prepare_submodule_repo_env()` sets `GIT_DIR` to
> >> `.git`, I think that the `is_sm_git_dir` condition is unnecessary. In
> >> fact, the entire `is_sm_git_dir` parameter (and local variable in the
> >> caller, see more on that below) can go away.
> > 
> > Because we already set the $GIT_DIR to .git/ so an extra check will not
> > be necessary right?
> > 
> 
> Yes. If we remove that check and we get a p->sm_path that does not point
> to a submodule, I wonder what would happen if we run 'run_command'
> on it. I'm also not sure if that's a possible case. Something to
> explore.

This might take some time but would be a huge change for this patch if
successful.

> >>> +		argv_array_pushl(&cp_log.args, "log", NULL);
> >>> +
> >>> +		if (S_ISGITLINK(p->mod_src) && S_ISGITLINK(p->mod_dst)) {
> >>> +			if (info->summary_limit > 0)
> >>> +				argv_array_pushf(&cp_log.args, "-%d",
> >>> +						 info->summary_limit);
> >>> +
> >>> +			argv_array_pushl(&cp_log.args, "--pretty=  %m %s",
> >>> +					 "--first-parent", NULL);
> >>> +			argv_array_pushf(&cp_log.args, "%s...%s",
> >>> +					 oid_to_hex(&p->oid_src),
> >>> +					 oid_to_hex(&p->oid_dst));
> >>> +		} else if (S_ISGITLINK(p->mod_dst)) {
> >>> +			argv_array_pushl(&cp_log.args, "--pretty=  > %s",
> >>> +					 "-1", oid_to_hex(&p->oid_dst), NULL);
> >>> +		} else {
> >>> +			argv_array_pushl(&cp_log.args, "--pretty=  < %s",
> >>> +					 "-1", oid_to_hex(&p->oid_src), NULL);
> >>> +		}
> >>> +		run_command(&cp_log);
> >>> +	}
> >>> +	printf("\n");
> >>> +}
> >>
> >> It looks as if there is a whole lot of `oid_to_hex(&p->oid_src)` in that
> >> function. Together with the realization that we need the abbreviated
> >> version of that at least in one place, and the other realization that we
> >> already call `rev-parse --verify` for both `oid_src` and `oid_dst` in the
> >> caller of this function, it seems to suggest itself that we would actually
> >> want to pass the `--short` option, too, and to capture the output, and
> >> pass it down to `print_submodule_summary()` _instead of_ `missing_src` and
> >> `missing_dst` (e.g. as `src_abbrev` and `dst_abbrev`).
> > 
> > Oh you have mentioned it here too. This seems quite a good approach. I
> > will adopt this.
> > 
> >>> +
> >>> +static void generate_submodule_summary(struct summary_cb *info,
> >>> +				       struct module_cb *p)
> >>> +{
> >>> +	int missing_src = 0;
> >>> +	int missing_dst = 0;
> >>> +	char *displaypath;
> >>> +	int errmsg = 0;
> >>> +	int total_commits = -1;
> >>> +	int is_sm_git_dir = 0;
> >>> +	struct strbuf sm_git_dir_sb = STRBUF_INIT;
> >>> +
> >>> +	if (!info->cached && oideq(&p->oid_dst, &null_oid)) {
> >>> +		if (S_ISGITLINK(p->mod_dst)) {
> >>> +			/*
> >>> +			 * NEEDSWORK: avoid using separate process with
> >>> +			 * the help of the function head_ref_submodule()
> >>
> >> I don't quite understand this comment. There is no `head_ref_submodule()`
> >> function.
> >>
> 
> That NEEDSWORK was added based on Stefan's comment on a previous version
> of Prathamesh's patch. Here it is for reference:
> 
> >> +       if (!info->cached && !oidcmp(&p->oid_dst, &null_oid)) {
> >> +               if (S_ISGITLINK(p->mod_dst)) {
> >> +                       struct child_process cp_rev_parse = CHILD_PROCESS_INIT;
> >> +                       struct strbuf sb_rev_parse = STRBUF_INIT;
> >> +
> >> +                       cp_rev_parse.git_cmd = 1;
> >> +                       cp_rev_parse.no_stderr = 1;
> >> +                       cp_rev_parse.dir = p->sm_path;
> >> +                       prepare_submodule_repo_env(&cp_rev_parse.env_array);
> >> +
> >> +                       argv_array_pushl(&cp_rev_parse.args,
> >> +                                        "rev-parse", "HEAD", NULL);
> >> +                       if (!capture_command(&cp_rev_parse, &sb_rev_parse, 0)) {
> >> +                               strbuf_strip_suffix(&sb_rev_parse, "\n");
> >> +
> >> +                               get_oid_hex(sb_rev_parse.buf, &p->oid_dst);
> >> +                       }
> >> +                       strbuf_release(&sb_rev_parse);
> > 
> > I think this could be replaced via
> > head_ref_submodule(sub->path, callback function, &where_to_store)
> > or is there some trickery going on, that this also works on
> > non-compliant submodules?
> > (Maybe add that as a NEEDSWORK/TODO)
> 
> Ref:
> https://public-inbox.org/git/CAGZ79kaWn9z47Va=VW4R2Aswws1N5n2u4Kvatn73s0YnV0pVqQ@xxxxxxxxxxxxxx/
> 
> A quick search reveals that 'head_ref_submodule' existed during that
> period. On further investigation it seems that 'refs_head_ref' was
> introduced in 62f0b399e0 (refs: add refs_head_ref(), 2017-08-23) and
> 'head_ref_submodule' was made to use it. Later on, in 419221c106 (refs:
> remove dead for_each_*_submodule(), 2017-08-23), 'head_ref-submodule'
> was removed with an advice to use the 'refs_' API for accessing
> submodules.
> 
> > +* Use `refs_` API for accessing submodules. The submodule ref store could
> > +  be obtained with `get_submodule_ref_store()`
> 
> How it applies to our code is something to be looked into, yet.
> 
> >>> +			 */
> >>> +			struct child_process cp_rev_parse = CHILD_PROCESS_INIT;
> >>> +			struct strbuf sb_rev_parse = STRBUF_INIT;
> >>> +
> >>> +			cp_rev_parse.git_cmd = 1;
> >>> +			cp_rev_parse.no_stderr = 1;
> >>> +			cp_rev_parse.dir = p->sm_path;
> >>> +			prepare_submodule_repo_env(&cp_rev_parse.env_array);
> >>> +
> >>> +			argv_array_pushl(&cp_rev_parse.args, "rev-parse",
> >>> +					 "HEAD", NULL);
> >>> +			if (!capture_command(&cp_rev_parse, &sb_rev_parse, 0)) {
> >>> +				strbuf_strip_suffix(&sb_rev_parse, "\n");
> >>> +				get_oid_hex(sb_rev_parse.buf, &p->oid_dst);
> >>> +			}
> >>> +			strbuf_release(&sb_rev_parse);
> >>> +		} else if (S_ISLNK(p->mod_dst) || S_ISREG(p->mod_dst)) {
> >>> +			struct child_process cp_hash_object = CHILD_PROCESS_INIT;
> >>> +			struct strbuf sb_hash_object = STRBUF_INIT;
> >>> +
> >>> +			cp_hash_object.git_cmd = 1;
> >>> +			argv_array_pushl(&cp_hash_object.args, "hash-object",
> >>> +					 p->sm_path, NULL);
> >>> +			if (!capture_command(&cp_hash_object,
> >>> +					     &sb_hash_object, 0)) {
> >>> +				strbuf_strip_suffix(&sb_hash_object, "\n");
> >>> +				get_oid_hex(sb_hash_object.buf, &p->oid_dst);
> >>> +			}
> >>> +			strbuf_release(&sb_hash_object);
> >>
> >> It would probably be shorter, less error-prone, and quicker to use
> >> `index_fd()` directly.
> >>
> >> BTW I am not quite sure that this code does the correct thing in case of a
> >> symlink: it hashes the contents of the symlink target (if it is a file,
> >> otherwise it errors out). But that is hardly an issue introduced by the
> >> conversion, that's just copied from `git-submodule.sh`.
> >>
> >>> +		} else {
> >>> +			if (p->mod_dst)
> >>> +				die(_("unexpected mode %d\n"), p->mod_dst);
> >>
> >> Hmm. This does not match what the shell script version does:
> >>
> >>                         *)
> >>                                 # unexpected type
> >>                                 eval_gettextln "unexpected mode \$mod_dst" >&2
> >>                                 continue ;;
> >>
> >> I think we should also just write the message to `stderr` and continue,
> >> not `die()`.
> >>
> >> In addition to that, I am missing the C code for this case:
> >>
> >>                         000000)
> >>                                 ;; # removed
> >>
> >> It is quite possible that our test suite does not cover this case (or did
> >> the test suite fail for you?). If that is indeed the case, it would be
> >> really good to add a test case as part of this patch series, to gain
> >> confidence in the correctness of the conversion.
> > 
> > The tests passed for me actually. Whether this is covered by the test
> > cases, I am not sure. I will have to check it.
> > 
> >>> +		}
> >>> +	}
> >>> +
> >>> +	strbuf_addstr(&sm_git_dir_sb, p->sm_path);
> >>
> >> I have to admit that I am not loving the name `sm_git_dir_sb`. Why not
> >> `submodule_git_dir`? I guess you copied it from elsewhere in
> >> `submodule--helper.c`...
> >>
> >>> +	if (is_nonbare_repository_dir(&sm_git_dir_sb))
> >>> +		is_sm_git_dir = 1;
> >>
> >> So here, we verify whether there is a repository at `p->sm_path`. I don't
> >> see that in the shell script version:
> >>
> >>                 missing_src=
> >>                 missing_dst=
> >>
> >>                 test $mod_src = 160000 &&
> >>                 ! GIT_DIR="$name/.git" git rev-parse -q --verify $sha1_src^0 >/dev/null &&
> >>                 missing_src=t
> >>
> >>                 test $mod_dst = 160000 &&
> >>                 ! GIT_DIR="$name/.git" git rev-parse -q --verify $sha1_dst^0 >/dev/null &&
> >>                 missing_dst=t
> >>
> >> Let's read a bit further.
> >>
> >>> +
> >>> +	if (is_sm_git_dir && S_ISGITLINK(p->mod_src))
> >>> +		missing_src = verify_submodule_object_name(p->sm_path,
> >>> +							   oid_to_hex(&p->oid_src));
> >>
> >> Ah, and `verify_submodule_object_name()` uses `p->sm_path` as working
> >> directory. But that's not what the shell script version did: it specified
> >> the `GIT_DIR` explicitly.
> >>
> >> And by using the `prepare_submodule_repo_env()` function in
> >> `verify_submodule_object_name()`, we specify `GIT_DIR` implicitly, as I
> >> pointed out in my comment on that function.
> > 
> > Oh so you're saying that it will be better to call
> > 'prepare_submodule_repo_env()' on some variable since we explicitly want to
> > store the path to GIT_DIR?
> > 
> 
> We already call 'prepare_submodule_repo_env' in
> 'verify_submodule_object_name'. So, he's likely saying that
> 'is_sm_git_dir' is unnecessary here.

Understood.

> > It would be of much help if you could explain this part just a little
> > more (for my own sake).
> > 
> >> So I think that `is_sm_git_dir` might be
> > 
> 
> ... unnecessary.
> 
> > I think you missed something here...
> > 
> 
> That's likely what he meant based on what is mentioned above.

Oh okay. Thanks!

> >>> +
> >>> +	if (is_sm_git_dir && S_ISGITLINK(p->mod_dst))
> >>> +		missing_dst = verify_submodule_object_name(p->sm_path,
> >>> +							   oid_to_hex(&p->oid_dst));
> >>> +
> >>> +	displaypath = get_submodule_displaypath(p->sm_path, info->prefix);
> >>> +
> >>> +	if (!missing_dst && !missing_src) {
> >>> +		if (is_sm_git_dir) {
> >>> +			struct child_process cp_rev_list = CHILD_PROCESS_INIT;
> >>> +			struct strbuf sb_rev_list = STRBUF_INIT;
> >>> +			char *range;
> >>> +
> >>> +			if (S_ISGITLINK(p->mod_src) && S_ISGITLINK(p->mod_dst))
> >>> +				range = xstrfmt("%s...%s", oid_to_hex(&p->oid_src),
> >>> +						oid_to_hex(&p->oid_dst));
> >>> +			else if (S_ISGITLINK(p->mod_src))
> >>> +				range = xstrdup(oid_to_hex(&p->oid_src));
> >>> +			else
> >>> +				range = xstrdup(oid_to_hex(&p->oid_dst));
> >>> +
> >>> +			cp_rev_list.git_cmd = 1;
> >>> +			cp_rev_list.dir = p->sm_path;
> >>> +			prepare_submodule_repo_env(&cp_rev_list.env_array);
> >>
> >> Again, due to setting the working directory to `p->sm_path` and
> >> (implicitly, via `prepare_submodule_repo_env()`) `GIT_DIR` to `.git`, I do
> >> not think that we have to guard this block beind `is_sm_git_dir`.
> > 
> >>> +
> >>> +			argv_array_pushl(&cp_rev_list.args, "rev-list",
> >>> +					 "--first-parent", range, "--", NULL);
> >>
> >> Since `argv_array_push()` duplicates the strings, anyway, we can totally
> >> avoid the need for the `range` variable:
> >>
> >> 			if (IS_GITLINK(p->mod_src) && IS_GITLINK(p->mod_dst))
> >> 				argv_array_pushf(&cp_rev_list.args, "%s...%s",
> >> 						 oid_to_hex(&p->oid_src),
> >> 						 oid_to_hex(&p->oid_dst));
> >> 			else
> >> 				argv_array_push(&cp_rev_list.args, IS_GITLINK(p->mod_src) ?
> >> 						oid_to_hex(&p->oid_src) :
> >> 						oid_to_hex(&p->oid_dst));
> >>
> >>> +			if (!capture_command(&cp_rev_list, &sb_rev_list, 0)) {
> >>> +				if (sb_rev_list.len)
> >>> +					total_commits = count_lines(sb_rev_list.buf,
> >>> +								    sb_rev_list.len);
> >>
> >> That's actually not necessary. `git rev-list --count` will give you a nice
> >> number, no need to capture a potentially large amount of memory only to
> >> count the lines.
> >>
> >> This may also make the patch obsolete that makes `count_lines()` public.
> > 
> > Therefore we eliminate count_lines() from here and instead do a 'git
> > rev-list --count'?
> > 
> 
> Yes.
> 
> >>> +				else
> >>> +					total_commits = 0;
> >>> +			}
> >>
> >>> +
> >>> +			free(range);
> >>> +			strbuf_release(&sb_rev_list);
> >>> +		}
> >>> +	} else {
> >>> +		errmsg = 1;
> >>> +	}
> >>
> >> I am missing the equivalent for these lines here:
> >>
> >>                 t,)
> >>                         errmsg="$(eval_gettext "  Warn: \$display_name doesn't contain commit \$sha1_src")"
> >>                         ;;
> >>                 ,t)
> >>                         errmsg="$(eval_gettext "  Warn: \$display_name doesn't contain commit \$sha1_dst")"
> >>                         ;;
> >>                 t,t)
> >>                         errmsg="$(eval_gettext "  Warn: \$display_name doesn't contain commits \$sha1_src and \$ ha1_dst")"
> >>                         ;;
> > 
> > I will add them.
> > 
> 
> I think they're already there in the 'print_submodule_summary' function
> above.
> 
> >> I am not quite sure whether it is a good idea to leave it to the
> >> `print_submodule_summary()` function to generate the `errmsg`. I think I'd
> >> rather have it a `char *` than an `int`.
> > 
> > Would it be better to add these error messages in
> > 'prepare_submodule_summary()'?
> 
> No. He's likely saying that instead of setting `errmsg` to 1 and
> constructing the error message in `print_submodule_summary` we should
> be having the error messages in `generate_submodule_summary` and pass
> them to `print_submodule_summary` instead of passing an int.

So this means that we will have to generate the above mentioned messages
in 'generate_submodule_summary()' and then pass them as char into
'print_submodule_summary()' rather than generating them in the latter.
This is what we want right?

> > If we have error messages as integers
> > then we will simply
> > 
> 
> You missed something here. ;)

I actually don't know what I was trying to write here.

> >>> +
> >>> +	print_submodule_summary(info, errmsg, total_commits,
> >>> +				missing_src, missing_dst,
> >>> +		      		displaypath, is_sm_git_dir, p);
> >>> +
> >>> +	free(displaypath);
> >>> +	strbuf_release(&sm_git_dir_sb);
> >>> +}
> >>> +
> >>> +static void prepare_submodule_summary(struct summary_cb *info,
> >>> +				      struct module_cb_list *list)
> >>> +{
> >>> +	int i;
> >>> +	for (i = 0; i < list->nr; i++) {
> >>> +		struct module_cb *p = list->entries[i];
> >>> +		struct child_process cp_rev_parse = CHILD_PROCESS_INIT;
> >>> +
> >>> +		if (p->status == 'D' || p->status == 'T') {
> >>> +			generate_submodule_summary(info, p);
> >>> +			continue;
> >>> +		}
> >>> +
> >>> +		if (info->for_status) {
> >>> +			char *config_key;
> >>
> >> Since the `config_key` is only used within the `if()` block it would be
> >> better to declare it within that block.
> >>
> >>> +			const char *ignore_config = "none";
> >>
> >> Since the only value we ever care about is "all", how about turning this
> >> into an `int`, setting it to `0` here, and later assigning it to
> >> `!strcmp(value, "all")` and `!strcmp(sub->ignore, "all")`, respectively?
> > 
> > Alright will do!
> > 
> >> I mean, I get it. Unix shell scripts are all about passing around text.
> >> And it is much easier to just translate that to C faithfully. But that
> >> does not make it good C code. C has data types, and proper C code makes
> >> use of that.
> >>
> >>> +			const char *value;
> >>
> >> If you want to save on lines, you can cuddle this together with other
> >> declarations of the same type. Even so, it could be scoped more narrowly.
> >>
> >>> +			const struct submodule *sub = submodule_from_path(the_repository,
> >>> +									  &null_oid,
> >>> +									  p->sm_path);
> >>> +
> >>> +			if (sub && p->status != 'A') {
> >>
> >> Good. The shell script version _always_ retrieved the `.ignore` config
> >> value, even if the `status` is `A`. Your version is much better.
> >>
> >> But why bother calling `submodule_from_path()` if the status is `A`?
> > 
> > What exactly does a status of 'A' or 'T' mean? I mean I know what we are
> > doing but what exactly do these translate into?
> > 
> 
> Its interesting you understood it without knowing what 'A' and 'T'
> meant. Anyways, if you take a look at the documentation of
> 'diff-index'[1] which provides us the `status` you'll know that:

I just assumed that they must mean something to something and read on
the rest of the code. I meant to ask the full-form of 'A', 'T', etc.
You have provided it below and this makes things even more clear.
Thanks!

> > Possible status letters are:
> > 
> >     A: addition of a file
> > 
> >     C: copy of a file into a new one
> > 
> >     D: deletion of a file
> > 
> >     M: modification of the contents or mode of a file
> > 
> >     R: renaming of a file
> > 
> >     T: change in the type of the file
> > 
> >     U: file is unmerged (you must complete the merge before it can be committed)
> > 
> >     X: "unknown" change type (most probably a bug, please report it)
> > 
> 
> [1]: https://git-scm.com/docs/git-diff-index
> 
> >> I could actually see the `const struct submodule *sub;` declaration be
> >> pulled out of this scope, and combining the `if (info->for_status &&
> >> p->status != 'A'), and the moving the assignment of `sub` into the `else
> >> if ((sub = submodule_from_path(r, &null_oid, p->sm_path)) &&
> >> sub->ignore)`.
> >>
> >> That would save us one entire indentation level.
> > 
> > That seems a good approach! I will try this out.
> > 
> >>> +				config_key = xstrfmt("submodule.%s.ignore",
> >>> +						     sub->name);
> >>> +				if (!git_config_get_string_const(config_key, &value))
> >>> +					ignore_config = value;
> >>> +				else if (sub->ignore)
> >>> +					ignore_config = sub->ignore;
> >>> +
> >>> +				free(config_key);
> >>> +				if (!strcmp(ignore_config, "all"))
> >>> +					continue;
> >>> +			}
> >>> +		}
> >>> +
> >>> +		/* Also show added or modified modules which are checked out */
> >>> +		cp_rev_parse.dir = p->sm_path;
> >>> +		cp_rev_parse.git_cmd = 1;
> >>> +		cp_rev_parse.no_stderr = 1;
> >>> +		cp_rev_parse.no_stdout = 1;
> >>> +
> >>> +		argv_array_pushl(&cp_rev_parse.args, "rev-parse",
> >>> +				 "--git-dir", NULL);
> >>> +
> >>> +		if (!run_command(&cp_rev_parse))
> >>
> >> I wonder whether we really need to waste an entire spawned process on
> >> figuring out whether `p->sm_path` refers to an active repository. Wouldn't
> >> `is_submodule_active(r, p->sm_path)` fulfill the same purpose?
> > 
> > Yep! This is correct. I will change.
> > 
> >>> +			generate_submodule_summary(info, p);
> >>> +	}
> >>> +}
> >>> +
> >>> +static void submodule_summary_callback(struct diff_queue_struct *q,
> >>> +				       struct diff_options *options,
> >>> +				       void *data)
> >>> +{
> >>> +	int i;
> >>> +	struct module_cb_list *list = data;
> >>> +	for (i = 0; i < q->nr; i++) {
> >>> +		struct diff_filepair *p = q->queue[i];
> >>> +		struct module_cb *temp;
> >>> +
> >>> +		if (!S_ISGITLINK(p->one->mode) && !S_ISGITLINK(p->two->mode))
> >>> +			continue;
> >>> +		temp = (struct module_cb*)malloc(sizeof(struct module_cb));
> >>> +		temp->mod_src = p->one->mode;
> >>> +		temp->mod_dst = p->two->mode;
> >>> +		temp->oid_src = p->one->oid;
> >>> +		temp->oid_dst = p->two->oid;
> >>> +		temp->status = p->status;
> >>> +		temp->sm_path = xstrdup(p->one->path);
> >>> +
> >>> +		ALLOC_GROW(list->entries, list->nr + 1, list->alloc);
> >>> +		list->entries[list->nr++] = temp;
> >>> +	}
> >>> +}
> >>> +
> >>> +static const char *get_diff_cmd(enum diff_cmd diff_cmd)
> >>> +{
> >>> +	switch (diff_cmd) {
> >>> +	case DIFF_INDEX: return "diff-index";
> >>> +	case DIFF_FILES: return "diff-files";
> >>> +	default: BUG("bad diff_cmd value %d", diff_cmd);
> >>> +	}
> >>> +}
> >>> +
> >>> +static int compute_summary_module_list(char *head,
> >>> +				         struct summary_cb *info,
> >>> +				         enum diff_cmd diff_cmd)
> >>> +{
> >>> +	struct argv_array diff_args = ARGV_ARRAY_INIT;
> >>> +	struct rev_info rev;
> >>> +	struct module_cb_list list = MODULE_CB_LIST_INIT;
> >>> +
> >>> +	argv_array_push(&diff_args, get_diff_cmd(diff_cmd));
> >>> +	if (info->cached)
> >>> +		argv_array_push(&diff_args, "--cached");
> >>> +	argv_array_pushl(&diff_args, "--ignore-submodules=dirty", "--raw",
> >>> +			 NULL);
> >>> +	if (head)
> >>> +		argv_array_push(&diff_args, head);
> >>> +	argv_array_push(&diff_args, "--");
> >>> +	if (info->argc)
> >>> +		argv_array_pushv(&diff_args, info->argv);
> >>> +
> >>> +	git_config(git_diff_basic_config, NULL);
> >>> +	init_revisions(&rev, info->prefix);
> >>> +	rev.abbrev = 0;
> >>> +	precompose_argv(diff_args.argc, diff_args.argv);
> >>> +
> >>> +	diff_args.argc = setup_revisions(diff_args.argc, diff_args.argv,
> >>> +					 &rev, NULL);
> >>> +	rev.diffopt.output_format = DIFF_FORMAT_NO_OUTPUT | DIFF_FORMAT_CALLBACK;
> >>> +	rev.diffopt.format_callback = submodule_summary_callback;
> >>> +	rev.diffopt.format_callback_data = &list;
> >>> +
> >>> +	if (!info->cached) {
> >>> +		if (diff_cmd ==  DIFF_INDEX)
> >>
> >> Please substitute the double-space by a single one.
> > 
> > Will do!
> > 
> >>> +			setup_work_tree();
> >>> +		if (read_cache_preload(&rev.diffopt.pathspec) < 0) {
> >>> +			perror("read_cache_preload");
> >>> +			return -1;
> >>> +		}
> >>> +	} else if (read_cache() < 0) {
> >>> +		perror("read_cache");
> >>> +		return -1;
> >>> +	}
> >>> +
> >>> +	if (diff_cmd == DIFF_INDEX)
> >>> +		run_diff_index(&rev, info->cached);
> >>> +	else
> >>> +		run_diff_files(&rev, 0);
> >>> +	prepare_submodule_summary(info, &list);
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static int module_summary(int argc, const char **argv, const char *prefix)
> >>> +{
> >>> +	struct summary_cb info = SUMMARY_CB_INIT;
> >>> +	int cached = 0;
> >>> +	int for_status = 0;
> >>> +	int quiet = 0;
> >>> +	int files = 0;
> >>> +	int summary_limit = -1;
> >>> +	struct child_process cp_rev = CHILD_PROCESS_INIT;
> >>> +	struct strbuf sb = STRBUF_INIT;
> >>> +	enum diff_cmd diff_cmd = DIFF_INDEX;
> >>> +	int ret;
> >>> +
> >>> +	struct option module_summary_options[] = {
> >>> +		OPT__QUIET(&quiet, N_("Suppress output of summarising submodules")),
> >>> +		OPT_BOOL(0, "cached", &cached,
> >>> +			 N_("Use the commit stored in the index instead of the submodule HEAD")),
> >>> +		OPT_BOOL(0, "files", &files,
> >>> +			 N_("To compare the commit in the index with that in the submodule HEAD")),
> >>> +		OPT_BOOL(0, "for-status", &for_status,
> >>> +			 N_("Skip submodules with 'ignore_config' value set to 'all'")),
> >>> +		OPT_INTEGER('n', "summary-limit", &summary_limit,
> >>> +			     N_("Limit the summary size")),
> >>> +		OPT_END()
> >>> +	};
> >>> +
> >>> +	const char *const git_submodule_helper_usage[] = {
> >>> +		N_("git submodule--helper summary [<options>] [commit] [--] [<path>]"),
> >>> +		NULL
> >>> +	};
> >>> +
> >>> +	argc = parse_options(argc, argv, prefix, module_summary_options,
> >>> +			     git_submodule_helper_usage, 0);
> >>> +
> >>> +	if (!summary_limit)
> >>> +		return 0;
> >>> +
> >>> +	cp_rev.git_cmd = 1;
> >>> +	argv_array_pushl(&cp_rev.args, "rev-parse", "-q", "--verify",
> >>> +			 argc ? argv[0] : "HEAD", NULL);
> >>
> >> Oy. Why not simply call `get_oid()`? No need to spawn a new process.
> > 
> > Then everytime we need 'rev-parse', we simply call 'get_oid()'? That
> > will save us a ton of processes?
> > 
> > But I think we do need to capture the output of 'git rev-parse --verify
> > ....' so I think it will backfire to use 'get_oid()' or am I just being
> > too dumb and not catching on something?
> > 
> 
> I'll leave this for others to answer.

I will resolve this one after Dscho answers then.

> >>> +
> >>> +	if (!capture_command(&cp_rev, &sb, 0)) {
> >>> +		strbuf_strip_suffix(&sb, "\n");
> >>> +		if (argc) {
> >>> +			argv++;
> >>> +			argc--;
> >>> +		}
> >>> +	} else if (!argc || !strcmp(argv[0], "HEAD")) {
> >>> +		/* before the first commit: compare with an empty tree */
> >>> +		struct stat st;
> >>> +		struct object_id oid;
> >>> +		if (fstat(0, &st) < 0 || index_fd(&the_index, &oid, 0, &st, 2,
> >>> +						  prefix, 3))
> >>> +			die("Unable to add %s to database", oid.hash);
> >>
> >> Umm. The original reads:
> >>
> >>                 # before the first commit: compare with an empty tree
> >>                 head=$(git hash-object -w -t tree --stdin </dev/null)
> >>
> >> It does not actually read from `stdin`. It reads from `/dev/null`,
> >> redirected to the input. And what it _actually_ does is to generate the
> >> OID of the empty tree.
> >>
> >> But we already _have_ the OID of the empty tree! It's
> >> `the_hash_algo->empty_tree`.
> > 
> > I did not know this 'the_hash_algo'. I will use it. Thanks! :)
> > 
> >> I hope that this is covered by the test suite. Please check that. The test
> >> would succeed with your version, but only because tests are run with
> >> `stdin` redirected from `/dev/null` by default.
> > 
> > I guess yes. My work passed because the tests are written this way.
> > 
> >>> +		strbuf_addstr(&sb, oid_to_hex(&oid));
> >>> +		if (argc) {
> >>> +			argv++;
> >>> +			argc--;
> >>> +		}
> >>> +	} else {
> >>> +		strbuf_addstr(&sb, "HEAD");
> >>> +	}
> >>
> >> The conversion to C would make for a fine excuse to simplify the logic.
> > 
> > This was kind of like the 'shift' in shell. What equivalent do you
> > suggest?
> > 
> 
> I think that's just a general comment after the other comments found
> just above about simplifying things.

Alright. But I do have to simplify the logic right?

> >>> +	if (files) {
> >>> +		if (cached)
> >>> +			die(_("--cached and --files are mutually exclusive"));
> >>> +		diff_cmd = DIFF_FILES;
> >>> +	}
> >>> +
> >>> +	info.argc = argc;
> >>> +	info.argv = argv;
> >>> +	info.prefix = prefix;
> >>> +	info.cached = !!cached;
> >>> +	info.for_status = !!for_status;
> >>> +	info.quiet = quiet;
> >>> +	info.files = files;
> >>> +	info.summary_limit = summary_limit;
> >>> +
> >>> +	ret = compute_summary_module_list((diff_cmd == DIFF_FILES) ? NULL : sb.buf,
> >>> +					   &info, diff_cmd);
> >>
> >> It would be better to pass the OID as `struct object_id *`, not as string.
> > 
> > Will do!
> > 
> >> Other than that, this patch nicely follows previous conversions from Unix
> >> shell scripts to C.
> >>
> >> Well done,
> >> Johannes
> > 
> > Thank you! It was a highly detailed review! I am still learning tons of
> > stuff about Git's code and such a review does help a lot! :)
> > 

> Hope this helps,

It surely did! Thanks Kaartic :0