Re: [GSoC][PATCH 2/2] refs: add name and content check for file backend

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 31 May 2024 07:23:31 -0700

shejialuo <shejialuo@xxxxxxxxx> writes:

> +static int files_fsck_refs_content(const char *refs_check_dir,
> +				struct dir_iterator *iter)
> +{
> +	struct strbuf ref_content = STRBUF_INIT;

The caller makes sure that this gets called only on a regular file,
so ...

> +	if (strbuf_read_file(&ref_content, iter->path.buf, 0) < 0) {

... the use of strbuf_read_file() is fine (i.e. we do not have to
worry about the iter->path to be pointing at a symbolic link) here.

> +		error(_("%s/%s: unable to read the ref"), refs_check_dir, iter->basename);
> +		goto clean;
> +	}
> +
> +	/*
> +	 * Case 1: check if the ref content length is valid and the last
> +	 * character is a newline.
> +	 */
> +	if (ref_content.len != the_hash_algo->hexsz + 1 ||
> +			ref_content.buf[ref_content.len - 1] != '\n') {

Funny indentation.

	if (ref_content.len != the_hash_algo->hexsz + 1 ||
	    ref_content.buf[ref_content.len - 1] != '\n') {

Also, we do not want {braces} around a single statement block.

In any case, the two checks are good ONLY for regular refs and not
for symbolic refs.  The users are free to create symbolic refs next
to their branches, e.g. here is a way to say, "among the maintenance
tracks maint-2.30, maint-2.31, ... maint-2.44, maint-2.45, what I
consider the primary maintenance track is currently maint-2.45":

    $ git symbolic-ref refs/heads/maint
    refs/heads/maint-2.45

and if your directory walk encounters such a symbolic ref, your
strbuf_read_file() will read something like:

    $ cat .git/refs/heads/maint
    ref: refs/heads/maint-2.45

Also, the caller also needs to be prepared to find a real symbolic
link that is used as a symbolic ref, but for them you'd need to do a
readlink() and the expected contents would not have "ref: " prefix,
so the code to implement actual check would have to be different.
The way how refs/files-backend.c:read_ref_internal() handles S_ISLNK
would be illuminating.

> +		goto failure;
> +	}
> +	/*
> +	 * Case 2: the content should be range of [0-9a-f].
> +	 */
> +	for (size_t i = 0; i < the_hash_algo->hexsz; i++) {
> +		if (!isdigit(ref_content.buf[i]) &&
> +				(ref_content.buf[i] < 'a' || ref_content.buf[i] > 'f')) {
> +			goto failure;

I do not think it is a good idea to suddenly redefine what a valid
way to write object names in a loose ref file after ~20 years.
Given the popularity of Git, I would not be surprised at all if a
third-party tool or two have been writing their own refs with
uppercase hexadecimal and we have been happily using them as
everybody expects.  It is a good idea to be strict when you are a
producer, and we do write object names always in lowercase hex, but
we are lenient when consuming what is possibly written by others.
As long as the hexadecimal string names an existing object
(otherwise we have a dangling ref which would be another kind of
error, I presume), it should be at most FSCK_WARN when it does not
look like what _we_ wrote (e.g., if it uses uppercase hex, or if it
has extra bytes after the 40-hex (or 64-hex) other than the final
LF).  If it is shorter, of course that is an outright FSCK_ERROR.

How well does this interact with the fsck error levels (aka
fsck_msg_type), by the way?  It should be made to work well if the
current design does not.

> +		}
> +	}
> +
> +	strbuf_release(&ref_content);
> +	return 0;
> +
> +failure:
> +	error(_("%s/%s: invalid ref content"), refs_check_dir, iter->basename);

In addition, when we honor fsck error levels, such an unconditional
call to "error" may become an issue.  If the user says a certain
kind of anomaly is tolerated by setting a specific finding to
FSCK_IGNORE, we should be silent about our finding.

> +clean:
> +	strbuf_release(&ref_content);
> +	return -1;
> +}
> +
> +static int files_fsck_refs(struct ref_store *ref_store,
> +				const char* refs_check_dir,
> +				files_fsck_refs_fn *fsck_refs_fns)
> +{
> +	struct dir_iterator *iter;
> +	struct strbuf sb = STRBUF_INIT;
> +	int ret = 0;
> +	int iter_status;
> +
> +	strbuf_addf(&sb, "%s/%s", ref_store->gitdir, refs_check_dir);
> +
> +	iter = dir_iterator_begin(sb.buf, 0);
> +
> +	/*
> +	 * The current implementation does not care about the worktree, the worktree
> +	 * may have no refs/heads or refs/tags directory. Simply return 0 now.
> +	*/
> +	if (!iter) {
> +		return 0;
> +	}
> +
> +	while ((iter_status = dir_iterator_advance(iter)) == ITER_OK) {
> +		if (S_ISDIR(iter->st.st_mode)) {
> +			continue;
> +		} else if (S_ISREG(iter->st.st_mode)) {
> +			for (files_fsck_refs_fn *fsck_refs_fn = fsck_refs_fns;
> +					*fsck_refs_fn; fsck_refs_fn++) {

Funny indentation (again---the patch 2/2 has funny "two extra HT"
indentation style all over the place).  Write it like so:

			for (files_fsck_refs_fn *fn = fsck_refs_fns;
			     fn;
                             fn++)

> +				ret |= (*fsck_refs_fn)(refs_check_dir, iter);

A pointer to a function does not need to be dereferenced explicitly
like so.  Also, writing

				if (fn(refs_check_dir, iter))
					ret = -1;

would be more consistent with the rest of the code around here,
which does not OR int ret but explicitly set it to -1 when any
helper function detects an error.

> +			}
> +		} else {
> +			error(_("unexpected file type for '%s'"), iter->basename);
> +			ret = -1;

This is wrong.  A symbolic link is a valid symbolic ref, even though
we no longer create such a symbolic ref by default.

> +		}
> +	}
> +
> +	if (iter_status != ITER_DONE) {
> +		ret = -1;
> +		error(_("failed to iterate over '%s'"), sb.buf);
> +	}
> +
> +	strbuf_release(&sb);
> +
> +	return ret;
> +}
> +
> +static int files_fsck(struct ref_store *ref_store)
> +{
> +	int ret = 0;
> +
> +	files_fsck_refs_fn fsck_refs_fns[] = {
> +		files_fsck_refs_name,
> +		files_fsck_refs_content,
> +		NULL
> +	};
> +
> +	ret = files_fsck_refs(ref_store, "refs/heads",fsck_refs_fns)

Missing SP after a comma.

> +	    | files_fsck_refs(ref_store, "refs/tags", fsck_refs_fns);

Why only these two hierarchies?  Shouldn't it also be checking the
remote tracking branches and notes?

> +	return ret;
> +}
> +
>  struct ref_storage_be refs_be_files = {
>  	.name = "files",
>  	.init = files_ref_store_create,