Re: [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile

Taylor Blau <me@xxxxxxxxxxxx> · Fri, 25 Oct 2024 17:18:51 -0400

On Fri, Oct 25, 2024 at 02:58:06AM -0400, Jeff King wrote:
> This patch fixes a regression in b1b8dfde69 (finalize_object_file():
> implement collision check, 2024-09-26) where fetching a v1 pack idx file
> over the dumb-http protocol would cause the fetch to fail.

Makes sense, and matches our discussion from elsewhere on the list
sometime last week.

> There are a few options to fix it:
>
>   - we could teach index-pack a command-line option to ignore only pack
>     idx collisions, and use it when the dumb-http code invokes
>     index-pack. This would be an awkward thing to expose users to and
>     would involve a lot of boilerplate to get the option down to the
>     collision code.
>
>   - we could delete the remote .idx file right before running
>     index-pack. It should be redundant at that point (since we've just
>     downloaded the matching pack). But it feels risky to delete
>     something from our own .git/objects based on what the other side has
>     said. I'm not entirely positive that a malicious server couldn't lie
>     about which pack-$hash.idx it has and get us to delete something
>     precious.
>
>   - we can stop co-mingling the downloaded idx files in our local
>     objects directory. This is a slightly bigger change but I think
>     fixes the root of the problem more directly.
>
> This patch implements the third option. The big design questions are:
> where do we store the downloaded files, and how do we manage their
> lifetimes?

Yep, the third option is definitely the most sensible here.

> I think the right solution here is probably some more complex cache
> management system: download the remote .idx files to their own storage
> directory, mark them as "seen" when we get their matching pack (to avoid
> re-downloading even if we repack), and then delete them when the
> server's objects/info/refs no longer mentions them.
>
> But since the dumb http protocol is so ancient and so inferior to the
> smart http protocol, I don't think it's worth spending a lot of time
> creating such a system. For this patch I'm just downloading the idx
> files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
> deleted when we exit (and due to the name, any we miss due to a crash,
> etc, should eventually be removed by "git gc" runs based on timestamps).
>
> That is slightly worse for one case: [...]

OK. I think the more robust version of the solution you identified here
makes sense and is probably the right thing to do if we wanted to spend
more time and effort on fixing this part of the dumb HTTP protocol.

But I'm willing to believe that there are so few users of the protocol
at this point that it's almost certainly not worth the effort. So I'm
supportive of the stopping point that you chose here.

> diff --git a/http.c b/http.c
> index d59e59f66b..9642cad2e3 100644
> --- a/http.c
> +++ b/http.c
> @@ -19,6 +19,7 @@
>  #include "string-list.h"
>  #include "object-file.h"
>  #include "object-store-ll.h"
> +#include "tempfile.h"
>
>  static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
>  static int trace_curl_data = 1;
> @@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
>  	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
>  	url = strbuf_detach(&buf, NULL);
>
> -	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
> -	tmp = strbuf_detach(&buf, NULL);
> +	/*
> +	 * Don't put this into packs/, since it's just temporary and we don't
> +	 * want to confuse it with our local .idx files.  We'll generate our
> +	 * own index if we choose to download the matching packfile.

I actually think putting it in $GIT_DIR/objects/pack is fine and perhaps
preferable, since that's where we put existing in-progress temporary
files. We'll ignore anything that doesn't end with "*.idx", so I think
it would be fine.

I don't feel strongly about it. It just feels a little weird to stick
these temporary files in one place, and stick other (similar) temporary
flies in another.

> diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
> index 3968b82260..706540ab74 100755
> --- a/t/t5550-http-fetch-dumb.sh
> +++ b/t/t5550-http-fetch-dumb.sh
> @@ -335,7 +335,7 @@ test_expect_success 'fetch can handle previously-fetched .idx files' '
>  	count_fetches 1 pack one.trace &&
>  	GIT_TRACE_CURL=$PWD/two.trace git --git-dir=clone_packed_branches.git \
>  		fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2 &&
> -	count_fetches 0 idx two.trace &&
> +	count_fetches 1 idx two.trace &&
>  	count_fetches 1 pack two.trace
>  '
>
> @@ -521,4 +521,14 @@ test_expect_success 'fetching via http alternates works' '
>  	git -c http.followredirects=true clone "$HTTPD_URL/dumb/alt-child.git"
>  '
>
> +test_expect_success 'dumb http can fetch index v1' '
> +	server=$HTTPD_DOCUMENT_ROOT_PATH/idx-v1.git &&
> +	git init --bare "$server" &&
> +	git -C "$server" --work-tree=. commit --allow-empty -m foo &&
> +	git -C "$server" -c pack.indexVersion=1 gc &&
> +
> +	git clone "$HTTPD_URL/dumb/idx-v1.git" &&
> +	git -C idx-v1 fsck
> +'

Very nicely done.

Thanks,
Taylor