Re: [PATCH v5 2/2] gitweb: append short hash ids to snapshot files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 12 Oct 2009, Mark Rada wrote:
> On 09-10-05 6:06 AM, Jakub Narebski wrote:
> 
>>>  	my $o_git_dir = $git_dir;
>>>  	my $retval = undef;
>>>  	$git_dir = "$projectroot/$project";
>>> -	if (open my $fd, "-|", git_cmd(), "rev-parse", "--verify", "HEAD") {
>>> -		my $head = <$fd>;
>>> +	if (open my $fd, '-|', git_cmd(), 'rev-parse', '--verify', $hash) {
>>> +		$hash = <$fd>;
>>>  		close $fd;
>>> -		if (defined $head && $head =~ /^([0-9a-fA-F]{40})$/) {
>>> +		if (defined $hash && $hash =~ /^([0-9a-fA-F]{40})$/) {
>>> +			$retval = $1;
>>> +		}
>> 
>> I guess that you use "$retval = $1;" instead of just "$retval = $hash;"
>> because of similarities with git_get_short_hash, isn't it?  Or it is just
>> following earlier code?
> 
> Yeah, it is following earlier code, I did not change it, 

Ah, that's O.K.

Although if you plan refactoring this, you might fix this bit of
inefficiency (no need for capturing).

> though the diff seems to think I added it, perhaps this is a bug with
> diff? 

No, that is just diff being ambiguous, as there is more than one way
to generate diff of changes.  Perhaps patience diff would produce
better results, perhaps not.  It might mean that refactoring common
code is needed ;-)))))

>>> +	}
>>> +	if (defined $o_git_dir) {
>>> +		$git_dir = $o_git_dir;
>>> +	}
>>> +	return $retval;
>>> +}
>>> +
>>> +# try and get a shorter hash id
>>> +sub git_get_short_hash {
>>> +	my $project = shift;
>>> +	my $hash = shift;
>>> +	my $o_git_dir = $git_dir;
>>> +	my $retval = undef;
>>> +	$git_dir = "$projectroot/$project";
>>> +	if (open my $fd, '-|', git_cmd(), 'rev-parse', '--short=7', $hash) {
>>> +		$hash = <$fd>;
>>> +		close $fd;
>>> +		if (defined $hash && $hash =~ /^([0-9a-fA-F]{7,})$/) {
>>>  			$retval = $1;
>>>  		}
>>>  	}
>> 
>> Note that git_get_full_hash (which additionally does verification) and
>> git_get_short_hash share much of code.  Perhaps it might be worth to
>> avoid code duplication somehow?  On the other hand it might be not worth
>> to complicate code by trying to extract common parts here...
> 
> Hmm, I think it might be a good idea to just write a generic routine
> that takes a hash length as an extra parameter. Then the short and full
> hash fetching routines can just acts as wrappers.

Well, git_get_full_hash uses --verify, git_get_short_hash uses --short=7
(but perhaps it should also use --verify).

BTW. I think that checking that output of git-rev-parse is (shortened)
SHA-1 predates usage of --verify; with --verify is, I think, not
necessary:

 --verify
     The parameter given must be usable as a single, valid object name.
     Otherwise barf and abort.

>>> @@ -5203,6 +5228,13 @@ sub git_snapshot {
>>>  		die_error(400, 'Object is not a tree-ish');
>>>  	}
>>>  
>>> +
>>> +	my $full_hash = git_get_full_hash($project, $hash);
>>> +	if ($full_hash =~ /^$hash/) {

BTW, we can use  

        if (index($full_hash, $hash) == 0) {

instead.  BTW, $hash could contain regexp metacharacters like '.'
('dead.beef' is a valid branch name), so it should be

	if ($full_hash =~ /^\Q$hash/) {

if you want to use regexp (it might be easier to read).  

Or you can encapsulate this into is_substring() subroutine, but that
might be (well, almost surely is) overkill...

>>> +		$hash = git_get_short_hash($project, $hash);
>>> +	} else {
>>> +		$hash .= '-' . git_get_short_hash($project, $hash);
>>> +	}
>> 
>> I think we might want to avoid calling git_get_full_hash (and extra call
>> to "git rev-parse" command, which is extra fork) if we know in advance
>> that  $full_hash =~ /^$hash/  can't be true, i.e. if $hash doesn't match
>> /^[0-9a-fA-F]+$/.  That would require that we continue to use $hash
>> and not $full_hash, see comment for the chunk below.
>> 
>> BTW do you think that having better name (nicer name in the case
>> when $hash is full SHA-1, or name which describes exact version as 
>> in the case when $hash is branch name or just 'HEAD') is worth
>> slight extra cost of "git rev-parse --abbrev=7"?
> 
> Hmm, yeah, some optimization will have to occur in that block of
> code. Though, my reason for that extra call to rev-parse to get the
> short hash is so I can get git to find the shortest unique SHA-1,
> instead of just assuming that it will always be of length 7. I think
> the cost is not too bad considering a snapshot will have to be generated
> and probably take way more time. Though, warthog9 has some caching
> patches that work, so maybe it isn't worth it. Hmm...

What I meant here that unless $hash =~ /^[0-9a-fA-F]{7,}$/ then we 
always use git_get_short_hash, as $full_hash wouldn't match /^$hash/
($hash wouldn't be a prefix of $full_hash).  We don't need to
calculate git_get_full_hash which wouldn't be used (see also comment
below, though).

> 
>>>  	my $name = $project;
>>>  	$name =~ s,([^/])/*\.git$,$1,;
>>>  	$name = basename($name);
>>> @@ -5213,7 +5245,7 @@ sub git_snapshot {
>>>  	$cmd = quote_command(
>>>  		git_cmd(), 'archive',
>>>  		"--format=$known_snapshot_formats{$format}{'format'}",
>>> -		"--prefix=$name/", $hash);
>>> +		"--prefix=$name/", $full_hash);
>> 
>> Why this change?
> 
> Since $hash can change by becoming something like 'HEAD-43ab5f2c' due to
> the process of creating the better name we need to pass something to
> `archive' that will be valid, and $full_hash will be valid.

Errr... why it is called _$hash_ then, if it can be not hash?  Wouldn't
it be better to manipulate $name here?

I think this fragment should be extracted into snapshot_name() subroutine,
which result would be used both as proposed snapshot name, and as prefix
to be used.

> 
>>> +test_description='gitweb as standalone script (parsing script output).
>>> +
>>> +This test runs gitweb (git web interface) as a CGI script from the
>>> +commandline, and checks that it produces the correct output, either
>>> +in the HTTP header or the actual script output.'
>> 
>> Currently all tests here are about 'snapshot' action.  They are quite
>> specific, and they do require some knowledge about chosen archive format.

That is not true, as I haven't noticed at this point that you are 
examining only HTTP headers... but not the HTTP status but other headers.

[...]
>>> +	grep ".git-$ID.tar.gz" gitweb.output
>> 
>> Here had to think a bit that gitweb.output consists both of HTTP headers,
>> and of response body, and you are grepping here in the HTTP headers part.
>> It would be better solution for gitweb_run to split gitweb.output into
>> gitweb.headers and gitweb.body (perhaps if requested by setting some
>> variable, e.g. GITWEB_SPLIT_OUTPUT).
>> 
>> It can be done using the following lines:
>> 
>> 	sed    -e '/^\r$/'      <gitweb.output>gitweb.headers

That was meant to be

>> 	sed    -e '/^\r$/q'     <gitweb.output >gitweb.headers

which means print (the default action) until single empty CRLF terminated
line, which ends HTTP headers.

>> 	sed -n -e '0,/^\r$/!p'  <gitweb.output>gitweb.body
>> 
>> 	# gitweb.headers is used to parse http headers
>> 	# gitweb.body is response without http headers
>> 
>> But the second one uses GNU sed extension; I don't know how to write
>> it in more portable way.
> 
> I like this and will try to find a way of setting this up without using
> GNU extensions.

Well, we do know that there always would be at least one header, so we
can use:

 	sed -n -e '1,/^\r$/!p'  <gitweb.output >gitweb.body

But I'd prefer that somebody better versed in sed would come up with
solution to extract everything up to first empty CRLF terminated line,
and everything from such line till the end of file.

>> Note that to avoid ambiguities currently gitweb uses refs/heads/master
>> and refs/tags/SnapshotFileTests... but dealing with this issue should be
>> left, I think, for separate commit.
>> 
> 
> I do not understand what ambiguity exists, can you please explain this?

The problem I was thinking about is the following.

In commit bf901f8 (gitweb: disambiguate heads and tags withs the same
name, 2007-12-15) started to use refs/heads/<branch> and refs/tags/<tag>
instead of <branch> and <tag> because there was problem when there were
tag and branch with the same name.

The problem is that we can't use '/' in proposed snapshot file name,
and we shouldn't use '/' in git-archive prefix.  So we can't simply
use (as you proposed) 

  $hash . '-' . git_get_short_hash($project, $hash);

as a snapshot basename suffix, because $hash can be 'refs/heads/master',
or it can be 'mr/gitweb-snapshot'.  What to do, what to do...


Also if $hash is refs/tags/v1.6.0, we don't really need shortened SHA-1
suffix.  

Alternative to checking for refs/tags/ prefix would be to use
git-describe output... perhaps.

-- 
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]