Re: git-archive and tar options

Neal Kreitzinger <nkreitzinger@xxxxxxxxx> · Mon, 18 Jul 2011 14:31:58 -0500

On 7/15/2011 3:59 PM, René Scharfe wrote:
On 15.07.2011 01:30, Junio C Hamano wrote:
Jeff King <peff@xxxxxxxx> writes:

Why?

The tree you are writing out that way looks very different from
what is recorded in the commit object. What's the point of
introducing confusion by allowing many tarballs with different
contents written from the same commits with such tweaks all
labelled with the same pax header?

See my later message. I think it depends on how the embedded id
is used. Is it to say "this represents the tree of this git
commit"? Or is it to help people who later have a tarball and
have no clue which commit it might have come from?

People who have no clue which part of the subtree was extracted and
what leading path was added would still have to wonder where the
tree came from even with the embedded id. Without your patch, if
the tarball has an embedded id, wouldn't they at least be able to
assume it is the whole thing of that commit? If you label a
randomly mutated tree with the same label, you cannot tell the
genuine one from manipulated ones.

Not that I have strong opinions on this, either, but that is what I
meant by "_introducing_" confusion.

When we started to write the ID into generated archives, there was
only git-tar-tree and no <rev>:<path> syntax.  It would write the ID
only if it was given a commit and not if it got a tree or if the
user started it from a subdirectory.  The result was that only the
full tree of a commit was branded with the commit ID.

Now we have git archive, a more flexible command line syntax all
around, path limiting as well as attributes that can affect the
contents of the files in the archive.  Back then the commit ID was
sufficient as a concise and canonical label of the archive contents,
but now things are a bit more complicated.

Which use cases are we aiming for?  Do we want to include all of the
command line arguments (with revs resolved to SHA1-IDs)?  Only those
that modify archive contents?  And any applied attributes?  Or do we
want to get stricter and only write the commit ID if a full unchanged
tree of a commit is being archived?

In regards to the use cases you enumerated, I think logging the command
line syntax along with the appropriate ref context (HEAD value, etc)
would document exactly what's in the archive.

In regards to use cases in general, my impression is that git-archive is
for producing archives useful for deployment.  The target deployed
structure may vary, so expecting the source git repo to reflect it is
infeasible.  It seems like the local tar installation could be used to
effect the necessary transformations.  I'm not sure what problems a
version disparity between the source and target tar might cause.

A practical problem with the pax header is that it's only useful if you
still have the archive.  Archives usually get deleted after being
extracted.  Therefore, an option to also generate (and add to the
archive) an automatic "VERSION.TXT" file of some sort, which specifies
the context of the archive, would be much more useful.  It would need its
own --prefix option because oftentimes its location would be chosen
dynamically based on the git-archive request.
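
To make the idea concrete, here is a minimal sketch of the post-processing
I have in mind, not anything git-archive does itself.  It assumes the tar
output of git-archive is already in memory as bytes and that the caller
knows the commit ID and the command line that was used.  The function name
add_version_file and its name/prefix parameters are made up for
illustration.

```python
import io
import tarfile
import time


def add_version_file(archive_bytes, commit_id, command_line,
                     name="VERSION.TXT", prefix=""):
    """Return a copy of an uncompressed tar archive with a version file added.

    `name` and `prefix` are hypothetical knobs for this sketch; they are
    not git-archive options.
    """
    text = "commit: {}\ncommand: {}\n".format(commit_id, command_line)
    data = text.encode("utf-8")

    # Work on an in-memory copy; append mode positions the writer at the
    # end of the existing members and adds the new one after them.
    buf = io.BytesIO(archive_bytes)
    with tarfile.open(fileobj=buf, mode="a") as tar:
        info = tarfile.TarInfo(prefix + name)
        info.size = len(data)
        info.mtime = int(time.time())
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

After extraction the VERSION.TXT file stays in the deployed tree, so the
context survives even when the tarball and its pax header are deleted.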

Another use case is that it seems like there should also be the option 
to only tar the objects changed between a specified range of commits. 
However, I'm not sure if tar can handle deletions (moves, deletions, 
renames) upon extraction in this context.
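
As a rough sketch of how I would script this myself: assume the text
output of "git diff --name-status <old> <new>" has been captured, with
plain tab-separated paths (no -z, no quoted names).  The helper below
splits it into paths worth archiving and paths that would have to be
removed on the target some other way, since a plain tar archive carries
no record of deletions.  The name split_name_status is hypothetical.

```python
def split_name_status(diff_output):
    """Split `git diff --name-status` output into two path lists.

    Returns (to_archive, to_delete).  Assumes tab-separated fields and
    unquoted paths; only the A, M, T, D, R and C statuses are handled.
    """
    to_archive, to_delete = [], []
    for line in diff_output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0][0]  # "R100" or "C75" reduce to "R" / "C"
        if status == "D":
            to_delete.append(fields[1])
        elif status == "R":
            to_delete.append(fields[1])   # the old name goes away
            to_archive.append(fields[2])  # the new name must be shipped
        elif status == "C":
            to_archive.append(fields[2])  # the copy source stays in place
        elif status in "AMT":
            to_archive.append(fields[1])
        else:
            raise ValueError("unhandled status line: " + line)
    return to_archive, to_delete
```

The first list could then be given to git-archive as path arguments after
the tree-ish; the second list would have to travel separately (for
example as a text file) and be applied by whatever does the extraction.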

I can see that my use cases are something that I can script myself, but 
to do so it seems like I would be better off using a non-bare repo 
checkout as an intermediary.  If that is what I am expected to do then I 
am not sure what the usefulness of git-archive is intended to be.  Maybe 
I don't understand what others use it for.

v/r,
neal
