From: "Jeff King" <peff@xxxxxxxx>: Friday, August 22, 2014 12:21 AM
On Thu, Aug 21, 2014 at 06:49:10PM -0400, Jeff King wrote:
The few things I don't anonymize are:
  1. ref prefixes. We see the same distribution of refs/heads vs
     refs/tags, etc.
  2. refs/heads/master is left untouched, for convenience (and because
     it's not really a secret). The implementation is lazy, though, and
     would leave "refs/heads/master-supersecret", as well. I can tighten
     that if we really want to be careful.
  3. gitlinks are left untouched, since sha1s cannot be reversed. This
     could leak some information (if your private repo points to a
     public, I can find out you have it as submodule). I doubt it
     matters, but we can also scramble the sha1s.
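To illustrate the concern in item 2, here is a minimal sketch in Python (not the C code from the patch). It assumes the "lazy" check amounts to a plain prefix comparison, which would explain why "refs/heads/master-supersecret" slips through. The function names are hypothetical.

```python
def keeps_ref_lazy(refname: str) -> bool:
    # Assumed lazy behaviour: a prefix test, so anything that merely
    # starts with "refs/heads/master" is left untouched.
    return refname.startswith("refs/heads/master")


def keeps_ref_strict(refname: str) -> bool:
    # Tightened behaviour: only the master branch itself is left untouched.
    return refname == "refs/heads/master"


if __name__ == "__main__":
    for ref in ("refs/heads/master", "refs/heads/master-supersecret"):
        print(ref, keeps_ref_lazy(ref), keeps_ref_strict(ref))
```

With the strict test, a branch that only shares the prefix is anonymized like any other ref.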
Here's a re-roll that addresses the latter two. I don't think any are a
big deal, but it's much easier to say "it's handled" than try to figure
out whether and when it's important.
This also includes the documentation update I sent earlier. The
interdiff is a bit noisy, as I also converted the anonymize_mem function
to take void pointers (since it doesn't know or care what it's storing,
and this makes storing unsigned chars for sha1s easier).
Just a bit of bikeshedding for future improvements...
The .gitignore is another potential user problem area that may benefit
from not being anonymised when problems strike. For example, there's a
current problem on the git-users list
https://groups.google.com/forum/#!topic/git-users/JJFIEsI5HRQ about "git
clean vs git status re .gitignore", which would then also raise questions
about retaining file extensions/suffixes (.txt, .o, .c, etc).
I've had a similar problem with an overzealous file compare routine,
where the same "too much vs too little" trade-off was an issue.
One thought is that the user should be able to, as an option, select the
number of initial characters retained from filenames, and similarly, the
option to retain the file extension, and possibly directory names, such
that the full .gitignore still works in most cases, and the sort order
works (as far as it goes on number of characters).
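As a rough illustration of that idea, here is a sketch in Python. It is not part of the patch, and the function name, its parameters, and the "path<N>" replacement format are all hypothetical. It keeps a configurable number of leading characters and, optionally, the file extension.

```python
import os


def anonymize_name(name: str, index: int,
                   keep_chars: int = 0, keep_ext: bool = False) -> str:
    # Split "report.txt" into ("report", ".txt"). Note that splitext
    # treats a dotfile such as ".gitignore" as having no extension.
    stem, ext = os.path.splitext(name)
    prefix = stem[:keep_chars]          # leading characters to retain
    suffix = ext if keep_ext else ""    # extension to retain, if asked
    return f"{prefix}path{index}{suffix}"


if __name__ == "__main__":
    print(anonymize_name("secret_report.txt", 3, keep_chars=2, keep_ext=True))
    print(anonymize_name("main.c", 0))
```

Retaining the extension keeps patterns like "*.o" in a .gitignore meaningful, and retaining the first few characters preserves the sort order only as far as those characters go.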
All things for future improvers to consider.
Philip
-- >8 --
Subject: teach fast-export an --anonymize option
Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
"anonymized" repository could then be shared with developers
(assuming it still replicates the original problem).
This patch implements an "--anonymize" option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:
    git fast-export --anonymize --all |
    perl -pe 's/\d+/X/g' |
    sort -u |
    less
which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).
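The numbering scheme described here can be sketched as follows. This is an illustrative Python model, not the C implementation in the patch. The Anonymizer class and the sample names are hypothetical, and the final step mimics the perl/sort pipeline by collapsing digits to "X" and keeping unique lines.

```python
import re


class Anonymizer:
    """Map each distinct original value to a stable numbered token."""

    def __init__(self, label: str):
        self.label = label
        self.seen = {}

    def __call__(self, original: str) -> str:
        # The first sighting assigns the next number; later sightings
        # of the same value reuse the same token.
        if original not in self.seen:
            self.seen[original] = f"{self.label} {len(self.seen)}"
        return self.seen[original]


users = Anonymizer("User")
lines = [f"author {users(name)}" for name in ("alice", "bob", "alice")]

# Equivalent in spirit to: perl -pe 's/\d+/X/g' | sort -u
overview = sorted({re.sub(r"\d+", "X", line) for line in lines})

if __name__ == "__main__":
    print(lines)
    print(overview)
```

Because the same input always maps to the same token, the shape of the history survives, while the overview shrinks to a handful of line templates that are quick to audit by eye.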
In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:
    $ time git fast-export --anonymize --all \
        --tag-of-filtered-object=drop >output
    real    0m2.883s
    user    0m2.828s
    sys     0m0.052s
    $ gzip output
    $ ls -lh output.gz | awk '{print $5}'
    2.9M
Signed-off-by: Jeff King <peff@xxxxxxxx>
---
[...]