Re: [PATCH v2] teach fast-export an --anonymize option

"Philip Oakley" <philipoakley@xxxxxxx> · Fri, 22 Aug 2014 19:39:59 +0100

From: "Jeff King" <peff@xxxxxxxx>: Friday, August 22, 2014 12:21 AM
On Thu, Aug 21, 2014 at 06:49:10PM -0400, Jeff King wrote:

The few things I don't anonymize are:

  1. ref prefixes. We see the same distribution of refs/heads vs
     refs/tags, etc.

  2. refs/heads/master is left untouched, for convenience (and because
     it's not really a secret). The implementation is lazy, though, and
     would leave "refs/heads/master-supersecret", as well. I can tighten
     that if we really want to be careful.

  3. gitlinks are left untouched, since sha1s cannot be reversed. This
     could leak some information (if your private repo points to a
     public one, I can find out you have it as a submodule). I doubt it
     matters, but we can also scramble the sha1s.
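
The laziness described in item 2 reads like a prefix comparison. A minimal
shell sketch of the difference between a prefix match and an exact match
follows. The helper names (is_master_lazy, is_master_strict) are
hypothetical and are not taken from the patch, which is C code:

```shell
#!/bin/sh
# Hypothetical helpers illustrating item 2; not the patch's actual code.

# Prefix match: also accepts "refs/heads/master-supersecret".
is_master_lazy() {
    case "$1" in
    refs/heads/master*) return 0 ;;
    *) return 1 ;;
    esac
}

# Exact match: accepts only "refs/heads/master".
is_master_strict() {
    [ "$1" = "refs/heads/master" ]
}

for ref in refs/heads/master refs/heads/master-supersecret; do
    if is_master_lazy "$ref"; then echo "lazy keeps:   $ref"; fi
    if is_master_strict "$ref"; then echo "strict keeps: $ref"; fi
done
```

With the prefix match both refs are left readable; with the exact match
only refs/heads/master is.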

Here's a re-roll that addresses the latter two. I don't think any are a
big deal, but it's much easier to say "it's handled" than try to figure
out whether and when it's important.

This also includes the documentation update I sent earlier. The
interdiff is a bit noisy, as I also converted the anonymize_mem function
to take void pointers (since it doesn't know or care what it's storing,
and this makes storing unsigned chars for sha1s easier).

Just a bit of bikeshedding for future improvements..

The .gitignore file is another potential problem area for users, and may
benefit from not being anonymised when problems strike. For example,
there's a current problem on the git-users list
https://groups.google.com/forum/#!topic/git-users/JJFIEsI5HRQ about "git
clean vs git status re .gitignore", which would then also raise questions
about retaining file extensions/suffixes (.txt, .o, .c, etc).

I've had a similar problem with an overzealous file compare routine,
where the same tension between keeping too much and too little was an
issue.

One thought is that the user should be able to, as an option, select the
number of initial characters retained from filenames, and similarly, the
option to retain the file extension, and possibly directory names, such
that the full .gitignore still works in most cases, and the sort order
is preserved (as far as the retained characters go).
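
As a rough illustration of that idea, here is a shell sketch that
anonymises a single path component while keeping a chosen number of
leading characters and the final extension. Everything in it is an
assumption made for illustration: fast-export has no such options, and
the anon_name helper, its arguments and the "path<N>" output format are
hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of the proposal above; fast-export has no such options.
# anon_name NAME KEEP ID
#   NAME  one path component, e.g. "parser.c"
#   KEEP  number of leading characters to retain
#   ID    counter assigned to this name by the caller
anon_name() {
    name=$1 keep=$2 id=$3

    # Split off the final ".ext" suffix, if any. A leading dot
    # (as in ".gitignore") is not treated as an extension.
    case "$name" in
    ?*.*) ext=.${name##*.} stem=${name%.*} ;;
    *)    ext= stem=$name ;;
    esac

    # Retain the first KEEP characters of the stem, if requested.
    if [ "$keep" -gt 0 ]; then
        prefix=$(printf '%s' "$stem" | cut -c "1-$keep")-
    else
        prefix=
    fi

    printf '%spath%s%s\n' "$prefix" "$id" "$ext"
}

anon_name parser.c 2 7
anon_name Makefile 0 3
```

The two calls print "pa-path7.c" and "path3". A pattern such as "*.c" in
.gitignore would still match the first result, and names sharing the
retained prefix stay grouped together when sorted.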

All things for future improvers to consider.

Philip

-- >8 --
Subject: teach fast-export an --anonymize option

Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
"anonymized" repository could then be shared with developers
(assuming it still replicates the original problem).

This patch implements an "--anonymize" option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:

 git fast-export --anonymize --all |
 perl -pe 's/\d+/X/g' |
 sort -u |
 less

which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).

In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:

 $ time git fast-export --anonymize --all \
        --tag-of-filtered-object=drop >output
 real    0m2.883s
 user    0m2.828s
 sys     0m0.052s

 $ gzip output
 $ ls -lh output.gz | awk '{print $5}'
 2.9M

Signed-off-by: Jeff King <peff@xxxxxxxx>
---
[...] 
