On Sun, Jun 03, 2018 at 09:48:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Sure, but what I'm pointing out is a) you can't focus on git as the
> technology because it tells you nothing about what's being done with it
> (e.g. the log file case I mentioned b) nobody who came up with the GDPR
> was concerned with some free software projects or the SCM used by
> companies, so this is very unlikely to be enforced.

As I already said, the GDPR refers to the state of the art in
technology, without defining it.

The GDPR provides a generic framework. It covers everyone, from a
single person running a small blog to an S&P 500 enterprise. It also
covers non-profits and state authorities. Everyone is covered,
including the SCM they use.

The GDPR will be enforced against SCMs. The question is just who will
be the first to be affected. I suspect it will be a mega-corporation
that fired one of its developers, who then wants to fight back and
exercise his right to be forgotten against the company's public git
repos.

> So nobody can be GDPR compliant in the face of archive.org and the like?

The GDPR has special exceptions for archives and the like.

> It does if you've got the ref. Maybe I just don't get your proposal,
> quote:
>
>     Do not hash anything directly to obtain the commit ID. Instead, hash a
>     list of hashes of [$random_number, $information] pairs. $information
>     could be an author id, a commit date, a comment, or anything else. Then
>     store the commit id, the list of hashes, and the list of pairs to form
>     the commit.
>
> You're just proposing (if I've read this correctly) that the commit
> object should have some list of headers pointing to other SHA1s, and
> that fsck and the like be OK with these going away. Right?

Certainly not SHA1. SHA1 is completely broken. I know Linus has a bit
of a different opinion, but there's really no defense for SHA1. It's an
utterly broken algorithm and should not be used at all anymore.
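
To make the quoted proposal concrete, here is a minimal Python sketch of
a commit ID computed over a list of hashes of [$random_number,
$information] pairs. It is an illustration only, not an actual git
object format: SHA-256, the dict layout, and the names hash_pair,
make_commit, erase and verify are all assumptions made for this sketch.

```python
import hashlib
import secrets

# Assumption: SHA-256 stands in for "some hash that is not SHA1".
HASH = hashlib.sha256
# The random number is as wide as the hash output (fixed width, so
# salt + information can be concatenated without ambiguity).
SALT_BYTES = HASH().digest_size


def hash_pair(salt: bytes, information: bytes) -> bytes:
    """Hash one [$random_number, $information] pair."""
    return HASH(salt + information).digest()


def make_commit(fields: list) -> dict:
    """Build a hypothetical commit: the ID, the list of hashes, the list of pairs."""
    pairs = [(secrets.token_bytes(SALT_BYTES), info) for info in fields]
    hashes = [hash_pair(salt, info) for salt, info in pairs]
    # The commit ID depends only on the list of hashes, never on the data itself.
    commit_id = HASH(b"".join(hashes)).hexdigest()
    return {"id": commit_id, "hashes": hashes, "pairs": pairs}


def erase(commit: dict, index: int) -> None:
    """Drop one pair (e.g. the author) while keeping its hash in the list."""
    commit["pairs"][index] = None


def verify(commit: dict) -> bool:
    """Check the ID against the hash list and each surviving pair against its hash."""
    if HASH(b"".join(commit["hashes"])).hexdigest() != commit["id"]:
        return False
    return all(
        pair is None or hash_pair(*pair) == h
        for pair, h in zip(commit["pairs"], commit["hashes"])
    )


commit = make_commit([b"Peter", b"2018-06-03", b"some commit message"])
erase(commit, 0)        # the author pair is gone ...
print(verify(commit))   # ... but the commit ID still checks out
```

Because every pair gets a fresh random number, the same author name
hashes to a different value in every commit, and erasing a pair leaves
the commit ID unchanged.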

> How is this intrinsically different from referring to something in the
> ref namespace that may be deleted in the future?

I guess I am partly repeating myself, but:

1. Having fsck be OK with erasure is not enough. It tells you nothing
   about anonymization. If the hash is the same in 5000 instances
   that's pseudonymization, not anonymization. You need to ensure a
   different hash in each instance, and you need to ensure there's no
   easy way to reconstruct the data from its hash. Hence $random_number
   (or let's call it $huge_random_number, it should have x bits if the
   hash has x bits). If you have the SHA1
   64ca93f83bb29b51d8cbd6f3e6a8daff2e08d3ec it's too easy to figure out
   the plaintext (it's "Peter" BTW).

2. If you use a random UUID you cannot reconstruct the data from its
   hash, but you have the same issue about UUID reuse. Plus, you lose
   the ability to verify the author's name as part of the commit.

> Okey, so you're not reading the GDPR in some literal sense, but you're
> coming to a conclusion that's supported by ... what? To echo Theodore
> Y. Ts'o E-Mail have you consulted with someone who's an actual lawyer on
> this subject?

I'm replying in private conversation about this one. It's not relevant
for this discussion.

Best wishes
Peter

-- 
Peter Backes, rtc@xxxxxxxxxxxxxxxxxxx