Hello: Git patch-id is often used as convenient way to represent patches based on their content. It accomplishes this by attempting to normalize a patch before producing a hash of the result -- most notably, by trimming a lot of whitespace. Unfortunately, this does not work well with syntactically-significant whitespace languages, notably Python and Make. For example, the following two patches produce identical patch-id's, but one of them is actually malicious. Benign: diff --git a/file1.py b/file1.py index e574c49..6aa1937 100644 --- a/file1.py +++ b/file1.py @@ -1,3 +1,13 @@ #!/usr/bin/python +def is_logged_in(cookie): + if cookie: + print('User is logged in') + return True + + return False + +if is_logged_in(True): + print('You are logged in') + print('Hello!') Malicious ("return True" is unindented, which results in is_logged_in() always returning "True"): diff --git a/file1.py b/file1.py index e574c49..6aa1937 100644 --- a/file1.py +++ b/file1.py @@ -1,3 +1,13 @@ #!/usr/bin/python +def is_logged_in(cookie): + if cookie: + print('User is logged in') + return True + + return False + +if is_logged_in(True): + print('You are logged in') + print('Hello!') This mostly becomes a problem if we try to build any kind of patch indexing/retrieval systems that rely on patch-id to identify patches. While this is not a high-impact problem by any means, it's not a theoretical concern: git-format-patch includes functionality to provide patch dependencies via prerequisite-patch-id trailers [1]. An automated system attempting to auto-fetch dependencies can potentially retrieve and apply the malicious version of the patch. I'm not sure what the solution here is, since changing git-patch-id to not discard whitespace is obviously going to defeat its entire purpose of "not ever changing". I mostly wanted to share my findings in case someone has thoughts on how to best approach this. -K [1] https://git-scm.com/docs/git-format-patch#_base_tree_information