On 10/13/2017 9:15 AM, Derrick Stolee wrote:
On 10/13/2017 8:44 AM, Jeff King wrote:
On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:
On 13.10.2017 15:04, Junio C Hamano wrote:
Christian Couder <christian.couder@xxxxxxxxx> writes:
Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?
Aren't you suggesting to optimize for a wrong case?
Anything that is possible with a software should be considered as a
possible
usecase. It's in fact a DoS attack. Imagine there's a server that
using git
to process something, and now there's a way to knock down this
server. It's
also bad from a promotional stand point.
But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).
Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.
As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).
I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).
-Peff
I'm particularly interested in why `git rev-list HEAD -- [path]` gets
slower in this case, because I wrote the history algorithm used by
VSTS. In our algorithm, we only walk the list of objects from commit
to the tree containing the path item. For example, in the path
d0/d0/d0, we would only walk:
commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0
entry]
From this, we can determine the TREESAME relationship by parsing four
objects without parsing all contents below d0/d0/d0.
The reason we have exponential behavior in `git rev-list` is because
we are calling diff_tree_oid() in tree-diff.c recursively without
short-circuiting on equal OIDs.
I will prepare a patch that adds this OID-equal short-circuit to avoid
this exponential behavior. I'll model my patch against a similar patch
in master:
commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees:
avoid duplicate ODB lookups during checkout
It will also significantly speed up rev-list calls for short paths in
deep repositories. It will not be very measurable in the git or Linux
repos because their shallow folder structure.
Thanks,
-Stolee
Since I don't understand enough about the consumers to diff_tree_oid()
(and the fact that the recursive behavior may be wanted in some cases),
I think we can fix this in builtin/rev-list.c with this simple diff:
---
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const
char *prefix)
git_config(git_default_config, NULL);
init_revisions(&revs, prefix);
+
+ revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+
revs.abbrev = DEFAULT_ABBREV;
revs.commit_format = CMIT_FMT_UNSPECIFIED;
argc = setup_revisions(argc, argv, &revs, NULL);