On Wed, Sep 20, 2017 at 03:48:26PM -0700, Jonathan Nieder wrote: > > Subject: [PATCH] revision: replace "struct cmdline_pathspec" with argv_array > > > > We assemble an array of strings in a custom struct, > > NULL-terminate the result, and then pass it to > > parse_pathspec(). > > > > But then we never free the array or the individual strings > > (nor can we do the latter, as they are heap-allocated when > > they come from stdin but not when they come from the > > passed-in argv). > > To be devil's advocate for a moment: why don't we use UNLEAK on the > paths passed in from stdin? > > It's true that there can be an unbounded number of them, but they all > coexist in memory anyway. They are not generated dynamically on the > fly. Being able to free them doesn't have any obvious advantage over > using exit(). > > Except... is the idea that this allows the strings from stdin to be > freed sooner, as soon as they have been parsed into a "struct > pathspec"? Well, no...the idea is that this is a function which leaks a bunch of memory, and we shouldn't have to think hard about how often its leak can be triggered or how severe it is. We should just fix it. I agree with your analysis that we're likely only to leak one set of --stdin input per program invocation (modulo some caller re-opening stdin). And I also agree that it's a waste of memory to hold onto leaked heap that cannot be used. But mostly I am fundamentally against using UNLEAK() in a case like this, because it does not match either of the properties which justified adding UNLEAK() in the first place: 1. We are about to exit the program, so the "leak" is only caused by the memory going out of scope at that exit. By contrast, the revision machinery may be called many times in the same program. 2. The memory remains useful until around the time of program exit. This most certainly does not, as it would not even be reachable. So while it may not do _too_ much harm, aside from increasing peak heap unnecessarily, it's not a precedent I'd like to set. And certainly not when we can fix the leak and reduce the code size at the same time. > That sounds appealing. The only risk would be if "struct pathspec" > holds on to a pointer to one of these strings. > > Fortunately parse_pathspec() is careful to strdup any strings it > borrows (though it is not documented to do so). A lot of the early interfaces in Git had really confusing memory-ownership semantics. I think we've been slowly moving over the years towards interfaces that spend a bit on extra copies to give simple and consistent semantics (in fact, it was one of the reasons I added argv_array in the first place). So yes, it's good to double check that parse_pathspec() doesn't want to hold on to the pointers. But I also think that should be the default we're striving for in our APIs. > In other words, proposed changes: > > 1. Could the commit message describe what effect this would have on > maximum heap usage, if any? (In qualitative terms is fine, though > actual numbers would be even better if it's easy to get them.) > That would make it easier to justify not using UNLEAK. What wording are you looking for? It was a leak, and now it's gone. The size of the leak depends on how much you feed to --stdin. IMHO using UNLEAK is totally inappropriate for this case, and doesn't even seem like an alternative worth rejecting. > 2. Can parse_pathspec get a comment in pathspec.h saying that it > defensively copies anything it needs from args so the caller is > free to modify or free it? That way, it should be more obvious > to people in the future modifying parse_pathspec() that callers > may rely on that. (The current API comment describes argv as > "command line arguments", which I fear would send the opposite > message to implementors.) I certainly agree that the pathspec interface could use better documentation. Patches welcome? :) -Peff