Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 12/29/2019 1:24 AM, Jeff King wrote:
> On Fri, Dec 27, 2019 at 11:11:37AM -0500, Derrick Stolee wrote:
> 
>>> So here are a few patches to reduce the CPU and memory usage. They could
>>> be squashed in at the appropriate spots, or perhaps taken as inspiration
>>> if there are better solutions (especially for the first one).
>>
>> I tested these patches with the Linux kernel repo and reported my results
>> on each patch. However, I wanted to also test on a larger internal repo
>> (the AzureDevOps repo), which has ~500 commits with more than 512 changes,
>> and generally has larger diffs than the Linux kernel repo.
>>
>> | Version  | Time   | Memory |
>> |----------|--------|--------|
>> | Garima   | 16m36s | 27.0gb |
>> | Peff 1   | 6m32s  | 28.0gb |
>> | Peff 2   | 6m48s  |  5.6gb |
>> | Peff 3   | 6m14s  |  4.5gb |
>> | Shortcut | 3m47s  |  4.5gb |
>>
>> For reference, I found the time and memory information using
>> "/usr/bin/time --verbose" in a bash script.
> 
> Thanks for giving it more exercise. My heap profiling was done with
> massif, which measures the heap directly. Measuring RSS would cover
> that, but will also include the mmap'd packfiles. That's probably why
> your linux.git numbers were slightly higher than mine.

That's interesting. I initially avoided massif because it is so much
slower than /usr/bin/time. However, the inflated numbers could be
explained by that. Also, the distinction between mem_heap and
mem_heap_extra may have interesting implications. Looking online, it
seems that large mem_heap_extra implies the heap is fragmented from
many small allocations.

Here are my findings on the Linux repo:

| Version  | mem_heap | mem_heap_extra |
|----------|----------|----------------|
| Peff 1   |  6,500mb |          913mb |
| Peff 2   |  3,100mb |          286mb |
| Peff 3   |    781mb |          235mb |

These numbers more closely match your numbers (in sum of the two
columns).

> (massif is a really great tool if you haven't used it, as it also shows
> which allocations were using the memory. But it's part of valgrind, so
> it definitely doesn't run on native Windows. It might work under WSL,
> though. I'm sure there are also other heap profilers on Windows).

I am using my Linux machine for my tests. Garima is using her Windows
machine.

>> By "Shortcut" in the table above, I mean the following patch on top of
>> Garima's and Peff's changes. It inserts a max_changes option into struct
>> diff_options to halt the diff early. This seemed like an easier change
>> than creating a new tree diff algorithm wholesale.
> 
> Yeah, I'm not opposed to a diff feature like this.
> 
> But be careful, because...
> 
>> diff --git a/diff.h b/diff.h
>> index 6febe7e365..9443dc1b00 100644
>> --- a/diff.h
>> +++ b/diff.h
>> @@ -285,6 +285,11 @@ struct diff_options {
>>  	/* Number of hexdigits to abbreviate raw format output to. */
>>  	int abbrev;
>>  
>> +	/* If non-zero, then stop computing after this many changes. */
>> +	int max_changes;
>> +	/* For internal use only. */
>> +	int num_changes;
> 
> This is holding internal state in diff_options, but the same
> diff_options is often used for multiple diffs (e.g., "git log --raw"
> would use the same rev_info.diffopt over and over again).
> 
> So it would need to be cleared between diffs. There's a similar feature
> in the "has_changes" flag, though it looks like it is cleared manually
> by callers. Yuck.

You're right about this. What if we initialize it in diff_tree_paths()
before it calls the recursive ll_difF_tree_paths()?

> This isn't a problem for commit-graph right now, but:
> 
>   - it actually could be using a single diff_options, which would be
>     slightly simpler (it doesn't seem to save much CPU, though, because
>     the initialization is relatively cheap)
> 
>   - it's a bit of a subtle bug to leave hanging around for the next
>     person who tries to use the feature
> 
> I actually wonder if this could be rolled into the has_changes and
> diff_can_quit_early() feature. This really just a generalization of that
> feature (which is like setting max_changes to "1").

I thought about this at first, but it only takes a struct diff_options
right now. It does have an internally-mutated member (flags.has_changes)
but it also seems a bit wrong to add a uint32_t of the count in this.
Changing the prototype could be messy, too.

There are also multiple callers, and limiting everything to tree-diff.c
limits the impact.

Thanks,
-Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux