On Sun, Jun 13, 2021 at 4:17 PM ZheNing Hu <adlternative@xxxxxxxxx> wrote:

> In addition, some scripts like `printf "%b" "a\0b\0c" >blob1` will
> be truncated at the first NUL on a 32-bit machine, but it works
> well on 64-bit machines, where the NUL bytes are stored in the
> file as expected. Before I used an Ubuntu32 docker container to
> clone the git repository and analyze the bug in depth, this made
> me think that Git's file decompression had an error on 32-bit
> machines. In the end, I used `printf "a\0b\0c"` so that 32-bit
> machines do not truncate at the NUL. Is there a better way to
> write binary data to a file than `printf` and `echo`?

You might want to take a look at t/t4058-diff-duplicates.sh which has
the following:

# make_tree_entry <mode> <path> <sha1>
#
# We have to rely on perl here because not all printfs understand
# hex escapes (only octal), and xxd is not portable.
make_tree_entry () {
	printf '%s %s\0' "$1" "$2" &&
	perl -e 'print chr(hex($_)) for ($ARGV[0] =~ /../g)' "$3"
}

> Since I am a newbie to docker, I would like to know if there is
> any way to run Git's GitHub CI program remotely or locally?

There are scripts in the ci/ directory, but yeah, it could help if
there was a README there.

> In the second half of this week, I tried to make `cat-file` reuse
> the logic of `ref-filter`. I have to say that this was a very
> difficult process: running "rebase -i" again and again to repair
> the content of previous commits, squashing commits, splitting
> commits, modifying commit messages... Finally, I submitted the
> patches to the Git mailing list as
> [[PATCH 0/8] [GSOC][RFC] cat-file: reuse `ref-filter`
> logic](https://lore.kernel.org/git/pull.980.git.1623496458.gitgitgadget@xxxxxxxxx/).
> Now `cat-file` has learned most of the atoms in `ref-filter`. I am
> very happy to be able to make Git support richer functionality
> through my own code.
>
> Regrettably, `git cat-file --batch --batch-all-objects` seems to
> take up a huge amount of memory on a large repo such as git.git,
> and it will be killed by Linux's OOM killer.

In the cover letter of your patch series you say:

"There is still an unresolved issue: performance overhead is very
large, so that when we use:

git cat-file --batch --batch-all-objects >/dev/null

on git.git, it may fail."

Is this the same issue? Is it only a memory issue, or is your patch
series also making things slower?

> This is mainly because we make a large number of copies of the
> object's raw data. The original `git cat-file` uses
> `read_object_file()` or `stream_blob()` to output the object's raw
> data, but in `ref-filter` we have to use `v->s` to copy the
> object's data. It is difficult to eliminate `v->s` and print the
> output directly to the final output buffer, because we may have
> atoms like `%(if)` and `%(else)` that need to use buffers on the
> stack to build the final output string layer by layer,

What does "layer by layer" mean here?
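If it means that each `%(if)` opens a fresh buffer whose contents are
only merged into the enclosing buffer once the condition is known,
then I see why streaming the output directly is hard. Here is a toy
sketch of that idea; it is only loosely inspired by the formatting
stack in ref-filter.c, and all names in it are mine, not Git's:

/*
 * Hypothetical sketch: each nesting level (%(if) ... %(end))
 * accumulates output in its own buffer; %(end) decides whether to
 * merge it into the enclosing level.  No bounds checking, to keep
 * the sketch short.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct layer {
	char buf[256];		/* output accumulated at this level */
	size_t len;
	struct layer *prev;	/* enclosing level */
};

static struct layer *push_layer(struct layer *top)
{
	struct layer *l = calloc(1, sizeof(*l));
	l->prev = top;
	return l;
}

static struct layer *pop_layer(struct layer *top, int condition)
{
	struct layer *prev = top->prev;
	if (condition) {
		/* merge this level's output into the enclosing level */
		memcpy(prev->buf + prev->len, top->buf, top->len);
		prev->len += top->len;
	}
	free(top);
	return prev;
}

static void emit(struct layer *top, const char *s)
{
	size_t n = strlen(s);
	memcpy(top->buf + top->len, s, n);
	top->len += n;
}

int main(void)
{
	struct layer *top = push_layer(NULL);	/* base level */

	emit(top, "refname");
	top = push_layer(top);			/* %(if) opens a new layer */
	emit(top, " [has upstream]");
	top = pop_layer(top, 1);		/* %(end): keep only if true */

	printf("%.*s\n", (int)top->len, top->buf);
	free(top);
	return 0;
}

The point being: the inner buffer cannot be written to the final
output until a later condition is evaluated, so intermediate copies
seem unavoidable in that design.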
> and `cmp_ref_sorting()` needs to use `v->s` to compare two refs.
> In short, it is very difficult for `ref-filter` to reduce the copy
> overhead. I even thought about using the string pool API
> `memintern()` to replace `xmemdupz()`, but the effect does not
> seem significant: a large number of objects' data would still
> reside in memory, so this may not be a good method.

Would it be possible to keep the data for a limited number of
objects, then print everything related to these objects, free their
data, and start again with another limited number of objects? (See
the toy sketch at the end of this mail.)

> Anyway, stay confident. I can solve these difficult problems with
> the help of mentors and reviewers. :)

Sure :-)
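To illustrate what I mean by a chunked approach, here is a
self-contained toy sketch. Everything in it (fake_read_object() and
the constants) is a hypothetical stand-in, not Git code; the point is
only that peak memory is bounded by the chunk size instead of the
total number of objects:

#include <stdio.h>
#include <stdlib.h>

#define CHUNK 2		/* tiny for illustration; would be larger in practice */
#define NR_OBJECTS 5

/* stand-in for reading an object's raw data from the object store */
static char *fake_read_object(int id)
{
	char *data = malloc(32);
	snprintf(data, 32, "data-of-object-%d", id);
	return data;
}

int main(void)
{
	for (int start = 0; start < NR_OBJECTS; start += CHUNK) {
		int end = start + CHUNK < NR_OBJECTS ? start + CHUNK : NR_OBJECTS;
		char *data[CHUNK];

		/* 1. load a bounded number of objects */
		for (int i = start; i < end; i++)
			data[i - start] = fake_read_object(i);

		/* 2. print everything related to these objects */
		for (int i = start; i < end; i++)
			printf("%s\n", data[i - start]);

		/*
		 * 3. free their data before loading the next chunk,
		 * so peak memory stays bounded by CHUNK, not
		 * NR_OBJECTS.
		 */
		for (int i = start; i < end; i++)
			free(data[i - start]);
	}
	return 0;
}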