Here is my GSoC 2021 proposal draft v2. The web version is here:
https://docs.google.com/document/d/119k-Xa4CKOt5rC1gg1cqPr6H3MvdgTUizndJGAo1Erk/edit

Comments and corrections are welcome :)

-------8<---------

Use ref-filter formats in git cat-file

About Me

Name        ZheNing Hu
Major       Computer Science And Technology
Mobile no.  +86 15058356458
Email       adlternative@xxxxxxxxx
IRC         adlternative (on #git-devel/#git@freenode)
Github      https://github.com/adlternative/
Blogs       https://adlternative.github.io/
Time Zone   CST (UTC +08:00)

Education & Background

I am currently a second-year student majoring in computer science and technology at Xi'an University of Posts & Telecommunications (China). In my freshman year, I joined the XiYou Linux Group of the university and learned how to use Git to submit my own code to GitHub.

I have learned C, C++, Python and shell over the past two years. I know how to debug with gdb, and I am familiar with Linux system programming and Linux network programming.

I started studying the Git source code and contributing to Git in December 2020.

Me & Git

Around last November, I found a couple of projects on GitHub[1] that taught me how to write a simple Git. The mechanics of Git are very interesting:

1. There are four types of objects in Git: blob, tree, commit, tag.
2. Loose objects (with the SHA-1 hash algorithm) are stored in ".git/objects/sha1[0-1]/sha1[2-39]", with the SHA-1 value of the object data as the storage address.
3. All branches are just references to commits.
...

Then I read 《Pro Git》 and Jiang Xin's 《Git Authoritative Guide》, and learned how to use most Git subcommands.

Later, I started studying some of the Git source code. I found that Git has at least 200,000 lines of C code and 200,000 lines of shell script code, which left me a little confused about where to start.
But then, after I submitted my first patch, many people in the Git community gave me very enthusiastic guidance, which gave me the courage to keep studying the Git source code. After that I started making my own contributions. You can find them here: [2][3]

These patches are in the Git master branch:

[master]

difftool.c: learn a new way start at specified file
https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@xxxxxxxxx/

ls-files.c: add --deduplicate option
https://lore.kernel.org/git/384f77a4c188456854bd86335e9bdc8018097a5f.1611485667.git.gitgitgadget@xxxxxxxxx/

ls_files.c: consolidate two for loops into one
https://lore.kernel.org/git/f9d5e44d2c08b9e3d05a73b0a6e520ef7bb889c9.1611485667.git.gitgitgadget@xxxxxxxxx/

ls_files.c: bugfix for --deleted and --modified
https://lore.kernel.org/git/8b02367a359e62d7721b9078ac8393a467d83724.1611485667.git.gitgitgadget@xxxxxxxxx/

builtin/*: update usage format
https://lore.kernel.org/git/d3eb6dcff1468645560c16e1d8753002cbd7f143.1609944243.git.gitgitgadget@xxxxxxxxx/

format-patch: allow a non-integral version numbers
https://lore.kernel.org/git/pull.885.v10.git.1616497946427.gitgitgadget@xxxxxxxxx/

[GSOC] commit: add --trailer option
https://lore.kernel.org/git/pull.901.v14.git.1616507757999.gitgitgadget@xxxxxxxxx/

These patches are still in progress:

[wip]

gitk: add right-click context menu for tags
https://lore.kernel.org/git/pull.866.v5.git.1614227923637.gitgitgadget@xxxxxxxxx/

[GSOC] trailer: add new .cmd config option
https://lore.kernel.org/git/3dc8983a47020fb417bb8c6c3d835e609b13c155.1617975462.git.gitgitgadget@xxxxxxxxx/

[GSOC] docs: correct descript of trailer.<token>.command
https://lore.kernel.org/git/505903811df83cf26f4dd70c5b811dde169896a2.1617975462.git.gitgitgadget@xxxxxxxxx/

[GSOC] ref-filter: get rid of show_ref_array_item
https://lore.kernel.org/git/pull.927.v2.git.1617809209164.gitgitgadget@xxxxxxxxx/

Proposed Project

Current situation

Git used to have an old problem of duplicated
implementations of some logic. For example, Git had at least 4 different implementations to format command output for different commands, e.g. `git cat-file --batch="%(objectname)"`, `git log --pretty="%aN"`, `git for-each-ref --format="%(refname)"`.

Which implementations have been merged together?

2018 ~ 2019 Olga Telezhnaia: Reuse ref-filter formatting logic in `git cat-file`

Olga integrated some `git cat-file` logic into `ref-filter`, so almost all format atoms of `git cat-file` are now available in `git for-each-ref`, e.g. `git for-each-ref --format="%(objectsize:disk) %(deltabase)"`.

2020 ~ 2021 Hariom Verma: Unify ref-filter formats with other --pretty formats

Hariom migrated some of the `--pretty` logic to `ref-filter`, e.g. `git for-each-ref --format="%(trailers:key=Signed-off-by)"` or `git for-each-ref --format="%(subject:sanitize)"`.

What's git cat-file?

`git cat-file` is a Git subcommand used to show information about a Git object.

`git cat-file --batch` reads object names from stdin and prints each object's information and contents to stdout.

`git cat-file --batch-check` reads object names from stdin and prints each object's information to stdout.

`--batch-all-objects`, combined with `--batch` or `--batch-check`, shows information about all objects in the repository.

`--batch-check` and `--batch` both accept a custom format that can contain the following placeholders (see [4]):

%(objectname)
The full hex representation of the object name.

%(objecttype)
The type of the object.

%(objectsize)
The size, in bytes, of the object.

%(objectsize:disk)
The size, in bytes, that the object takes up on disk.

%(deltabase)
If the object is stored as a delta on-disk, this expands to the full hex representation of the delta base object name. Otherwise, expands to the null OID (all zeroes).

%(rest)
If this atom is used in the output string, input lines are split at the first whitespace boundary.
All characters before that whitespace are considered to be the object name; characters after that first run of whitespace (i.e., the "rest" of the line) are output in place of the `%(rest)` atom.

What's the original design of git cat-file --batch?

1. `expand_format()` is first called in `batch_objects()` to parse the format atoms; this determines what data needs to be collected.
2. The object name is read from standard input and resolved to the object's oid by `get_oid_with_context()`.
3. In `batch_object_write()`, `oid_object_info_extended()` obtains the object information that is needed.
4. `expand_format()` is called a second time in `batch_object_write()` to format the actual items into a string buffer; the contents of this buffer are eventually printed to standard output.

What are the disadvantages in git cat-file --batch?

The atom format-parsing stage and the stage that formats the actual items are not separated yet: both go through the same `expand_format()`. This limits the ability of `git cat-file --batch` to support richer formats like those of `git for-each-ref` or `git log --pretty`.

Why was Olga's solution rejected?

1. Olga's solution was to let `git cat-file` use the `ref-filter` interface, but the performance of `cat-file` appeared to degrade because `ref-filter` is "very eager to allocate lots of separate strings", among other reasons.
2. Olga then tried optimizing `ref-filter`, but the performance of `git cat-file` was still not as good as with the previous implementation.
3. The patch series was too long, which made it difficult to adjust and merge.
4. Is "%(rest)" worth migrating? "%(rest)" exists for `git cat-file --batch`, which reads its input lines from standard input and prints back anything after the first whitespace on each line. This atom is quite unnecessary for `git for-each-ref`, which does not read standard input.

My possible solution

1. Analyze how to get the data that `oid_object_info_extended()` cannot provide directly, and analyze the minimum amount of data required for each step of atom format parsing.
2.
Find a uniform way to parse formats such as `%an` in `log --pretty` or `%(authorname)` in `ref-filter` (it might learn something from `git config`, or could try using abstract syntax trees for format atom parsing).
3. Apply the new interface to `git cat-file`, and then add richer options to `git cat-file`.
4. (Optional optimization) Change the strbuf allocation strategy of `ref-filter`: use a single strbuf for the output of all refs. This should improve performance by reducing the overhead of allocating large numbers of small strbufs.
5. (Optional optimization) In addition, if we migrate `cat-file` to `ref-filter` relying only on improved `ref-filter` performance, we need to isolate the atoms that are not applicable to `cat-file`. For example, `refname` is not useful for `git cat-file`; we can exit the program with `die()` or just print an error message.

Are you applying for other Projects?

No, Git is the only one.

Blogging about Git

While studying the Git source code, I often write blog posts[5] to record what I have learned; this helps me recall the material after I have forgotten it. Most of the posts so far were written in Chinese, but during GSoC I promise that all my posts will be written in English.

Time Line

May 18 ~ June 18
1. Learn the details of atom format parsing in `ref-filter.c` and `pretty.c`. Think about how to combine the two different atom formats in a single parser. (For example, how can we use abstract syntax trees to organize different atoms?)
2. Analyze and optimize `ref-filter` performance.
3. Discuss a reasonable design for uniform format parsing with the mentors, and then start coding it.

June 18 ~ July 18
1. Continue to integrate the atom format parsing and apply it to `pretty.c` and `ref-filter.c`.
2. Make sure that, under the new format parsing interface, the performance of `git for-each-ref` and `git log --pretty` is better than with the previous implementations.

July 18 ~ August 17
1.
Let `git cat-file` use the new and better format parsing interface.
2. Support more options for `cat-file --batch` and make sure atoms that belong to other commands stay isolated from it.

Availability

I have plenty of time before and after my final exam, and I have enough energy to complete the daily tasks. I stay active on the Git mailing list; you can find me at any time as long as I am not sleeping. :)

Post GSoC

I love the open source philosophy. I am willing to spread the spirit of openness and freedom, and to research technology with like-minded people.

In my contact with the Git community over the past few months, many people gave me great encouragement. I hope I can keep my passion for Git alive, contribute my own code, and pass this cool thing on.

I am willing to keep contributing code to the Git community for a long time after the end of GSoC.

I hope the Git community can give me a chance to participate in GSoC.

I sincerely thank GSoC and the Git community!

________________

[1] https://github.com/danistefanovic/build-your-own-x#build-your-own-git
[2] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Aadlternative+
[3] https://git.kernel.org/pub/scm/git/git.git/log/?qt=grep&q=ZheNing+Hu
[4] https://github.com/gitgitgadget/git/blob/89b43f80a514aee58b662ad606e6352e03eaeee4/Documentation/git-cat-file.txt#L189
[5] https://adlternative.github.io/tags/git/
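
Appendix: a rough sketch of separated parsing and formatting

To make the point under "What are the disadvantages in git cat-file --batch?" more concrete, here is a minimal sketch of what "parse the format once, then format every object from the parsed result" could look like. It is written in Python only to keep it short. It is not Git code and not a proposed implementation: `parse_format`, `format_object`, `ParsedFormat`, the reduced atom list and the sample object values are all hypothetical and exist only for this illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical, reduced atom list for this sketch only.
KNOWN_ATOMS = {"objectname", "objecttype", "objectsize", "rest"}

ATOM_RE = re.compile(r"%\(([^)]*)\)")


@dataclass
class ParsedFormat:
    segments: list  # ("literal", text) or ("atom", name), in output order
    needed: set     # atom names the format uses, i.e. the data to collect


def parse_format(fmt):
    """Stage 1: run once, before any object is read."""
    segments, needed, pos = [], set(), 0
    for m in ATOM_RE.finditer(fmt):
        if m.start() > pos:
            segments.append(("literal", fmt[pos:m.start()]))
        name = m.group(1)
        if name not in KNOWN_ATOMS:
            # An atom that does not apply to this command is rejected early.
            raise ValueError(f"unknown format atom: {name}")
        segments.append(("atom", name))
        needed.add(name)
        pos = m.end()
    if pos < len(fmt):
        segments.append(("literal", fmt[pos:]))
    return ParsedFormat(segments, needed)


def format_object(parsed, info):
    """Stage 2: run once per object; never re-parses the format string."""
    return "".join(
        value if kind == "literal" else str(info[value])
        for kind, value in parsed.segments
    )


parsed = parse_format("%(objectname) %(objecttype) %(objectsize)")
line = format_object(
    parsed,
    {
        # Made-up sample values, not taken from a real repository.
        "objectname": "0123456789abcdef0123456789abcdef01234567",
        "objecttype": "blob",
        "objectsize": 6,
    },
)
print(line)
```

In this sketch, `parsed.needed` plays the role of "what data we need to capture", and the early `ValueError` corresponds to rejecting atoms such as `refname` that make no sense for `git cat-file`.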