Re: GSoC Git Proposal Draft - ZheNing Hu

Hi, Christian,

Christian Couder <christian.couder@xxxxxxxxx> wrote on Fri, Apr 2, 2021 at 10:57 PM:
>
> Hi,
>
> On Fri, Apr 2, 2021 at 11:03 AM ZheNing Hu <adlternative@xxxxxxxxx> wrote:
> >
> > Hello, Git,
> > I'm ZheNing Hu,
> > Here is my GSoC 2021 Proposal draft.
> > And website version is there :
> > https://docs.google.com/document/d/119k-Xa4CKOt5rC1gg1cqPr6H3MvdgTUizndJGAo1Erk/edit
> >
> > Welcome any Comments and Correct :)
>
> Thanks!
>
> > ----8<----
> > ## Use ref-filter formats in git cat-file
> >
> > ### About Me
> > | Name | ZheNing Hu |
> > | ---------- | ------------------------------------------ |
> > | Major | Computer Science And Technology |
> > | Mobile no. | +86 15058356458 |
> > | Email | adlternative@xxxxxxxxx |
> > | IRC | adlternative (on #git-devel/#git@freenode) |
> > | Github | https://github.com/adlternative/ |
> > | Blogs | https://adlternative.github.io/ |
> > | Time Zone | CST (UTC +08:00) |
> >
> > ### Education & Background
> > * I am currently a 2nd Year Student majoring in computer science and
> > technology in Xi'an University of Posts & Telecommunications (China).
> > * In my freshman year, I joined the XiYou Linux Group of the
> > university and learned how to use Git to submit my own code to GitHub.
> > I have learned C, C++, Python and shell in two years, I know how to
> > use gdb debugging, and I am familiar with relevant knowledge of Linux
> > System Programming and Linux Network Programming.
> > * I started learning Git source code and made contributions to Git
> > from December of 2020.
> >
> > ### Me & Git
> > Around last November, I found a couple of projects
> > [build-your-own-git](https://github.com/danistefanovic/build-your-own-x#build-your-own-git)
> > on GitHub teaching me how to write a simple git, the mechanics of Git
> > are very interesting:
> >
> > 1. There are four types of objects in Git: BLOB, TREE, COMMIT, TAG
> > 2. The (loose)objects are stored in `.git/object/sha1[0-1]/sha1[2-39]`
> > with the sha1 value of the data as the storage address.
> > 3. All branches are just references to commits.
> >
> > Then I read`《Pro Git》`and Jiang Xin's `《Git Authoritative Guide》`,
> > learned the use of most Git subcommands.
> >
> > Later, I started learning some of the Git source code, I found Git has
> > at least 200,000 lines of C code and 200,000 lines of shell script
> > code, which leaves me a little confused about where to start.
> >
> > But then, after I submitted my first patch, a lot of people in the Git
> > community came over and gave me very enthusiastic guidance, which gave
> > me the courage to learn the Git source code, and then I started making
> > my own contributions, You can find them here:
> > [gitgitgadget](https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Aadlternative+)
> > or
> > [git.kernel.org](https://git.kernel.org/pub/scm/git/git.git/log/?qt=grep&q=ZheNing+Hu)
> >
> >
> > These patches have been merged into the "master" branch:
> >
> > #### [master]
> > * difftool.c: learn a new way start at specified file [(mail
> > list)](https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@xxxxxxxxx/)
> > * ls-files.c: add --deduplicate option
> > [(mail list)](https://lore.kernel.org/git/384f77a4c188456854bd86335e9bdc8018097a5f.1611485667.git.gitgitgadget@xxxxxxxxx/)
> > * ls_files.c: consolidate two for loops into one
> > [(mail list)](https://lore.kernel.org/git/f9d5e44d2c08b9e3d05a73b0a6e520ef7bb889c9.1611485667.git.gitgitgadget@xxxxxxxxx/)
> > * ls_files.c: bugfix for --deleted and --modified
> > [(mail list)](https://lore.kernel.org/git/8b02367a359e62d7721b9078ac8393a467d83724.1611485667.git.gitgitgadget@xxxxxxxxx/)
> > * builtin/*: update usage format
> > [(mail list)](https://lore.kernel.org/git/d3eb6dcff1468645560c16e1d8753002cbd7f143.1609944243.git.gitgitgadget@xxxxxxxxx/)
> >
> > And These patches are in the queue:
> >
> > #### [next]
> >
> > * format-patch: allow a non-integral version numbers
> > [(mail list)](https://lore.kernel.org/git/pull.885.v10.git.1616497946427.gitgitgadget@xxxxxxxxx/)
> > * [GSOC] commit: add --trailer option
> > [(mail list)](https://lore.kernel.org/git/pull.901.v14.git.1616507757999.gitgitgadget@xxxxxxxxx/)
> >
> > #### [WIP]
> >
> > * gitk: add right-click context menu for tags
> > [(mail list)](https://lore.kernel.org/git/pull.866.v5.git.1614227923637.gitgitgadget@xxxxxxxxx/)
> > * [GSOC] trailer: pass arg as positional parameter
> > [(mail list)](https://lore.kernel.org/git/5894d8c4b36466326b0427bfda0d6981e52a0907.1617185147.git.gitgitgadget@xxxxxxxxx/)
>
> Great!
>
> > ### Proposed Project
> >
> > * Git used to have an old problem of duplicated implementations of
> > some logic. For example, Git had at least 4 different implementations
> > to format command output for different commands.
>
> What's the current status? Which implementations have been merged
> together since that time?
>

Currently, at least three separate implementations remain:
`git cat-file` uses `expand_format()`,
`git for-each-ref` uses `format_ref_array_item()`, and
`git log --pretty` uses `format_commit_one()` to expand format strings.
There may be more.

As far as I understand, the following `cat-file` atoms already have
equivalent implementations in `ref-filter`:

%(objectsize) %(objecttype) %(objectname) %(deltabase) %(objectsize:disk)

`cat-file --batch` also prints the object contents implicitly, which
corresponds to %(contents); that is already implemented in `ref-filter` too.

All of them can now be used with `git for-each-ref`.

At the same time, some of the features in 'pretty.c' can also be found
in 'ref-filter.c'. For example:

`--pretty=%s`  corresponds to %(subject)
`--pretty=%f`  corresponds to %(subject:sanitized)
`--pretty=%aN` corresponds to %(authorname)
`--pretty=%b`  corresponds to %(body)
...
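
To make the correspondence concrete, here is a small Python sketch (not
Git's C code) that rewrites the four `--pretty` placeholders listed above
into `ref-filter` atoms. The table `PRETTY_TO_REF_FILTER` and the function
`translate_pretty()` are hypothetical names used only for illustration,
the table holds only the pairs from my list, and escapes such as `%%`
are not handled.

```python
# Hypothetical lookup table: only the four pairs listed in this mail.
PRETTY_TO_REF_FILTER = {
    "%s": "%(subject)",
    "%f": "%(subject:sanitized)",
    "%aN": "%(authorname)",
    "%b": "%(body)",
}


def translate_pretty(fmt):
    """Rewrite known --pretty placeholders into ref-filter atoms."""
    # Try longer placeholders first so a three-character one is matched whole.
    keys = sorted(PRETTY_TO_REF_FILTER, key=len, reverse=True)
    out = []
    i = 0
    while i < len(fmt):
        for key in keys:
            if fmt.startswith(key, i):
                out.append(PRETTY_TO_REF_FILTER[key])
                i += len(key)
                break
        else:
            # Not a known placeholder: copy the character unchanged.
            out.append(fmt[i])
            i += 1
    return "".join(out)
```

For instance, `translate_pretty("%aN: %s")` yields
`%(authorname): %(subject)`.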

On the other hand, after Olga's solution was rejected, `git cat-file`
never started using the logic in 'ref-filter' directly. That is why we
now see two similar 'struct expand_data' definitions, one in 'cat-file.c'
and one in 'ref-filter.c'. Still, Olga made many useful changes in
ref-filter, such as `grab_common_values()`, which supports a variety of
different atoms.

> > * `git cat-file` is a git subcommand used to see information about a git object.
> >
> > * `git cat-file --batch` can print object information and contents on
> > stdin.
>
> It reads from stdin and prints on stdout.
>
> > The only difference between `--batch-check` and `--batch` is
> > that `--batch-check` does not print the contents of the object.
> > * `--batch-all-objects` will show all objects with `--batch` or `--batch-check`.
> > * `--batch-check` and `--batch` both accept formatted strings:
>
> It might be better to say that they accept a custom format that can
> have placeholders like the following:
>
> > * `%(objectname)`: 40-bit SHA1 string of Git object
>
> Git is being worked on to be able to use SHA-256 as well as SHA1.
>

Yes, one of my classmates was worried about the security of Git using
SHA1, and I told him that Git is already making changes.

> > * `%(objecttype)`: Object Type blob,tree,commit,tag
> > * `%(objectsize)`: Size of the object's content
> > * `%(objectsize:disk)`: The size of the object itself on disk
> > * `%(delatbase)`: If the object is stored incrementally in Git,
>
> s/delatbase/deltabase/
>
> > Returns the SHA1 string for its delabase
>
> s/delabase/deltabase/
>

Thanks for the corrections above.

> Also see above about SHA1 and SHA256.
>
> > * `%(rest)`: Anything before the space and TAB in the input
> > line is treated as an object, and anything after
> > that will be printed as usual
>
> In general it's ok to copy some parts of the doc if they are important
> for your proposal as long as you say that it comes from the doc. It's
> also ok with rephrasing parts of it, to adapt them or make sure you
> understand them though.
>

It would probably be better to use the wording from the documentation.
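
For reference, the `%(rest)` behaviour described in the `git cat-file`
documentation (the input line is split at the first whitespace boundary;
what comes before is the object name, what comes after the run of
whitespace is the "rest") can be sketched in Python. This is only an
illustration under that reading of the doc, and `split_object_and_rest()`
is a hypothetical name, not a Git function.

```python
def split_object_and_rest(line):
    """Split an input line into (object name, rest) at the first space or tab."""
    for i, ch in enumerate(line):
        if ch in " \t":
            # Skip the whole run of spaces/tabs before the rest starts.
            j = i
            while j < len(line) and line[j] in " \t":
                j += 1
            return line[:i], line[j:]
    # No whitespace at all: the whole line is the object name.
    return line, ""
```

So an input line such as `HEAD:Makefile  my note` would give the object
name `HEAD:Makefile` and the rest `my note`.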

> > * In the original design, the first time use `expand_format()` in
> > `batch_objects()` is to parsing formatted messages, the second time
>
> s/parsing/to parse/
>
> I am not sure what you call "formatted messages".
>

I'm not good at expressing this. As you say, I mean a custom format that
can have placeholders such as '%(atom)'.

> > use `expand_format()` in `batch_object_write()` is to format the
> > object information and store it in a string buffer, eventually the
> > contents of this buffer will be printed to standard output.
> >
> >
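
The two uses of `expand_format()` described above can be pictured with a
short Python sketch: a first pass that only records which atoms the
format asks for, and a second pass that substitutes the values. This is a
simplified model, not the code in 'cat-file.c'; `expand_atoms()`,
`format_object()` and the `providers` mapping are hypothetical names, and
`%%` escapes are ignored.

```python
import re

# Matches a placeholder such as %(objectname) or %(objectsize:disk).
ATOM_RE = re.compile(r"%\(([^)]+)\)")


def expand_atoms(fmt, on_atom):
    """Replace every %(atom) in fmt with on_atom(atom_name)."""
    return ATOM_RE.sub(lambda m: on_atom(m.group(1)), fmt)


def format_object(fmt, providers):
    """Two-pass expansion; providers maps atom name -> zero-argument callable."""
    wanted = set()

    def mark(atom):
        if atom not in providers:
            raise ValueError("unknown atom: " + atom)
        wanted.add(atom)
        return ""

    # Pass 1: only record which atoms the format asks for.
    expand_atoms(fmt, mark)

    # Compute just the requested values (the potentially expensive step).
    values = {atom: providers[atom]() for atom in wanted}

    # Pass 2: substitute the values into the output string.
    return expand_atoms(fmt, lambda atom: values[atom])
```

The point of the first pass is that a value nobody asked for (for example
the on-disk size) is never computed.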
> > * [Olga](olyatelezhnaya@xxxxxxxxx) have been involved in integrating
> > `ref-filter` logic into `cat-file`
> > [(link)](https://github.com/git/git/pull/568), the problem with her
> > patches at that time:
> > 1. Too long patch series, difficult to adjust and merge.
> > 2. I don't think it's a good idea for her to use `struct
> > ref_array_item` instead of `struct expand_data` for `cat-file` to fit
> > `ref-filter` logic, because `struct ref_array_item` and `struct
> > expand_data` are not very related.
> > [(link)](https://github.com/git/git/pull/568/commits/e0aafaa76476ba5528f84b794043531ebd4633c7#diff-d03110606a7ed8cb9832bbcc572f1093435cc6115c4e58d7a7750af3c33319a7R238)
>
> Olga also sent patch series to the mailing list. Could you find them
> and tell what happened to them?
>

Peff tested the performance of Olga's `cat-file` and found that it
appeared to be degraded by using the ref-filter logic, which is "very
eager to allocate lots of separate strings".
[(link)](https://lore.kernel.org/git/20190228214112.GK12723@xxxxxxxxxxxxxxxxxxxxx/)

Olga added %(rest) to `for-each-ref`, but
Peff said he was not sure that for-each-ref should support %(rest).
[(link)](https://lore.kernel.org/git/20190228211122.GD12723@xxxxxxxxxxxxxxxxxxxxx/)

%(rest) does not seem useful for `for-each-ref`.
Peff thinks we should add an option to ref-filter to enable or disable
placeholders like "%(rest)" in places where they are not needed at all.
[(link)](https://lore.kernel.org/git/20190228210753.GC12723@xxxxxxxxxxxxxxxxxxxxx/)
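
One way to picture such an enable/disable option is a format check that
takes a per-command flag. The Python sketch below is only my own
illustration of the idea, not something from Peff's mail or from
'ref-filter.c'; `check_format()` and its `allow_rest` parameter are
hypothetical.

```python
import re

# Captures the atom name and ignores an optional ":modifier" part.
ATOM_RE = re.compile(r"%\(([^):]+)(?::[^)]*)?\)")


def check_format(fmt, allow_rest):
    """Return the atom names used in fmt; reject %(rest) unless allowed."""
    names = []
    for match in ATOM_RE.finditer(fmt):
        name = match.group(1)
        if name == "rest" and not allow_rest:
            raise ValueError("%(rest) is not supported by this command")
        names.append(name)
    return names
```

A `cat-file`-like caller would pass `allow_rest=True`, while a
`for-each-ref`-like caller would pass `allow_rest=False` and get an error
as soon as the format is parsed.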

Olga made `struct expand_data` global and put it in ref-filter.h.
Peff said `struct expand_data` may need a more descriptive name in the
global namespace.
[(link)](https://lore.kernel.org/git/20190228213015.GI12723@xxxxxxxxxxxxxxxxxxxxx/)

Olga's series also dealt with `mark_query`, and
Peff thinks `split_on_whitespace` or `mark_query` could be removed from
`struct expand_data` right away.
[(link)](https://lore.kernel.org/git/20190228212540.GF12723@xxxxxxxxxxxxxxxxxxxxx/)

> Also Hariom Verma worked on a related project recently. Could you talk
> a bit about it?
>

Hariom's work was to reuse the `ref-filter` logic in pretty.c|h.
I admit I may have neglected to look at his patches, but it seems that
he also once proposed a "pretty-lib.c|h" to make use of ref-filter features.
[(link)](https://public-inbox.org/git/a83270485be2bebb1ce77be55ff73d136b735922.1592218662.git.gitgitgadget@xxxxxxxxx/)
I may need more time to check what happened there.

> > * Because part of the feature of `git for-each-ref` is very similar to
> > that of `git cat-file`, I think `git cat-file` can learn some feasible
> > solutions from it.
> >
> > #### My possible solutions:
> >
> > 1. Same [solution](https://github.com/git/git/pull/568/commits/cc40c464e813fc7a6bd93a01661646114d694d76)
> > as Olga, add member `struct ref_format format` in `struct
> > batch_options`.
> > 2. Use the function
> > [`verify_ref_format()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L904)
> > to replace the first `expand_format()` for parsing format strings.
> > 3. Write a function like
> > [`format_ref_array_item()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L2392),
> > get information about objects, and use `get_object()` to grub the
> > information which we prefer (or just use `grab_common_value()`).
> > 4. The migration of `%(rest)` may require learning the handling of
> > `%(if)` ,`%(else)`.
>
> I will look at this later.
>
> > ### Are you applying for other Projects?
> >
> > No, Git is the only one.
> >
> > ### Blogging about Git
> >
> > In fact, while I am studying Git source code, I often write some
> > [blogs](https://adlternative.github.io/tags/git/) to record my
> > learning content, this helps me to recall some content after
> > forgetting it. Most of the blogs were written in Chinese previously,
> > but during the GSoC, I promise all my blogs will be written in
> > English.
> >
> > ### TimeLine
> > * May 18 ~ June 8
> > * Look for a scheme to make `git cat-file` and `ref-filter` more
> > compatible, and start the integration attempt.
> > * *Stretch Goal*: move `%(objectsize)`,`%(objecttype)`,`%(objectname)` .
> >
> > * June 8 ~ July 8
> > * Move the body of the `git cat-file` attempt to the `ref-filter`
> > logic, complete the basic function realization.
> > * *Stretch Goal*: move `%(deltabase)`,`%(objectsize:disk)`,`%(rest)` .
> >
> > * July 8 ~ August 17
> > * Analyze the performance of ref-filter and try to reduce the
> > performance cost of a lot of string matching. I thought if I had some
> > spare time, I could work on some other interesting patches.
> > * *Stretch Goal*: Optimize ref-filter performance.
>
> I will also look at the timeline later.
>
> > ### Availability
> > My exam is expected to end in June, but the time I don't have classes
> > before the final exam, as well as the summer vacation after that, is
> > basically my self-learning time. Although I am studying many other
> > courses, I have enough time and energy to complete daily tasks. I'm
> > staying active on the Git mailing list, you can find me at any time as
> > long as I am not sleeping. :)
> >
> >
> > ### Post GSoC
> > * I love open source philosophy, willing to spread the spirit of
> > openness, freedom and willing to research technology with like-minded
> > people.
> > * In my previous contact with the Git community in the past few
> > months, many people in the Git community gave me great encouragement.
> > I hope I can keep my passion for Git alive, contribute my own code,
> > and pass this cool thing on.
> > * I am willing to contribute code to the Git community for a long time
> > after the end of GSoC.
> > * I hope the Git community can give me a chance to participate in
> > GSoC. I sincerely thank GSoC and the Git community!
>
> Thanks!

Thanks :)



