Hello everyone, Here is the initial version of my proposal on "Reachability bitmap improvements". I would really love to have comments on it. If you find any mistakes please notify me about that. The proposal template is inspired by Hariom Verma's (This year's GSoC mentor) previous year's gsoc proposal. Thanks :) ============================================= View on Docs: https://docs.google.com/document/d/15tZ5QbMtYN41No5wdMw-2Ilp63vquB7OqAwBCLl_0n0/edit?usp=sharing ============================================== **Reachability bitmap improvements** Name : Abhradeep Chakraborty Major : Information Technology Mobile No. : [ ... ] Email : chakrabortyabhradeep79@xxxxxxxxx GitHub Profile: (Abhra303)[https://github.com/Abhra303] LinkedIn : www.linkedin.com/in/abhradeep-chakraborty-b36426201 Time Zone : Indian Standard Time ( IST ) ( UTC +05:30 ) ## About Me I am Abhradeep Chakraborty, currently pursuing B.Tech in Information Technology (now in my 3rd year) at Jalpaiguri Government Engineering College, India. In the last two years, I mainly concentrated on learning and building web development things. I mainly use Django, React for my projects. But for the past 6 months, I have been learning about devops, cloud technologies ( e.g. docker, Kubernetes, AWS etc.), computer networking and core technologies (such as git, linux, gcc etc.). I also have an interest in some research based technologies like WebRTC, Web3 etc. My ultimate objective is to be a dedicated and active contributor to all of these technologies. So, I want to start this contribution journey with `git` and therefore I want to be a part of this GSoC program. ## Me & Git I have started exploring the Git codebase since Sept 2021. At first, I mainly focused on knowing the git internals. During this time I read documentations (Under `Documentation` directory). `MyFirstContribution.txt`[1], `MyFirstObjectWalk.txt`[2] and `Hacking Git`[3] helped me to learn about the git contribution workflow, how `git log` ( in other words `object walk`) works internally. Though I had read many documentations, I was still not able to fully understand the terminologies (like `refs`, `packfiles`, `blobs`, `trees` etc.). (ProGit)[https://git-scm.com/book/en/v2] helped me tremendously here. I read the full book and it cleared almost every doubt that came into my mind. I have provided my git involvement history below. Activities are sorted by timeline from top to bottom. ============================== PRs, Commits & Discussion Participations ============================== --------------------------------- [patch] Amend error messages that violate git coding style convention Github link : https://github.com/gitgitgadget/git/pull/1062 Status : Not Merged/ PR Description : There are many error strings which do not follow coding style guides. This PR was to fix those. There is a ticket dedicated to this ( github Issue - (#635)[https://github.com/gitgitgadget/git/issues/635]). The author of this issue said it would be better to manually correct these things. Outcome : As this was my first PR, I made many mistakes and I faced difficulties. Though this was not merged, I learned tremendous new things. These violating strings were in many files ( nearly 200 files). At first, I fixed all the files at once manually. I fixed errors generated by those changes. Ultimately, I learned about overall git structures, how to fix errors etc. --------------------------------- [RFC PATCH 0/1] making set-upstream have default arguments Mailing List : https://lore.kernel.org/git/20211202144354.17416-1-chakrabortyabhradeep79@xxxxxxxxx/ Status : Not Merged/ Closed Description : `git push -u` generally ( if `push.default` not set) throws an error if no arguments are provided. This patch request was to teach `git push -u` to push and set a default upstream if no arguments are provided. There is a github (issue)[https://github.com/gitgitgadget/git/issues/1061] also. Outcome : This PR let `git push -u` to push to a default branch and set that as the upstream irrespective of `push.default` configuration. I realized it was badly affecting the reason `push.default` is built for ( See this (comment)[https://lore.kernel.org/git/20220104132839.1209-1-chakrabortyabhradeep79@xxxxxxxxx/]). So, I intentionally stopped working on it. ---------------------------------- Avoid checking for -h in commands that already use parse-options API ( Issue) Github Issue Link : https://github.com/gitgitgadget/git/issues/1125 Status : Closed ---------------------------------- [PATCH] Add usage strings ci check and amend remaining usage strings Github PR Link : https://github.com/gitgitgadget/git/pull/1147 Mailing List Link : https://lore.kernel.org/git/pull.1147.git.1645030949730.gitgitgadget@xxxxxxxxx/ Status : Merged into Master Description : According to dscho’s (comment)[https://github.com/gitgitgadget/git/issues/636#issuecomment-1018660439], There were still some usage strings that violated the style guide. So, this patch request fixed those strings and added a CI check to verify all the usage strings are passing the style check. In the first version, I built a shell script to check the validity of strings. Outcome : The first commit was merged into master. There were some contradictions about the different approaches suggested by reviewers in the discussion about the last commit i.e. whether the check should be static check or runtime check, whether I should use `parse options` method or coccinelle etc (see the (mailing list)[https://lore.kernel.org/git/nycvar.QRO.7.76.6.2202251632320.11118@xxxxxxxxxxxxxxxxx/]). Lastly, it was decided that I should try the `coccinelle` method first and check if it is better ( see this (comment)[https://lore.kernel.org/git/20220304142154.2350-1-chakrabortyabhradeep79@xxxxxxxxx/]). I will try to submit a PR regarding this before April 19. ================================== MicroProject ================================== [PATCH] Partial-Clone : add a partial-clone test case Github PR Link : https://github.com/gitgitgadget/git/pull/1175 Status : Merged into Master Github Issue : https://github.com/gitgitgadget/git/issues/827 Description : In a blobless-cloned repo, `git log –follow – <path>` (`<path>` have an exact OID rename) shouldn’t download blobs of the file from where the new file is renamed. This patch added a test case to verify it. Outcome : The test was added to the master branch. Moreover, another issue was discovered while discussing the patch request. Junio discovered that the test helper function `test_subcommand_inexact` was poorly designed. Later Derrick submitted a (PR)[https://github.com/gitgitgadget/git/pull/1185] to fix it ( now merged). ================================== Review ================================== [PATCH] test-lib-functions : fix test_subcommand_inexact Github PR Link : https://github.com/gitgitgadget/git/pull/1185 Description : Junio discovered that the `test_subcommand_inexact` function (test-helper function) is poorly designed. This PR addresses that. My comment : Though I placed it under the `Review` section; but I didn’t review the code here. Instead I gave my opinion regarding this in (this comment)[https://lore.kernel.org/git/20220324181056.53824-1-chakrabortyabhradeep79@xxxxxxxxx/]. ## Proposed Project ### Abstract While sending large repositories, git has to decide what needs to be sent, how to be sent. For large repositories, git creates pack files where it packs all the related commits, trees, blobs( in the form of deltas and bases ), tags etc. Now, finding the related objects among all the objects might be very expensive. As git only knows about the branch tips, it is very expensive to find the related objects from all the objects if we try to traverse down from the branch tips to all their respective related objects. It becomes more expensive as the number of objects (i.e. commits, trees, blobs etc.) increases. Ultimately resulting in slow fetching from the git daemon. To counter that, reachability bitmaps were introduced. It uses bitmap data structure (an array of bits) to detect if an object is reachable from a particular commit or not. The commit messages of this (patch series)[https://lore.kernel.org/git/1372116193-32762-1-git-send-email-tanoku@xxxxxxxxx/] are itself a documentation of how it is achieved. All the bitmap indexes for selected commits (in a EWAH compressed form) are stored in a `.bitmap` file which has the same name of its respective packfile. The current task is to improve the performance of this bitmap approach. There are three points that we need to work on - 1. Decompression of bitmaps is slow with EWAH. To know if an object is related to a particular commit, we have to decompress the whole thing. So, we have to consider other alternative techniques besides EWAH. 2. Loading a `.bitmap` file can be slow for large bitmaps. This is because we need to read the file sequentially in order to discover the offset of each bitmap within the file. It would be better if we can create a `table of contents` for these bitmaps. 3. Regenerating bitmaps can take a long time. One possible approach is to add a new mode which only adds new bitmaps for commits introduced between successive bitmap generations. 4. If we solve the above points, we would think of other aspects like rethinking how we select which commit receives bitmaps etc. ### Previous Work I didn’t find any previous patches regarding point no. 1. But there were some discussions about this. Derrick’s (comment)[https://lore.kernel.org/git/1685a358-f033-64e0-2e12-df3a1c10af19@xxxxxxxxx/] suggested considering the (Roaring+Run)[https://roaringbitmap.org/about/] technique. It is much faster than the currently implemented (EWAH)[https://arxiv.org/abs/0901.3751]. Taylor initiated the `developing a table of contents for bitmap` work by creating a (branch and committing some patches)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table] in that branch. From the (discussion)[https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/] and the performance test results provided in that discussion, it seems that it would improve overall performance. So, this would work as a reference for me while developing the table of contents for bitmaps (obviously after running performance tests and discussing the best approach). ### Potential Problem The biggest difficulty of this project would be to replace the EWAH compression/decompression altogether with better techniques like the Roaring+Run method. I have to be aware of it so that it doesn’t generate any unintentional behavior or bugs. ### The plan As this project includes three ( actually four) sub projects, I would address them one by one - 1. Firstly I will understand the working of EWAH compression better. I have gone through their documentation and have seen git’s implementation code for it. The commit messages were really helpful here. But I think I should research more. After that, I can compare (and discuss) each approach with respect to the EWAH approach on some parameters (like amount of computation in each approach, simplicity, memory usage, correctness i.e. does it always give the correct output etc.). This comparison would lead me to take the most ideal approach in this case. There are several techniques which may work/perform better than the current EWAH implementation. I have researched some of these techniques. One of them is obviously the Roaring+Run method. I have gone through their (research paper)[https://arxiv.org/pdf/1709.07821v6.pdf] on this technique; it seems like a good approach to me. Other than that, there is a golang package called (sroar)[https://dgraph.io/blog/post/serialized-roaring-bitmaps-golang/] (i.e. Serialized Roaring bitmaps), which is an open source package ( mainly built for (Dgraph)[https://dgraph.io/]) and the creator of this package claims it is (7x faster using 15x less memory)[https://github.com/dgraph-io/sroar#benchmarks]. It doesn’t require any encoding/decoding step. I would look into it ( It needs to be implemented in C language though). I have also looked into approaches like (bloom filter)[https://en.wikipedia.org/wiki/Bloom_filter], (Hyperloglog)[https://en.wikipedia.org/wiki/HyperLogLog]. These are very fast and use very less memory. But they give probabilistic answers and sometimes they may give wrong answers. So, those would not be a good fit for us I guess. 2. I will understand the bitmap related files such as `pack-bitmap-write.c`[4], `pack-bitmap.c`[5] etc. It would help me to decide where and how I can add the code related to `table of contents` for bitmap. Taylor’s (branch work)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table] would work as a reference for me. As the Project idea section suggests, I would also include performance tests to continuously check whether my approach is really improving the performance or not. 3. As I understand more about bitmap related files, I will surely be able to think whether `adding a new mode` is really necessary or not (It is conceptually a good idea though). If necessary, then I would first make the flow design and discuss it with the community. If all is good, I will start working on it. Creating and running performance tests is a must. 4. While developing all of these, my understanding of bitmap and its implementation will get better. So, that would help me to think more practically about the questions provided in the `Project Idea` section. Then I would work on it accordingly. ## Estimated Timeline My Semester exams will most probably start from May last week. So, I may be inactive (or less active) that time. (Smart India Hackathon)[https://www.sih.gov.in] (semi finals and finals) may also be held during the GSoC event. I will get you notified about that. --> Period - Pre GSoC Period - Till May 20 --> Tasks - Research more on the bitmap files, compression techniques etc. - See if there are any other (better) techniques for it. --> Period - Community Bonding - May 20 - June 12 --> Tasks - Discussion on the project started - Decide which approach is better for compression/decompression of bitmaps. - If a decision is not taken, create branches for each approach. - Create branches for each sub project and make skeletons for them. --> Period - Coding Phase 1 - June 13 - July 25 --> Tasks - Start working on sub-project no. 1 (i.e. bitmap compression technique) - Add Performance test - If multiple branch are create for this sub-project, compare the test results - Take the best performing approach, delete other branches. - Update documentation accordingly - Fix bugs - Finish sub-project 1 --> Period - Coding Phase 2 - July 25 - Sept 4 --> Tasks - Fix sub-project-1 related small bugs - Start working on sub-project-2 (i.e. adding `table of content` for bitmap files) - Add Performance test - Add Documentation - Fix bugs - If finished, start working on sub-project-3 (i.e. adding a `new mode`) - Try to finish it with documentation and performance tests --> Period - Final Week 1 - Sept 5 - Sept 12 --> Tasks - If both are finished, fix bugs and test overall performance. Else extend the period. - Discuss about the last sub project (i.e. questions ) - If anything needs to be changed/modified or added, extend the GSoC Period --> Period - Coding Phase 3 - Sept 13 - Nov 13 --> Tasks - Finish code related to all sub projects - Add overall documentation - Add overall performance tests ## Blogging about Git I love to write blogs and technical articles. Those are my ways to connect and communicate with the developers community. I have previously written articles about Django in a portal named (GeeksForGeeks)[https://auth.geeksforgeeks.org/user/riter79/articles]. Recently I started writing blogs on (Medium platform)[https://medium.com/@abhra303]. During the GSoC event, I will frequently write blogs about my progress. I will also share the problems I face and the solutions of those problems that I consider. I got interested in git by reading Olga’s GSoC blogs. That led me to contribute to this codebase. So, someone like me will hopefully be motivated by seeing my blogs ;-) ## Availability I intentionally freed my summer slot for GSoC. So, there wouldn’t be much disturbance/inactive days while working on my project. Generally, I can spend nearly 35-40 hours a week on this project. If the college gets closed for some reason, then the amount of available time will increase. Smart India Hackathon finale would be held in July. So, I may be inactive for a few days. But overall, I will be available most of the time. ## Post GSoC Once the GSoC finishes, I would look into other issues/tickets and would send PRs solving those issues. As I said before, I want to be a long term contributor to git - I would definitely maintain the flow and continue to work on improving the overall git tool. ## Experience with Open Source I have a pretty good contribution history in open source. The organizations/tools where I contributed to are - - Mozilla/bedrock - written mainly in Python Django and javaScript link - https://github.com/mozilla/bedrock/commits?author=Abhra303 - Datree - a Kubernetes misconfiguration checker tool written mainly in golang link - https://github.com/datreeio/datree/commits?author=Abhra303 - KubeBuilder - A tool to extend the Kubernetes API by creating custom CRDs - written mainly in golang. I recently started looking into it. link : https://github.com/kubernetes-sigs/kubebuilder/issues/2567 ## Motivation I always wanted to develop/contribute to projects that affect millions of people. Git is a tool that most ( or maybe every) developers use for their daily work. Contributing to it means my written code will be used behind the scenes in every developer’s machine. That’s my motivation. ## Closing Remarks Lastly, I want to say that I am a dedicated, research preferred quick learner. I am always ready to solve real world problems and learn required skills for them. For example, I recently took part in a hackathon called “Cloud Native Hackathon” and won 1st place in the category of “best use of Datree”. I knew nothing about (Datree)[https://datree.io/] ( and (Kubernetes)[https://kubernetes.io] partially) before this hackathon and the hackathon was only 2 days long. But I researched hard ( to know the Kubernetes misconfigurations) and learned jsonSchema ( Here is my (blog link)[https://medium.com/@abhra303/how-i-won-cloud-native-hackathon-2021-4b1f5b2a84f2] if you want to know my full experience). I hope you’ll give me a chance to make an impact among the developers community by considering my application. Thanks & Regards, Abhradeep Chakraborty [1] https://github.com/gitgitgadget/git/blob/master/Documentation/MyFirstContribution.txt [2] https://github.com/gitgitgadget/git/blob/master/Documentation/MyFirstObjectWalk.txt [3] https://git-scm.com/docs/user-manual#hacking-git [4] https://github.com/git/git/blob/master/pack-bitmap-write.c [5] https://github.com/git/git/blob/master/pack-bitmap.c