Hello all, I completed my proposal and I'm sharing it with you. Even thought it's close to the deadline I hope it gets reviewed. I'm looking for more related material like mailing list conversations to include in my proposal and learn more. How can I further improve it? Have I missed something? I know I didn't include any links here. The PDF version I submitted has links. Thanks in advance, I'm sorry for the late submission, Plato ~~~~~~~~~~ # Contact info [...] # About me Final year undergrad @ University of Athens, majoring Computer Science and minoring Mathematics. Throughout my studies I was selected to help conduct uni lab lectures voluntarily for courses: data structures in C (4 years), oop in C++ (2 years). GPA: 8.4/10. I was a founding member of my university department’s ACM student chapter, with the goal to grow and bond our uni community by organizing lectures and workshops by students, academic researchers and professionals. Mostly I was active in open source channels where I directed, along with others, conversations about how to get involved and contribute to open source projects. We even launched our own dummy project to help provide aided exposure with PRs, reviews and related procedures, which was accompanied by a talk I conducted (pt. 1, pt. 2) regarding the workflow with Git & Github. Covid outbreak hindered the development of the ACM student chapter. My computing world volunteerism does not stop there as I helped construct the website & autograder for the cryptography course in ‘15 for NTUA ECE, a different university from the one I’m enrolled in. I also gave a talk for first-year undergrads on how to switch from Windows to Linux and why, including its advantages and disadvantages. In late 2019 I was recommended by a prof. to work in my uni department to develop systems for academic and educational purposes. I focused on setting up and synchronizing LDAP databases, SCIM identity management, LSC and maintenance of our GitLab instance. # Benefits to Community I plan to become a long term Git contributor, help develop and influence its design and attract new contributors. I plan to give a talk to my fellow undergrad and graduate classmates about my experience in GSoC with Git. How they can become involved, the GSoC program, Git’s codebase structure and how things are done within the Git organization. I will invite them to participate in GSoC ‘23 with Git where, hopefully, I will mentor the selected candidates. My participation with GSoC this year will lay the foundations for me to learn and accumulate the necessary experience and understanding so I can give back to GSoC, Git, my university and fellow classmates. Throughout my GSoC journey, and after it, I’ll publish digests about my bitmap work and related knowledge under kioplato.github.io, similar to the ones in Git Rev News. I’ll also PR related parts to Git Rev News. This way my work will be transparent and accessible to anyone who’s interested. It will also serve as reference throughout my GSoC journey and towards future GSoC candidates interested in participating with Git organization. Improving Git will benefit open source projects and their communities as well, increasing productivity due to reduction in wait times as the proposed project will, hopefully, yield non-trivial speed-up over network operations. Git is a widely used technology throughout the industry and further developing it will benefit everyone who’s trying to improve our lives through technology. # Microproject & Git related work – In 2019, I planned to participate in GSoC ‘20. To familiarize myself with Git and get involved early, I picked up this microproject from GSoC ‘19 and submitted a patch converting remove_subtree() function from entry.c to use dir-iterator API instead of opendir, readdir, closedir API. This function is required to iterate directory paths after their contents, rendering my patch invalid, as pointed out by Matheus, since dir-iterator didn’t and still doesn’t offer such functionality. I didn’t follow up with a dir-iterator API patch to implement the required functionality nor did I attempt to participate in GSoC ‘20 due to having limited time. – My dream to participate in GSoC continues and I plan to participate in GSoC ‘22. I picked up from where I left off and introduced the first set of changes requested by this thread to dir-iterator and submitted RFC v1. The first set of changes refer to dir-iterator iterating directories before and/or after their contents. My goal is to considerably improve dir-iterator, guided by the needs of its possible customers like read_directory_recursive(), resulting in a simpler, more maintainable codebase. Getting selected for GSoC will put this goal on hold. – While reading through the SoC documentation in git.github.com I noticed the CI section of the General Microproject Information webpage wasn’t updated after the switch from Travis-ci to Github actions. The CI instructions were out of date, pointing to Travis-ci. I builded the site, updated the section and opened a PR to git.github.com repo, which was merged. # Proposal Abstract Pre Git v2.0.0, network operations like fetches and clones were hindered by the counting objects phase. In this phase, Git needed to calculate which objects are required to be sent. However, Git only knows the tips of all branches, so its only option was to walk down the graph. To deal with this, Git adopted JGit’s reachability bitmaps in EWAH format. This method led to a significant speed-up over the network operations, as it indexed which objects are reachable from salient commits. While EWAH bitmaps eliminate a performance bottleneck by indexing which commits are reachable from a given commit, there are possible improvements to be made. There is a sense that EWAH bitmap compression is slow in some cases. Are Roaring bitmaps better in Git’s case? There is strong evidence they will be. Are Roaring bitmaps faster to access specific parts of the bitmap due to its random-access feature and how can we use this? Is the current bitmap format sufficient to incorporate a lookup table which specifies the commits which have a bitmap? How can we improve it? This project attempts to answer these questions. The proposed project would entail a) Designing & implementing a suite of performance tests benchmarking EWAH to validate our assumption that the performance of bitmap decompression is worth pursuing. Compare the two bitmap compressions and perform any necessary changes to the .bitmap format to accommodate the new compression scheme. Speeding-up read, write bitmap operations or reducing used memory will convince us to switch to Roaring bitmaps. b) Further explore the idea of introducing a small “table of contents”, indexing which commits have a bitmap and where to find them in the .bitmap file. In this case we will need to implement a new .bitmap format to add a variable-width table of contents. # Proposal Timeline The proposed project is challenging and in combination with its experimental nature requires diligence and careful planning. I need to be realistic. I will need time to familiarize myself and be confident in my understanding regarding Git’s implementation related to bitmaps, packing objects and related material, before proceeding with the actual changes to .bitmap format and performance testing. These require time and effort, therefore I’ll reserve sufficient time to get them right: I’ll dedicate 35-40 hours/week. I will reserve more if the project requires it and I’m open to extending the duration of the program in case the proposed goals are not achieved. ## Overview The plan is to first validate the need for an alternate bitmap compression algorithm. This will be verified from the performance test suite. → If there is indeed a need, we will attempt to adopt Roaring bitmaps instead of EWAH, and compare the two implementations. Hopefully Roaring bitmaps will speed-up bitmap operations and convince us to adopt. → Regardless of the necessity of an alternate bitmap compression algorithm, the idea of introducing a “table of contents” to the bitmap format will be explored, even post GSoC. If the performance tests prove that EWAH is good enough, then the “table of contents” along with a new .bitmap format will become my primary focus. 1. Spend time understanding bitmap data structure, EWAH, Git’s bitmap format, codebase and respective implementations. Self code and tinker with codebase. Don’t be afraid to ask questions and make sure it’s clear to me what’s expected. Material: Bitmap EWAH Bitmap format Bitmaps & inverted lists Git: 1 2 3 4 5 2. Design & implement a suite of performance tests. Validate our assumption that the performance of bitmap decompression is worth pursuing. Material: Git performance tests Taylor’s .bitmap format: mail-list fork 3. Implement Roaring bitmap with run containers. Think about the architecture & design. Compare it against the existing implementation. Adopt if bitmap operations are faster. Material: Roaring paper Roaring spec Bitmap format Git: 1 2 3 4 5 4. (alternative) Dedicate some time to investigate and sketch out a new .bitmap format for the “table of contents” sub-project. Current bitmap format comes with limitations. Material: Bitmap format Taylor’s .bitmap format: mail-list fork Stolee’s chunk-format API 5. (alternative) Implement a new bitmap format to incorporate the variable-width “table of contents” metadata. Implement and run related tests to verify speed-up. Material: Bitmap format Taylor’s .bitmap format: mail-list fork Stolee’s chunk-format API ## Timeline - April 19: GSoC contributor application deadline - April 20 - May 08 (~3 weeks, 19 days, 120h): Understand codebase Tinker with and understand Bitmap, EWAH, Git’s EWAH, Bitmap format. Local testing and self code. Gather a good understanding of Git’s codebase related to bitmaps in general 1 2 3 4 5. Explore related Git commands like git gc, pack-objects, unpack-objects, repack, bundle and other related commands. See which code paths trigger and understand what’s happening. Ask questions to mentors Taylor & Sivaraam and wider audience. Read documentation. - May 09 - May 13 (~1 week, 5 days, 40h): Understand performance tests library Understand the architecture of Git’s performance tests & existing bitmap performance tests. Investigate which code paths trigger in the codebase. Consult Taylor’s performance test parts: mail-list to aid with understanding of how the performance tests work. - May 20: Accepted GSoC contributor projects announced & Community bonding period begins - May 14 - June 03 (3 weeks, 21 days, 120+h): Compression perf tests & validate assumptions Design and implement a suite of performance tests. Consult existing tests, README, conversations, Taylor’s previous work on .bitmap format: mail-list fork. Validate our assumption that the performance of bitmap decompression is worth pursuing. Discuss possible new .bitmap format. Tie up loose ends. - June 13: Coding officially begins - June 04 - July 03 (4 weeks, 30 days, 160h): Implement Roaring & compare to EWAH Implement Roaring bitmaps with run containers. Understand Roaring specification and other compressions. Self code and local testing to understand Roaring format. Read Roaring paper. Minimally implement Roaring bitmaps and compare against existing compression. Adopt if bitmap operations are faster. Discuss possible new .bitmap format. Tie up loose ends. - July 04 - July 24 (3 weeks, 21 days, 120h): Bitmap perf tests & further discuss bitmap format Consult Taylor’s previous work on .bitmap format with commit lookup table: mail-list fork. Implement more tests to identify & discuss possible improvements to be made to the current .bitmap format. Investigate how we can incorporate a variable-width table of contents to a new .bitmap format. How does the way we read hash-cache extension limit the current .bitmap format? Can we make it compatible with Stolee’s chunk-format API? Further understand current .bitmap format. Understand pack objects, deltas, related operations & code paths. Self code and tinker. - July 29: Phase 1 Evaluation deadline: Submit code - July 25 - July 29 (5 days): Submit phase 1 evaluation - July 30 - Aug 21 (4 weeks, 160h): Implement new .bitmap format ideas & run perf tests Experiment with new .bitmap format ideas and document possible speed-up. Continue discussions related to .bitmap formats. Consult previous conversations. Hopefully arrive at a satisfying new .bitmap format. Think about its architecture & design. - Aug 22 - Sep 04 (2 weeks, 80h): Unallocated time in case of delays & bad time management - Sep 05 - Sep 12 (1 week): Submit final work product & mentor evaluation. The above timeline serves mostly as a guideline. The overall pace and progress of the project depends on intermediate steps like “Compression perf tests & validate assumptions”. There is a plethora of work to be done and possible improvements to be discussed in this area. That’s the purpose behind the multiple subprojects of the proposed project i.e. additional flexibility. As a possible GSoC candidate, I want to successfully complete GSoC and get this project right. I want to explore bitmaps, their compressions and offer as much as I can. For this reason I’ll take advantage and start preparing from 20 April as the proposal timeline suggests. This will give me additional flexibility, given the nature of the project (e.g. in case EWAH compression is good enough), and confidence since I’ll have more time to work and understand underlying implementations. Even if I’m not selected, these conversations and my exposure to bitmaps will help the selected candidate. Additionally, I will try to participate in the process and help the selected candidate. # Related Work There is strong evidence that Roaring bitmaps, usually, perform better than its alternatives: – Guzun et al compared Concise, WAH, EWAH, Roaring (without run containers) with a hybrid system combining EWAH and uncompressed bitsets. In many of their tests, but not all, Roaring was superior to Concise, WAH and EWAH. – Wang et al compared a wide range of bitmap compression techniques. They conclude their comparison by a strong recommendation in favor of Roaring: “Use Roaring for bitmap compression whenever possible. Do not use other bitmap compression methods such as BBC, WAH, EWAH, PLWAH, CONCISE, VALWAH, and SBH.” – Lemire et al reviewed and evaluated the CRoaring library. They compared BitMagic, EWAH (code from this library is part of the Git version-control tool), uncompressed bitset, WAH & Concise, C++ std::vector and std::unordered_set for hash-set performance. In their tests Roaring performance was superior. They concluded by saying “No single approach can be best in all cases, but Roaring offers good all-around performance”. – CRoaring is an implementation of the Roaring bitmaps in C and C++. Will be used as reference while adopting Roaring bitmaps in Git. – Bitmap implementation in C. Excellent for getting introduced into the bitmap data structure. Will be used as reference. – Roaring format specification thoroughly describes the compressed bitmap format. Will be consulted throughout the project’s duration. – EWAH implementation in C. Excellent for getting introduced into how things are currently implemented. Git’s EWAH implementation will complement that. – Roaring bitmaps publication: Consistently faster and smaller compressed bitmaps with Roaring. Will be used as reference. – Comparison between bitmap compressions: An Experimental Study of Bitmap Compression vs. Inverted List Compression. Will be used for additional context since it provides a brief explanation of all bitmap compressions. – Taylor’s previous work on .bitmap table of contents: mail-list fork Taylor found that we could get a significant speed-up in some cases, but that the speed-up was basically obliterated whenever we had to do any follow-on traversal if the bitmap coverage wasn't completely up-to-date. Previous work that introduced EWAH bitmaps could be used for possible additional context, exposure, related conversations, code paths and reference. – Initial patch series, designed by Shawn Pearce for JGit, implemented by Vicent Marti – Initial patch improvements by Jeff v1, v2, v3, v4. – Merged initial patch: compressed bitmaps, use bitmaps when packing objects Related Github blog posts: Counting objects, Scaling monorepo maintenance # Biographical Information – Me and Git I have been using Git since 2015, in my work and to collaborate on uni and fun projects. Throughout these years I’ve read the entire pro Git book while following most parts through my terminal, reproducing the behavior, in order to make me a better software engineer. The workshop I conducted (pt. 1, pt. 2) regarding the workflow with Git & Github was heavily influenced by pro Git book which is apparent in the workshop’s pdf. I’ve also read material like hacking Git, github.blog posts, Git Rev News and similar technical documents. – Me, C and Shell C is my native programming language. I have written 80.000+ LOC and I’ve implemented various systems. Most notably: B+ Tree, LSH for KNN & clustering, simple search engine, graph editor, MPI & OpenMP parallel system, threaded server, multiprocess systems with workers, archiver, many algorithms & data structures (kd-trees, red-black, AVL, splay tree, linked-lists, skip lists, hashtable, etc), many of which I’ve converted to be thread-safe. In my uni we’re C/C++ oriented. I’ve been using Arch Linux exclusively for 4+ years to learn and tinker with my setup. I run dwm, st, self compiled Linux kernel, various shell scripts, btrfs, full disk encryption, lvm to configure my computer just how I like it. I’m confident in my C & shell skills. – Professional Experience Jan 2020 - June 2021 (1.5 year) GUnet Worked on setting up and synchronizing LDAP databases, SCIM identity management, LSC and maintenance of our GitLab instance. – Teaching 2019: Workshop regarding the workflow of Git & Github. (pt. 1, pt. 2) 2017: Workshop for first-year undergrads on how to switch from Windows to Linux and why, including its advantages & disadvantages and related utilities like IDEs, compilers, packages. 2016 - 2020: Lab assistant courses data structures in C (4 years), oop course in C++ (2 years).