Hello everyone,

Sorry about the wrapping mistake in the last patch. Here is the initial version of my proposal on "More Sparse Index Integrations". I would really appreciate comments on it. If you find any mistakes, please let me know.

The web version is here:
https://docs.google.com/document/d/1WtoLgAJYVHY_NWNscqi358FeAY-6UBARCmpTU3BAQMs/edit#

==================================================================

## More Sparse Index Integrations

Personal Details

### About Me

| Name | Shuqi Liang |
| Mobile no. | +1 (416) 272-1737 |
| Email | cheskaqiqi@xxxxxxxxx |
| Github | https://github.com/Cheskaqiqi |
| Major | Computer Science |
| Time Zone | EDT (UTC -4 hours) |

### Me & Git

I am a Computer Science undergraduate at Western University (Canada). I have learned C, Python, Java, and shell in the last two years. I have been exploring the Git codebase since January 2023. Here is some related documentation I have read:

# MyFirstContribution.txt
# SubmittingPatches
*How to make contributions to the Git project.

# MyFirstObjectWalk.txt
*How to use Git's object walk API to traverse the Git object database. Topics such as object types and IDs.

# Hacking Git
# Understanding Git — Index
# Git Internals
# Understanding Git Data Model
# Understanding Git Branching
*Underlying data model of Git and how it works.
*Types of Git objects: blobs, trees, and commits.
*The concept of branching in Git and how it is used to manage code changes.
*Two layers of Git: the low-level plumbing layer and the high-level porcelain layer.
*Relationship between the index and the working directory.

# Make your monorepo feel small with Git's sparse index
# Bring your monorepo down to size with sparse-checkout
*How the Git sparse-checkout and sparse index features can help make large monorepos feel smaller and more manageable.

# Sparse-checkout.txt
# Sparse-index.txt
I would like to thank Shaoxuan, the GSoC 2022 student, who recommended these two articles to me.
The articles cover some of the limitations and potential issues with 'sparse-checkout' and 'sparse-index'.

### Contributions

# Status wip
# Subject [Microproject]: t4113: modernize test script
# Mail List Link https://lore.kernel.org/git/20230215023953.11880-2-cheskaqiqi@xxxxxxxxx/

# Status wip
# Subject [Aided a potential GSoC student]: trace.c, git.c: removed unnecessary parameter to trace_repo_setup
# Mail List Link https://lore.kernel.org/git/20230215175510.3631-1-cheskaqiqi@xxxxxxxxx/

# Status wip
# Subject check-attr: integrate with sparse-index
# Mail List Link https://lore.kernel.org/git/20230227050543.218294-2-cheskaqiqi@xxxxxxxxx/

# Status wip
# Subject diff-files: integrate with sparse index
# Mail List Link https://lore.kernel.org/git/20230322161820.3609-2-cheskaqiqi@xxxxxxxxx/

### The Project: More Sparse Index Integrations

# What's 'sparse-checkout'
When a repository has so many files at its root, Git commands (e.g., git checkout, git status) slow to a crawl. 'sparse-checkout' allows users to restrict their working directory to only the files they care about. It is meant to make users feel like they are in a small repository, even though they are contributing to a large one.[1] If users follow the "microservices in a monorepo" pattern, 'sparse-checkout' can keep the developer workflow as fast as possible while maintaining all the benefits of a monorepo.[1]

# What's 'git index'
In Git, the index, or the staging area, is an intermediate area where changes to a Git repository are prepared before they are committed. The index file stores a list of every file at HEAD. This list of files is stored as a flat list.[2]

# What's 'sparse-index'
Although 'sparse-checkout' works very well, it leaves one problem: the Git index is still large in a monorepo. 'sparse-index' allows the index to focus on the files within the sparse-checkout cone.
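
To make the idea concrete, here is a toy model in Python of what "focusing the index on the cone" means. It is only a sketch and not Git's implementation (Git is written in C and builds sparse directory entries from tree objects); the function names `collapse_to_sparse` and `expand_to_full` and the sample paths are hypothetical, chosen for illustration. Directories outside the cone are collapsed into a single entry ending in "/", and expansion restores the flat list.

```python
"""Toy model of a sparse index: NOT Git's implementation, only an illustration."""


def collapse_to_sparse(paths, cone_dirs):
    """Collapse every directory outside the cone into one 'dir/' entry.

    Files at the root, files inside a cone directory, and files directly
    inside a parent of a cone directory stay as individual entries.
    """
    cone = [d.rstrip("/") + "/" for d in cone_dirs]
    entries, seen = [], set()
    for path in sorted(paths):
        prefix, entry = "", path
        # Walk the directory components of the path from the top down.
        for part in path.split("/")[:-1]:
            prefix += part + "/"
            inside_cone = any(prefix.startswith(c) for c in cone)
            parent_of_cone = any(c.startswith(prefix) for c in cone)
            if not inside_cone and not parent_of_cone:
                entry = prefix  # first directory that lies outside the cone
                break
        if entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries


def expand_to_full(sparse_entries, all_paths):
    """Replace each 'dir/' entry with the files below it.

    `all_paths` stands in for the tree contents Git would read to expand.
    """
    full = []
    for entry in sparse_entries:
        if entry.endswith("/"):
            full.extend(p for p in sorted(all_paths) if p.startswith(entry))
        else:
            full.append(entry)
    return full


if __name__ == "__main__":
    paths = [
        "README.md",
        "client/app.c",
        "client/ui/button.c",
        "docs/a.txt",
        "docs/deep/b.txt",
        "server/main.c",
        "server/db/x.c",
        "server/db/y.c",
    ]
    sparse = collapse_to_sparse(paths, ["client"])
    print(len(paths), "full entries ->", len(sparse), "sparse entries")
    print(sparse)
```

In this toy model the number of entries depends on how many files sit inside the chosen directories, while everything else costs one entry per collapsed directory.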
The size of the sparse index will scale with the number of files within users' chosen directories instead of the full size of the repository. When enabled together with a number of other performance features, this can bring a dramatic performance improvement.[2]

# Problem when integrated with sparse index
The idea of 'sparse index' is easy to understand, but pruning the index at the directory level can have complicated consequences. That is because, in the Git codebase, numerous places interact directly with the index in various nuanced ways. All of them assume that each index entry refers to a blob object, but sparse-directory entries violate expectations about the index format and its in-memory data structure. Many consumers in the codebase expect to iterate through all of the index entries and see only files. Compatibility layers expand a sparse index to an equivalent full index, so that even if the on-disk format is sparse, code paths that have not been integrated and tested with a sparse index can still be used.[2]

The Git Fundamentals team first started by creating the ensure_full_index() method, which converts a sparse index into a full one. But expanding a sparse index to a full one takes longer than reading a full one from disk, which goes against the idea of using 'sparse index' to enhance the user experience. To gain the performance benefits of a sparse index, we need to optimize the compatibility and teach Git to expand the sparse directory entries only when needed.

### Plan

Every integration will have similar steps, but the actual steps for each command integrated during the project will vary based on the complexity of the command chosen:
(Note that step 1 is from Shaoxuan's GSoC 2022 Git Contributor Proposal, and steps 2 and 4-7 are from the GSoC 2023 Ideas. I made some additions to the steps.)

1. Investigate a Git command and see if it behaves correctly with sparse-checkout.
Modify the Git command's logic to work better with 'sparse-checkout'. Add corresponding tests.[3]
*Step 1 is not often needed, but it is important to ensure that
*'sparse-checkout' is compatible with the Git command before continuing
*with the next step.

2. Add tests to t1092-sparse-checkout-compatibility.sh for the built-in, with a focus on what happens for paths outside of the sparse-checkout cone.
t1092-sparse-checkout-compatibility.sh creates a repository with some data shapes in it. Each test case starts by copying that repository into three new repositories with different configurations. These three are called 'full-checkout', 'sparse-checkout', and 'sparse-index', respectively:
'full-checkout' is the same as the repository, without sparse-checkout.
'sparse-checkout' has cone mode sparse-checkout enabled.
'sparse-index' has cone mode sparse-checkout and sparse index enabled.
Add tests for the Git command we want to integrate with sparse-index to t1092 before making any code change. Focus on invocations of the command that could behave differently across full-checkout, sparse-checkout, and sparse-index. Each test case should run against all three repositories and produce the same output, and it should also work when the index is expanded. After the command is integrated with sparse-index, the output and behavior should remain the same.

3. Add performance tests, so we have a baseline to measure how well the Git command does.
We need a baseline that measures the speed before the command is integrated with sparse-index. Normally we will notice that the command is quite slow because the index is expanded.

4. Disable the command_requires_full_index setting in the built-in and ensure the tests pass.

5. If the tests do not pass, then alter the logic to work with the sparse index.
Change the code so that the sparse directory entries are expanded only when needed. The command_requires_full_index setting guards all index reads to require a full index over a sparse index.
After suitable guards are placed in the codebase around uses of the index, remove the setting.

6. Add tests to check that a sparse index stays sparse.
Add an ensure_not_expanded test to t1092-sparse-checkout-compatibility.sh. We expect the index to be expanded for operations that touch paths outside the cone, but we need to ensure the index will not expand for operations that stay inside the cone.

7. Run performance tests to demonstrate speedup.

### Project Timeline

# Empty Period (Present - May 4)
My end-of-semester exams run from April 4 to April 28, so I might be a bit busy in this period. After April 28, I will continue to work on 'git diff-files' and start to work on 'git describe'.

# Community Bonding Period (May 4 - May 28)
Get familiar with the community.
I have read the related documentation about 'Sparse Index Integrations' and am working on 'git diff-files', one of the builtins that could be integrated with the sparse index. The feedback and advice I received taught me a lot, and I'm confident I can start the project at the beginning of this period.
Keep working on 'git diff-files' and 'git describe' from May 5; the expected completion date for both is May 28.

# Phase 1 (May 29 - July 9)
week 1 to week 3 (May 29 - June 18): integrate 'git write-tree' with sparse index. Use the steps above.
week 4 to week 6 (June 19 - July 9): integrate 'git diff-index' with sparse index. Use the steps above.

# Phase 2 (July 9 - August 28)
week 1 to week 3 (July 9 - July 23): integrate 'git diff-tree' with sparse index. Use the steps above.
week 4 to week 6 (July 23 - August 13): integrate 'git worktree' with sparse index. Use the steps above.
week 7 (August 13 - August 28): A buffer of one week has been kept for any unforeseen difficulties with the sparse index integrations.

### Blogging about Git

When I was a freshman, I hated writing summaries or other learning materials. But then I started writing blog posts about new knowledge to keep track of what I've learned.
I realized that when I dive into a topic and want to write it down, I think much more deeply about it than when I am just learning it. I have also learned a lot and gained many skills from others' blogs. I would love to write about my progress and experiences with the project. In this way, I could share the ideas with those interested in researching this project and help them get up to speed more quickly.

### Availability

My semester will end in late April, leaving me enough time to prepare for my GSoC project. Git will be my only project this summer. If I am selected, I will be able to work five days per week, 7-8 hours per day, or around 35-40 hours a week on the project, though I am open to putting in more effort if the work requires it. If everything goes well with the plan, I may want to participate in a hackathon for a few days with my friends in July.

### Post GSoC

I have received much support from many members of the Git community in recent months. This support has strengthened my passion for Git and inspired me to contribute more of my code to the community. Just as others have helped me, I will pay it forward by assisting and encouraging new community members. I believe that sharing knowledge and collaborating with others is the key to creating great software and achieving success in the open-source world.

I am committed to delivering high-quality work and meeting the expectations of the community. I am eager to learn from experienced community members and gain new skills and knowledge that will help me become a more valuable contributor. I am also eager to continue working with the Git community beyond the scope of the GSoC program. I believe I have much to offer the community, and I am committed to contributing to its success for a long time.

I hope that the Git community will consider my application for the GSoC program. It would be an honor to be able to contribute to such a fantastic open-source project and work with such a supportive and welcoming community.
Thank you for your time and consideration.

### References

[1] Bring your monorepo down to size with sparse-checkout.
https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/
[2] Make your monorepo feel small with Git’s sparse index.
https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/
[3] Step 1 in the plan comes from Shaoxuan’s 2022 proposal.
https://docs.google.com/document/d/1VZq0XGl-wCxECxJamKN9PGVPXaBhszQWV1jaIIvlFjE/edit

Thanks & Regards,
Shuqi