Re: [RFC][PATCH] GSoC 2023 proposal: more sparse index integration

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Feb 26, 2023 at 2:17 PM Vivan Garg <gvivan6@xxxxxxxxx> wrote:
>
> Signed-off-by: Vivan Garg <gvivan6@xxxxxxxxx>
> ---
> This is a rough draft of my proposal for the GSoC 2023 more sparse
> index integration project. I would greatly appreciate any feedback
> anyone has to offer. Thank you in advance!
>
>  .../More-Sparse-Index-Integrations.txt        | 134 ++++++++++++++++++
>  1 file changed, 134 insertions(+)
>  create mode 100644 Documentation/More-Sparse-Index-Integrations.txt
>
> diff --git a/Documentation/More-Sparse-Index-Integrations.txt b/Documentation/More-Sparse-Index-Integrations.txt
> new file mode 100644
> index 0000000000..dbe6da660f
> --- /dev/null
> +++ b/Documentation/More-Sparse-Index-Integrations.txt
> @@ -0,0 +1,134 @@
> +# More Sparse Index Integrations
> +
> +# Personal Information
> +
> +Full name: Vivan Garg
> +
> +E-mail: gvivan6@xxxxxxxxx
> +Alternate E-mail: v.garg.work@xxxxxxxxx
> +Tel: (+1)437-987-2678
> +
> +Education: University of Waterloo (Canada)
> +Major: Computer Science and Financial Management (Double-Major)
> +Year: Rising Junior
> +
> +LinkedIn: https://www.linkedin.com/in/gvivan/
> +GitHub: https://github.com/gvivan
> +Website: https://gvivan.me/
> +
> +# Before GSoC
> +
> +## Synopsis
> +
> +I've chosen the "More Sparse Index Integrations" project idea from the SoC 2023 Ideas page. The goal of this project is to integrate the experimental "sparse-index" feature and "sparse-checkout" command with existing Git commands.
> +
> +Git 2.25.0 introduced a new experimental `git sparse-checkout` command, which simplified the existing feature and improved performance for large repositories. It allows users to restrict their working directory to only the files they care about, allowing them to ensure the developer workflow is as fast as possible while maintaining all the benefits of a monorepo.
> +(Bring your monorepo down to size with sparse-checkout, Stolee).
> +
> +The pattern matching process in Git's sparse-checkout feature becomes expensive as the sparse-checkout file and repository size increase, growing quadratically. This can result in billions of pattern checks for large repositories. However, Git's new mechanism for matching based on folder prefix matches drops the quadratic growth, matching M patterns across N files in O(M+N*d) time, where d is the maximum folder depth of a file.
> +To further optimize the matching process, Git inspects files in a sorted order instead of an arbitrary order. When Git evaluates a file path, it checks whether the start of the folder path matches a recursive pattern exactly. If so, it marks everything in that folder as "included" without doing any further hashset lookups. Similarly, when Git detects the start of a folder that's outside of the specified cone, it marks everything in that folder as "excluded" without doing any further hashset lookups. This reduces the time to be closer to O(M+N) (Bring your monorepo down to size with sparse-checkout, Stolee).
> +
> +The Git Fundamentals team at GitHub has contributed a new feature to Git called the sparse index, which allows the index to focus on the files within the sparse-checkout cone in a monorepo. The sparse index stores only the information about the files within the sparse-checkout definition, instead of storing information for every file at HEAD, which can make the index much larger in a monorepo. When enabled with other performance features, the sparse index can have a significant impact on performance (Make your monorepo feel small with Git’s sparse index, Stolee).
> +
> +The sparse index differs from a normal "full" index in that it can store directory paths with the object ID for its tree object. It can be used to determine if an entire directory is out of the sparse-checkout cone and replace all of its contained file paths with a single directory path. The use of sparse index can significantly reduce the size of the index, resulting in faster operations (Make your monorepo feel small with Git’s sparse index, Stolee).
> +
> +Because "sparse-checkout" and "sparse-index" may potentially influence the logics of other Git commands and the internal data structure of Git, some work is required to optimize compatibility and user experience. That is exactly what my chosen idea proposed.
> +
> +## Benefits to Community
> +
> +By joining the community and working on this idea, I can collaborate with my mentor and fellow community members to improve the user experience for people who are working with large monorepos. Furthermore, I am committed to continuing my involvement beyond the GSoC program, not only by contributing to the community but also by sharing my experiences and mentoring future potential newcomers.
> +
> +
> +## Microproject
> +
> +t4121: modernize test style
> +Status: ready to merge
> +Description: Test scripts in file t4121-apply-diffs.sh are written in old style,where the test_expect_success command and test title are written on
> +separate lines. Therefore update the tests to adhere to the new style.
you should include a link here to Junio's what's cooking in git where
your contribution is mentioned.
> +
> +## Other Contributions
> +
> +### Reviewing
> +
> +t9700: modernize test script
> +Status: WIP
> +Description: I reviewed this patch and pointed the contributor in the right direction by providing examples, links and mentioning the best practices.
> +
> +### Patches
> +
> +MyFirstContribution: add note about SMTP server config
> +Status: WIP
> +Description: The documentation on using git-send-email previously mentioned the need to configure git for your operating system and email provider, but did not provide specific details on the relevant configuration settings. This commit adds a note specifying that the relevant settings can be found under the 'sendemail' section of Git's configuration file, with a link to the relevant documentation. The aim is to provide users with a more complete understanding of the configuration process and help them avoid potential roadblocks in setting up git-send-email.
> +
> +
> +### Related Work
> +
> +Prior works on the idea have been completed by my mentors and other community members, and these works provide a good approximation of the approach I intend to take. Here are some previous examples of commits:
> +Integration with “mv”
> +Integration with “reset”
> +Integration with “sparse-checkout”
> +Integration with “clean”
> +Integration with “blame”
> +
> +# In GSoC
> +
> +## Plan
> +
> +The proposed idea of increasing "sparse-index" integrations may seem straightforward at first glance. However, upon reviewing previous implementations, I discovered that this idea can introduce unforeseen difficulties for some functions. For example, to enable "sparse-index," we must ensure that "sparse-checkout" is compatible with the target Git command. Achieving this compatibility requires modifying the original command logic, which can lead to other unanticipated issues. Therefore, I have incorporated some additional steps in the plan outlined below to proactively address potential complications. It's worth noting that points 3-7 are part of the SoC 2023 Ideas proposed by the community and mentors.
> +
> +1. Conduct an investigation to determine if a Git command functions properly with sparse-checkout.
> +
> +2. Modify the logic of the Git command, if necessary, to ensure it functions properly with sparse-checkout. Develop corresponding tests to validate the modifications.
> +
> +3. Add tests to t1092-sparse-checkout-compatibility.sh for the builtin, with a focus on what happens for paths outside of the sparse-checkout cone.
> +
> +4. Disable the command_requires_full_index setting in the builtin and ensure the tests pass.
> +
> +5. If the tests do not pass, then alter the logic to work with the sparse index.
> +
> +6. Add tests to check that a sparse index stays sparse.
> +
> +7. Add performance tests to demonstrate speedup.
> +
> +8. If any changes are made that affect the behavior of the Git command, update the documentation accordingly. Note that such changes should be rare.
> +
> +## Timeline
> +
> +During my discussion with Victoria, she informed me that given my commitment of 175 hours, it is expected that I will be able to fully integrate two commands with sparse index during the GSOC program. My plan is to evenly distribute the work for each command over the course of the program. I am confident that I can start the project early as I have already established communication with my mentors and familiarized myself with the related documentation, although my understanding may not be comprehensive.
> +
> +Based on my prior experience with the idea, I believe I will be able to quickly get up to speed and begin working on the project. The exact timeline for each integration is difficult to determine, but I estimate that I should be able to complete one integration every two months. I have already planned out my next term, and there are only three weeks during which I would prefer to focus on other things: June 23-30 and August 1-15. However, even without an extension, I should be able to manage this timeline. With the flexibility to extend the program, it should be even easier to accommodate any potential scheduling conflicts.
> +
> +
> +## Availability
> +
> +I will respond to all communication daily and will be available throughout the duration of the program. Although I will be taking some summer courses at my university, I will not be enrolled in a typical full course load. As part of GSOC, I plan to commit to 175 hours. I have experience managing my time effectively while taking courses and working full-time internships in the past. My semester ends on August 15th, and I have no commitments for the following month, which allows me to continue working beyond the end of the semester. With the flexibility to extend the timeline of GSOC, I am confident that I will have ample time to complete the project. I have already discussed this with Victoria, the mentor for the project, and she has agreed to extend the deadline until October 2nd, if necessary. After August 15th, I will be able to work at least 8 hours per day, totaling ~360 hours of work until the October 2nd deadline. This exceeds the required commitment of 175 hours, ensuring that I will complete the project on time. Additionally, I am hoping to continue working on the project even after GSOC ends.
> +
> +# After GSoC
> +
> +I recognize the value of having our GSoC participants continue to engage with our community beyond the event. This is why I am committed to doing so myself. Participating in open-source projects, especially with a community that supports a widely-used development tool, is not only cool but also offers an opportunity to learn and grow. By continuing to participate in this community, I believe that I can make important contributions and continue to develop my skills.
> +
> +I am planning to establish an open source club at my university in the near future. The University of Waterloo is known for its strong emphasis on computer science and engineering, earning it the nickname "MIT of the North." Given this, I believe that there will be a great deal of interest in the club for a variety of reasons. Currently, there is another club called Blueprint that provides a valuable opportunity for real-world development experience through developing software products for charities. However, the entry process for this club is extremely competitive. By contrast, I think that an open source club would offer a similar experience but with a lower barrier to entry, thus making it accessible to more motivated students. Additionally, given the widespread use and vibrant community of Git, I plan to direct students to this community and am confident that many will be interested in contributing to its open source projects.
> +
> +# Some Credits to Myself
> +
> +I’ve previously completed three software developer internships and worked with small startups to large sized companies. I am currently interning with Morgan Stanley and am on the architecture team, working on a large scale equity management software.
> +
> +I'm interested in open source development as a way to give back to the community while also growing as a developer. My background in C programming language has made me particularly interested in contributing to Git, which is primarily written in C. I am also comfortable with concepts like memory allocation, thanks to my experience with C programming. Furthermore, I have studied shell scripting as part of my coursework, which makes me well-equipped to handle the project's language requirements. Another personal motivation for contributing to this project is that I have worked with monorepos before, and given that it is used by many of the larger tech companies, I want to learn more about it and help improve the user experience with it.
> +
> +Victoria mentioned that I was the first person to express interest in the project this year, either directly or via the mailing list. In my spare time, I've been contributing and reading documents while also working a full-time job (internship) and taking one course at my university. I expect to have a lot more time next term, so you can expect even more from me ;). Nonetheless, I became familiar and comfortable with the contribution process by writing, responding to, and auditing various types of patches in the community.
> +
> +With the patches I have submitted so far, I have been able to develop a deeper understanding of Git internals, project structures, commonly used APIs, test suites, required tech stacks, and coding guidelines. To further enhance my comprehension of Git, I have either read or skimmed through several relevant documents, including 'Submitting patches', 'Coding guidelines', 'Myfirstcontribution.txt', 'Git tutorial', 'Git everyday', 'readme', 'Hacking Git', drawing upon my prior knowledge where applicable. Additionally, I have been referring to the book 'Pro Git' on an as-needed basis. Furthermore, I have thoroughly read and referenced blogs such as 'Make your monorepo feel small with Git's sparse index', 'Bring your monorepo down to size with sparse-checkout', and 'Commits are snapshots, not diffs'. The advantage of having prior knowledge and experience with my proposed project idea is that I am well-prepared to tackle any upcoming challenges.
> +
> +# Closing remarks
> +
> +I am very motivated for this project because I have previously worked with monorepos and will most likely have to work with them again in my future internships. As a result, I intend to continue working on remaining commands after GSOC whenever I have free time.
> +
> +I'd like to state that I'm a genuinely enthusiastic open-source newcomer who is very much looking forward to this opportunity. I am grateful for the opportunity to contribute to Git's development, and I am committed to working diligently to strengthen the open-source ecosystem. My ultimate goal is to use this opportunity to bring new energy and ideas to the table, and to make meaningful contributions that benefit the entire community.
> +
> +I am grateful for the community's support, especially Victoria's guidance and feedback. She promptly replied to my inquiries and provided me with several resources that were instrumental in helping me get started on the project. I am truly humbled by the dedication and hard work that the community puts in to nurture and enhance this ecosystem, and I feel fortunate to have received such warm and welcoming support as a new contributor. It is an honor to be a part of this community and to work towards advancing its mission.
> +
> +Thank you so much for reading through my proposal!
> +
> +Kind Regards,
> +Vivan Garg
> +
>
> base-commit: d9d677b2d8cc5f70499db04e633ba7a400f64cbf
> --
> 2.37.0 (Apple Git-136)
>




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux