Hi everyone, This is my proposal for the project "Implement consistency checks for refs". The web version can be viewed at https://docs.google.com/document/d/1pWnnyykGmJIN-wyosZ3PtueFfs_BRdvJq-cwroRorBI/ ---------- # Implement consistency check for refs ## Personal Information + Name: Jialuo She + Email: shejialuo@xxxxxxxxx + Mobile no.: (+86) 13019582712 + Education: Xidian University, Shanxi, China + Year: Final Year + Degree: Master in Software Engineer + GitHub: https://github.com/shejialuo + Blog: https://luolibrary.com/ ## About Me I am someone who loves coding, and my favorite quote is from "The Shawshank Redemption": "Get busy living, or get busy dying". In my final year as a student, I hope to continue doing valuable things. I have some basic open-source experience, such as improving documentation for Angular [1]-[2], adding some features to the community [3]. I also maintain some open-source projects, mainly course codes, book answers, and some practical in-action tutorials to help others [4]. Additionaly, I write blogs to document and help others with some of the public courses I’ve taken [5]. I have written a toy git by myself due to my interest [6]. I am currently interning at NVIDIA, where I started in the software department working on CUDA Test Development. My current position is in the hardware department as a Tegra System Architect, and I will become a full-time employee this July. I hope to use this opportunity to continue participating in the git community and contribute to open source in the future. ## Overview and Background As [7] shows, when the "packed-refs" file gets corrupted, the git-fsck(1) command cannot detect this situation where the "packed-refs" file contains corrupted binary zeros. The current mechanism for git-fsck(1) command can be classified into two parts. The first part is to check the object database. For every loose file which is under `$GIT_DIR/objects`, it will use the internal method `for_each_loose_file_in_objdir` to check whether the result of the content-hash is consistent. The second part is to check the "packed-refs" file, it will open and parse it to iterate the refs which provides an implicit consistency check. However, as [8] shows, when there are some null bytes, maybe the read file operation will just simply ignore those null bytes. It turns out that the current git-fsck(1) command mainly focuses on checking the consistency of the object database. It lacks the functionality to check the consistency ref database explicitly. The upcoming retable backend will soon be released which is a binary format which would be hard for users to check the corruption. And this project aims at implementing consistency check for refs suggested by [8] and [9]. And there are two backends for git: one is file and another is retable. Initially these checks may only apply to the "file" backend and then apply to the "retable" backend. The expected project difficulty is medium and the expected project size would be 175 hours or 350 hours depending on whether to implement consistency check for "retable" backend. ## Pre-GSoC I first got in touch with Git codebase in February, 2024. I have only submitted a minor patch as the microproject to get familiar with the workflow of the Git community. ### MicroProject t9117: prefer test_path_* helper functions Mailing list thread: https://lore.kernel.org/git/20240304095436.56399-2-shejialuo@xxxxxxxxx/ Status: merged into master Description: test -(e|d) does not provide a nice error message when we hit test failures, so use test_path_exists, test_path_is_dir instead. ### Discussion I have discussed with the community for some implementation ideas in [10] which is the fundamental of the "Proposed Project" part. ### Learning the Source Code During the time from completing the microproject until now, I have primarily read the source code related to `fsck.c` and ref internals. I have summarized the implementation diagram of the ref mechanism as shown in [11]. ### Previous Work This project might be a new idea here. It is proposed mainly due to the new-released retable backend [12]. ## Proposed Project As [10] shows, Patrick wrote: > Some ideas from the top of my head: > > - generic > - Ensure that all ref names are conformant. > - Ensure that there are no directory/file conflicts for the ref > names. > - files > - Ensure that "packed-refs" is well-formatted. > - Ensure that refs in "packed-refs" are ordered lexicographically. > - Check for corrupted loose refs in "refs/". > - reftable > - Ensure that there are no garbage files in "reftable/". > - Ensure that "tables.list" is well-formatted. > - Ensure that each table is well-formatted. > - Ensure that refs in each table are ordered correctly. ### Generic It's easy to ensure that all ref names are conformant. The git-check-ref-format(1) provides the functionality to check whether the name of a ref is ok. It eventually calls the `check_refname_component` in the `ref.c` file. And I have understood the rules of the ref name as the following: - Hierarchial Grouping: - Valid: `refs/heads/feature/new-feature` - Invalid: `refs/heads/.hidden-branch` (begins with a dot) - Invalid: `refs/heads/feature.lock` (ends with `.lock`) - Contain at Least One Slash - Valid: `refs/heads/main` - Invalid: `main` (no slash) - No Consecutive Dots - Valid: `refs/heads/feature/new.feature` - Invalid: `refs/heads/feature..new-feature` (contains `..`) - No ASCII Control Characters, Space, Tilde, Caret, or Colon - Valid: `refs/heads/feature/new-feature` - Invalid: `refs/heads/feature^new` (contains `^`) - Invalid: `refs/heads/feature: new` (contains `:` and space) - No Question-Mark, Asterisk, or Open Bracket - Valid: `refs/heads/feature/new-feature` - Invalid: `refs/heads/feature/new?feature` - Invalid: `refs/heads/feature/*` (contains `*`) - No Leading or Trailing Slash, No Multiple Consecutive Slashes - Valid: `refs/heads/feature/new-feature` - Invalid: `/refs/heads/feature/new-feature` (begins with `/`) - Invalid: `refs/heads//new-feature` (contains `//`) - No Ending with a Dot - Valid: `refs/heads/feature/new-feature` - Invalid: `refs/heads/feature.` (ends with `.`) - No Sequence `@{` - Valid: `refs/heads/feature/new-feature` - Invalid: `refs/heads/feature@{time}` (contains `@{`). - Not the Single Character `@` - Valid: `refs/heads/feature/@new-feature` - Invalid: `@` (single character `@`) - No Backslash - Valid: `refs/heads/feature/new-feature` - Invalid: `refs/heads/feature\new-feature` So for ref names, we could reuse this function for every backend. ### The Interface For Two Backends >From the discussion in [10], Patrick wrote: > For what it's worth, not all of the checks need to be implemented as > part of GSoC. At a minimum, it should result in the infra to allow for > backend-specific checks and a couple of checks for at least one of the > backends. I think the most important thing for this project is to setup the infrastructure with good flexibility which allows backend-specific check and programmer can add checks easily. This is the reason why I dive into ref internal, I want to understand how Git handles these two interfaces. And I want to reuse this idea to add ref consistency checks. As picture [11] shows, the `ref_store` is the core abstraction provided by the `ref_internal.h`, it is the base class which contains the `be` pointer which provides backend-specific functions whose interfaces are defined in the `ref_storage_be`. In my view, I think we could reuse this polymorphism. We could define some common interfaces or only one interface in the `ref_storage_be`. For every backend, it needs to provide its own function pointer. And we could reuse `ref_store` structure to do this. At last, we can use this polymorphism in `fsck.c` to elegantly setup the infrastructure. And it has the following advantages: - Reuse the current polymorphism provided by `ref_store`. - We can add checks easily for future. ### The Concrete Checks At current, this project proposal cannot determine the concrete checks. It should be discussed to make the checks concrete and clear. ### Timeline Pre-GSoC (Until May 1) - Continue to read the source code especially for retable backend. Understand the retable format to determine the concrete checks with mentors. Community Bonding (May 1 - May 26) - Talk with mentors to determine the detailed infrastructure should be implemented to provide flexibility for programmers. Also, determine the checks for the generic part and continue understanding the retable format and code. Phase I (May 27 - July 7) - Implement the infrastructure to support both file-backend and retable-backend. Write tests to make sure the correctness of the infrastructure. And then add the some generic ref checks both for file-backend and retable-backend (such as use existing `check_refname_component`) and write tests. Phase II (July 7 - Aug 19) - Talk with mentors to determine the ref checks for file-backend and continue to add ref checks for file-backend. Also add write unit tests. Final Week (Aug 19 - Aug 26) - Fix any bugs. - Write the final report. However, in my perspective, I hope this project could last 22 weeks and I could add ref checks for retable-backend. But due to uncertainly, I omit the timeline here. ### Availability Currently, I am engaged in a remote internship, which allows me to work 2-3 hours each evening from Monday to Friday. On weekends, I am able to dedicate 8 hours each day to work. As I am in my final year and do not have exams, I can commit to 20-30 hours of work per week. Even after becoming a full-time employee in July, my schedule will remain similar to my internship period. Additionally, I am able to continue working until November (22 Week Project). ## Post-GSoC After completing the GSoC, I plan to continue working on the code related to retable and read other modules in Git. This was my original intention for participating in GSoC, to integrate into the community through this opportunity and, if possible, to become a mentor in the future, igniting the flame in others' hearts. Most importantly, to "Get busy living." ## Closing Remarks Thanks for every review for reviewing this proposal. It’s my honor to work with the community. ## References [1] https://github.com/angular/angular/pull/44674 [2] https://github.com/angular/angular/pull/44628 [3] https://github.com/ehForwarderBot/efb-qq-plugin-go-cqhttp/pull/68 [4] https://github.com/shejialuo/database-systems-complete-book-solutions [5] https://luolibrary.com/categories/CS144/ [6] https://github.com/shejialuo/ugit-cpp [7] https://lore.kernel.org/git/6cfee0e4-3285-4f18-91ff-d097da9de737@xxxxxxx/ [8] https://lore.kernel.org/git/20240120010055.GC117170@xxxxxxxxxxxxxxxxxxxxxxx/ [9] https://lore.kernel.org/git/ZakIPEytlxHGCB9Y@tanuki/ [10] https://lore.kernel.org/git/Ze2E_xgfwTUzsQ92@ArchLinux/ [11] https://s2.loli.net/2024/03/23/3YGoBgZzHNtOfVw.png [12] https://lore.kernel.org/git/cover.1706601199.git.ps@xxxxxx/