Re: [PATCH v4 1/6] kbuild: add a tool to generate a list of files ignored by git

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Feb 2, 2023 at 8:08 PM Nicolas Schier <nicolas@xxxxxxxxx> wrote:
>
> On Thu, Feb 02, 2023 at 12:37:11PM +0900 Masahiro Yamada wrote:
> > In short, the motivation of this commit is to build a source package
> > without cleaning the source tree.
> >
> > The deb-pkg and (src)rpm-pkg targets first run 'make clean' before
> > creating a source tarball. Otherwise build artifacts such as *.o,
> > *.a, etc. would be included in the tarball. Yet, the tarball ends up
> > containing several garbage files since 'make clean' does not clean
> > everything.
> >
> > Cleaning the tree every time is annoying since it makes the incremental
> > build impossible. It is desirable to create a source tarball without
> > cleaning the tree.
> >
> > In fact, there are some ways to archive this.
> >
> > The easiest way is 'git archive'. Actually, 'make perf-tar*-src-pkg'
> > does this way, but I do not like it because it works only when the source
> > tree is managed by git, and all files you want in the tarball must be
> > committed in advance.
> >
> > I want to make it work without relying on git. We can do this.
> >
> > Files that are not tracked by git are generated files. We can list them
> > out by parsing the .gitignore files. Of course, .gitignore does not cover
> > all the cases, but it works well enough.
> >
> > tar(1) claims to support it:
> >
> >   --exclude-vcs-ignores
> >
> >     Exclude files that match patterns read from VCS-specific ignore files.
> >     Supported files are: .cvsignore, .gitignore, .bzrignore, and .hgignore.
> >
> > The best scenario would be to use 'tar --exclude-vcs-ignores', but this
> > option does not work. --exclude-vcs-ignore does not understand any of
> > the negation (!), preceding slash, following slash, etc.. So, this option
> > is just useless.
> >
> > Hence, I wrote this gitignore parser. The previous version [1], written
> > in Python, was so slow. This version is implemented in C, so it works
> > much faster.
> >
> > This tool traverses the source tree, parsing the .gitignore files. It
> > prints the file paths that are not tracked by git. The output can be
> > used for tar's --exclude-from= option.
> >
> > [How to test this tool]
> >
> >   $ git clean -dfx
> >   $ make -s -j$(nproc) defconfig all                       # or allmodconifg or whatever
> >   $ git archive -o ../linux1.tar --prefix=./ HEAD
> >   $ tar tf ../linux1.tar | LANG=C sort > ../file-list1     # files emitted by 'git archive'
> >   $ make scripts_exclude
> >     HOSTCC  scripts/gen-exclude
> >   $ scripts/gen-exclude --prefix=./ -o ../exclude-list
> >   $ tar cf ../linux2.tar --exclude-from=../exclude-list .
> >   $ tar tf ../linux2.tar | LANG=C sort > ../file-list2     # files emitted by 'tar'
> >   $ diff  ../file-list1 ../file-list2 | grep -E '^(<|>)'
> >   < ./Documentation/devicetree/bindings/.yamllint
> >   < ./drivers/clk/.kunitconfig
> >   < ./drivers/gpu/drm/tests/.kunitconfig
> >   < ./drivers/gpu/drm/vc4/tests/.kunitconfig
> >   < ./drivers/hid/.kunitconfig
> >   < ./fs/ext4/.kunitconfig
> >   < ./fs/fat/.kunitconfig
> >   < ./kernel/kcsan/.kunitconfig
> >   < ./lib/kunit/.kunitconfig
> >   < ./mm/kfence/.kunitconfig
> >   < ./net/sunrpc/.kunitconfig
> >   < ./tools/testing/selftests/arm64/tags/
> >   < ./tools/testing/selftests/arm64/tags/.gitignore
> >   < ./tools/testing/selftests/arm64/tags/Makefile
> >   < ./tools/testing/selftests/arm64/tags/run_tags_test.sh
> >   < ./tools/testing/selftests/arm64/tags/tags_test.c
> >   < ./tools/testing/selftests/kvm/.gitignore
> >   < ./tools/testing/selftests/kvm/Makefile
> >   < ./tools/testing/selftests/kvm/config
> >   < ./tools/testing/selftests/kvm/settings
> >
> > The source tarball contains most of files that are tracked by git. You
> > see some diffs, but it is just because some .gitignore files are wrong.
> >
> >   $ git ls-files -i -c --exclude-per-directory=.gitignore
> >   Documentation/devicetree/bindings/.yamllint
> >   drivers/clk/.kunitconfig
> >   drivers/gpu/drm/tests/.kunitconfig
> >   drivers/hid/.kunitconfig
> >   fs/ext4/.kunitconfig
> >   fs/fat/.kunitconfig
> >   kernel/kcsan/.kunitconfig
> >   lib/kunit/.kunitconfig
> >   mm/kfence/.kunitconfig
> >   tools/testing/selftests/arm64/tags/.gitignore
> >   tools/testing/selftests/arm64/tags/Makefile
> >   tools/testing/selftests/arm64/tags/run_tags_test.sh
> >   tools/testing/selftests/arm64/tags/tags_test.c
> >   tools/testing/selftests/kvm/.gitignore
> >   tools/testing/selftests/kvm/Makefile
> >   tools/testing/selftests/kvm/config
> >   tools/testing/selftests/kvm/settings
> >
> > [1]: https://lore.kernel.org/all/20230128173843.765212-1-masahiroy@xxxxxxxxxx/
> >
> > Signed-off-by: Masahiro Yamada <masahiroy@xxxxxxxxxx>
> > ---
> >
> > (no changes since v3)
> >
> > Changes in v3:
> >  - Various code refactoring: remove struct gitignore, remove next: label etc.
> >  - Support --extra-pattern option
> >
> > Changes in v2:
> >  - Reimplement in C
> >
> >  Makefile              |   4 +
> >  scripts/.gitignore    |   1 +
> >  scripts/Makefile      |   2 +-
> >  scripts/gen-exclude.c | 623 ++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 629 insertions(+), 1 deletion(-)
> >  create mode 100644 scripts/gen-exclude.c
> >
> > diff --git a/Makefile b/Makefile
> > index 2faf872b6808..35b294cc6f32 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -1652,6 +1652,10 @@ distclean: mrproper
> >  %pkg: include/config/kernel.release FORCE
> >       $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.package $@
> >
> > +PHONY += scripts_exclude
> > +scripts_exclude: scripts_basic
> > +     $(Q)$(MAKE) $(build)=scripts scripts/gen-exclude
> > +
> >  # Brief documentation of the typical targets used
> >  # ---------------------------------------------------------------------------
> >
> > diff --git a/scripts/.gitignore b/scripts/.gitignore
> > index 6e9ce6720a05..7f433bc1461c 100644
> > --- a/scripts/.gitignore
> > +++ b/scripts/.gitignore
> > @@ -1,5 +1,6 @@
> >  # SPDX-License-Identifier: GPL-2.0-only
> >  /asn1_compiler
> > +/gen-exclude
> >  /generate_rust_target
> >  /insert-sys-cert
> >  /kallsyms
> > diff --git a/scripts/Makefile b/scripts/Makefile
> > index 32b6ba722728..5dcd7f57607f 100644
> > --- a/scripts/Makefile
> > +++ b/scripts/Makefile
> > @@ -38,7 +38,7 @@ HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
> >  endif
> >
> >  # The following programs are only built on demand
> > -hostprogs += unifdef
> > +hostprogs += gen-exclude unifdef
> >
> >  # The module linker script is preprocessed on demand
> >  targets += module.lds
> > diff --git a/scripts/gen-exclude.c b/scripts/gen-exclude.c
> > new file mode 100644
> > index 000000000000..5c4ecd902290
> > --- /dev/null
> > +++ b/scripts/gen-exclude.c
> > @@ -0,0 +1,623 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +//
> > +// Traverse the source tree, parsing all .gitignore files, and print file paths
> > +// that are not tracked by git.
> > +// The output is suitable to the --exclude-from option of tar.
> > +// This is useful until the --exclude-vcs-ignores option gets working correctly.
> > +//
> > +// Copyright (C) 2023 Masahiro Yamada <masahiroy@xxxxxxxxxx>
> > +
> > +#include <dirent.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <fnmatch.h>
> > +#include <getopt.h>
> > +#include <stdarg.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <unistd.h>
> > +
> > +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
> > +
> > +// struct pattern - represent an ignore pattern (a line in .gitignroe)
> > +// @negate:          negate the pattern (prefixing '!')
> > +// @dir_only:        only matches directories (trailing '/')
> > +// @path_match:      true if the glob pattern is a path instead of a file name
> > +// @double_asterisk: true if the glob pattern contains double asterisks ('**')
> > +// @glob:            glob pattern
> > +struct pattern {
> > +     bool negate;
> > +     bool dir_only;
> > +     bool path_match;
> > +     bool double_asterisk;
> > +     char glob[];
> > +};
> > +
> > +struct pattern **patterns;
>
> Is there a reason, why patterns is not static?  (sparse asked)


No reason - I just forgot to run sparse.
Thanks for catching it.






> > +     q = modified_pattern;
> > +     for (p = pattern; *p; p++) {
> > +             if (!strncmp(p, "**/", 3)) {
> > +                     // "**/" means zero of more sequences of '*/".
> > +                     // "foo**/bar" matches "foobar", "foo*/bar",
> > +                     // "foo*/*/bar", etc.
> > +                     while (slash_diff-- > 0) {
> > +                             *q++ = '*';
> > +                             *q++ = '/';
> > +                     }
> > +
> > +                     if (slash_diff == 0) {
> > +                             *q++ = '*';
> > +                             *q++ = '/';
> > +                     }
> > +
> > +                     if (slash_diff < 0)
> > +                             slash_diff++;
> > +
> > +                     p += 2;
> > +             } else if (!strcmp(p, "/**")) {
> > +                     // A trailing "/**" matches everything inside.
>
> In v2 you also checked against "(*p + 3) == '\0'".  Is the explicit check
> against end-of-string really not needed here?  (pattern = "whatever/**/*.tmp"?)


This detects a trailing "/**".


See this documentation:
https://github.com/git/git/blob/v2.39.1/Documentation/gitignore.txt#L123



"whatever/**/*.tmp" is detected by the previous
     if (!strncmp(p, "**/", 3))


strcmp(p, "/**") only matches the pattern at the end,
while strncmp(p, "**/", 3) matches the pattern anywhere.


Anyway, I will throw away this code in v5.






> > +}
> > +
> > +// Return the length of the initial segment of the string that does not contain
> > +// the unquoted sequence of the given character. Similar to strcspn() in libc.
>
> I struggled across that comment and it took me quite some time to match it to
> strcspn_trailers() behaviour.  I expect it to strip all unescaped occurrences
> of c at the end of str and return the resulting strlen.  After reading it
> several times, I can get a match.  I _think_ main confusion came from my (quite
> imperfect) English:
>
>   "one two  "
>    ^^^         initial segment of string not containing unquoted c ??
>
>    ^^^^^^^     substr that is considered by strcspn_trailer
>
> But this is just about a comment and I'm sure I understand what is intended.
> No action required.


I am not good at English.

Indeed, this comment is really confusing.

Something like the following would have been clearer.

// This function strips the unescaped sequence of the given char from the end
// of the string, and returns the length of the resulting substring.





>
> > +static size_t strcspn_trailer(const char *str, char c)
> > +{
> > +     bool quoted = false;
> > +     size_t len = strlen(str);
> > +     size_t spn = len;
> > +     const char *s;
> > +
> > +     for (s = str; *s; s++) {
> > +             if (!quoted && *s == c) {
> > +                     if (s - str < spn)
> > +                             spn = s - str;
> > +             } else {
> > +                     spn = len;
>
> Is this really intended?  Or 'spn = str - s + 1'?


I think you meant, 'spn = s - str + 1'

My code works, but I think yours is cleaner
because it does not require 'len'.




BTW, I read the source code of GIT.

GIT's implementation is here:
https://github.com/git/git/blob/v2.39.1/dir.c#L934






>
> > +
> > +                     if (!quoted && *s == '\\')
> > +                             quoted = true;
> > +                     else
> > +                             quoted = false;
> > +             }
> > +     }
> > +
> > +     return spn;
> > +}
> > +
> > +// Add an gitignore pattern.
> > +static void add_pattern(char *s, const char *dirpath)
> > +{
> > +     bool negate = false;
> > +     bool dir_only = false;
> > +     bool path_match = false;
> > +     bool double_asterisk = false;
> > +     char *e = s + strlen(s);
> > +     struct pattern *p;
> > +     size_t len;
> > +
> > +     // Skip comments
> > +     if (*s == '#')
> > +             return;
> > +
> > +     // Trailing spaces are ignored unless they are quoted with backslash.
> > +     e = s + strcspn_trailer(s, ' ');
> > +     *e = '\0';
> > +
> > +     // The prefix '!' negates the pattern
> > +     if (*s == '!') {
> > +             s++;
> > +             negate = true;
> > +     }
> > +
> > +     // If there is slash(es) that is not escaped at the end of the pattern,
> > +     // it matches only directories.
>
> Are escaped slashes allowed in file names in git?  I think use of original
> strcspn() would have been enough.


Perhaps, I had some reason to implement it like this, but
I cannot recall it.



Anyway, GIT's implementation is very simple:

https://github.com/git/git/blob/v2.39.1/dir.c#L634

I will follow that.





>
> > +
> > +     if (path_match) {
> > +             strcat(p->glob, dirpath);
> > +             strcat(p->glob, "/");
> > +     }
> > +
> > +     strcat(p->glob, s);
> > +
> > +     debug("Add pattern: %s%s%s\n", negate ? "!" : "", p->glob,
> > +           dir_only ? "/" : "");
> > +
> > +     if (nr_patterns >= alloced_patterns) {
> > +             alloced_patterns += 128;
> > +             patterns = xrealloc(patterns,
> > +                                 sizeof(*patterns) * alloced_patterns);
> > +     }
> > +
> > +     patterns[nr_patterns++] = p;
> > +}
> > +
> > +static void *load_gitignore(const char *dirpath)
> > +{
> > +     struct stat st;
> > +     char path[PATH_MAX], *buf;
> > +     int fd, ret;
> > +
> > +     ret = snprintf(path, sizeof(path), "%s/.gitignore", dirpath);
> > +     if (ret >= sizeof(path))
> > +             error_exit("%s: too long path was truncated\n", path);
> > +
> > +     // If .gitignore does not exist in this directory, open() fails.
> > +     // It is ok, just skip it.
> > +     fd = open(path, O_RDONLY);
> > +     if (fd < 0)
> > +             return NULL;
>
> Why don't you check against errno == 2 (ENOENT)?  I assume, no other
> errno value is expected, but for me it feels a bit odd to not check it
> and exit loudly if something (unlikely) like EMFILE causes open() to
> fail.


Good suggestion.

I will fix it.

GIT also checks this:

https://github.com/git/git/blob/v2.39.1/wrapper.c#L399


>
> > +
> > +     if (fstat(fd, &st) < 0)
> > +             perror_exit(path);
> > +
> > +     buf = xmalloc(st.st_size + 1);
> > +     if (read(fd, buf, st.st_size) != st.st_size)
> > +             perror_exit(path);
> > +
> > +     buf[st.st_size] = '\0';
> > +     if (close(fd))
> > +             perror_exit(path);
> > +
> > +     return buf;
> > +}
> > +
> > +// Parse '.gitignore' in the given directory.
> > +static void parse_gitignore(const char *dirpath)
> > +{
> > +     char *buf, *s, *next;
> > +
> > +     buf = load_gitignore(dirpath);
> > +     if (!buf)
> > +             return;
> > +
> > +     debug("Parse %s/.gitignore\n", dirpath);
> > +
> > +     for (s = buf; *s; s = next) {
> > +             next = s;
> > +
> > +             while (*next != '\0' && *next != '\n')
>
> Not relevant for in-tree use: git does not complain about '\0' in a .gitignore
> but also handles the remaining part of the file.
>


You are right.

I confirmed it from the source code:
https://github.com/git/git/blob/v2.39.1/dir.c#L1141


I will follow that.





>
> Testing with some strange patterns seems to reveal some missing points.  It
> should not be problematic, as nobody wants to write such .gitignore patterns,
> but for completeness:
>
>   $ mkdir -p test/foo/bar
>   $ touch test/foo/bar/baz.tmp
>   $ cat <<-eof >test/.gitignore
>   **/*.tmp
>   **/baz.tmp
>   foo/**/*.tmp
>   **/bar/baz.tmp
>   /**/*.tmp
>   eof
>   $ cd test
>   $ ../scripts/gen-exclude --debug
>   [DEBUG]Add pattern: .git/
>   [DEBUG]Enter[0]: .
>   [DEBUG]  ./test: no match
>   [DEBUG]  Enter[1]: ./test
>   [DEBUG]    Parse ./test/.gitignore
>   [DEBUG]    Add pattern: ./test/**/*.tmp
>   [DEBUG]    Add pattern: ./test/**/baz.tmp
>   [DEBUG]    Add pattern: ./test/foo/**/*.tmp
>   [DEBUG]    Add pattern: ./test/**/bar/baz.tmp
>   [DEBUG]    Add pattern: ./test/**/*.tmp
>   [DEBUG]    ./test/.gitignore: no match
>   [DEBUG]    ./test/foo: no match
>   [DEBUG]    Enter[2]: ./test/foo
>   [DEBUG]      ./test/foo/bar: no match
>   [DEBUG]      Enter[3]: ./test/foo/bar
>   [DEBUG]        ./test/foo/bar/baz.tmp: no match
>   [DEBUG]      Leave[3]: ./test/foo/bar
>   [DEBUG]    Leave[2]: ./test/foo
>   [DEBUG]  Leave[1]: ./test
>   [DEBUG]Leave[0]: .
>
> Thus, no match.  Everything else I tested, did what I expected.


You are right.

test/foo/bar/baz.tmp must be ignored.


I read the code because I was curious how GIT does this.

GIT has its own fnmatch() that supports double asterisks too.
https://github.com/git/git/blob/v2.39.1/wildmatch.c#L55


I cannot write such clever code, so I will
import the matching code in v5.


V5 is almost ready for submission.
The code grew up to 1000 lines, but I can live with that.



In my local test, v5 worked correctly.


[DEBUG] Add pattern: .git/
[DEBUG] Enter[0]: .
[DEBUG]   ./test: no match
[DEBUG]   Enter[1]: ./test
[DEBUG]     Parse ./test/.gitignore
[DEBUG]     Add pattern: **/*.tmp
[DEBUG]     Add pattern: **/baz.tmp
[DEBUG]     Add pattern: foo/**/*.tmp
[DEBUG]     Add pattern: **/bar/baz.tmp
[DEBUG]     Add pattern: /**/*.tmp
[DEBUG]     ./test/foo: no match
[DEBUG]     Enter[2]: ./test/foo
[DEBUG]       ./test/foo/bar: no match
[DEBUG]       Enter[3]: ./test/foo/bar
[DEBUG]         ./test/foo/bar/baz.tmp: matches /**/*.tmp (./test/.gitignore)
[DEBUG]         Ignore: ./test/foo/bar/baz.tmp
test/foo/bar/baz.tmp
[DEBUG]         Ignore: ./test/foo/bar (due to all files inside ignored)
test/foo/bar
[DEBUG]       Leave[3]: ./test/foo/bar
[DEBUG]       Ignore: ./test/foo (due to all files inside ignored)
test/foo
[DEBUG]     Leave[2]: ./test/foo
[DEBUG]     ./test/.gitignore: no match
[DEBUG]   Leave[1]: ./test
[DEBUG] Leave[0]: .







>
> Reviewed-by: Nicolas Schier <nicolas@xxxxxxxxx>
> Tested-by: Nicolas Schier <nicolas@xxxxxxxxx>


Thanks for your close review, as always.



>
> Kind regards,
> Nicolas

--
Best Regards
Masahiro Yamada



[Index of Archives]     [Linux&nblp;USB Development]     [Linux Media]     [Video for Linux]     [Linux Audio Users]     [Yosemite Secrets]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux