On Thu, Feb 2, 2023 at 8:08 PM Nicolas Schier <nicolas@xxxxxxxxx> wrote: > > On Thu, Feb 02, 2023 at 12:37:11PM +0900 Masahiro Yamada wrote: > > In short, the motivation of this commit is to build a source package > > without cleaning the source tree. > > > > The deb-pkg and (src)rpm-pkg targets first run 'make clean' before > > creating a source tarball. Otherwise build artifacts such as *.o, > > *.a, etc. would be included in the tarball. Yet, the tarball ends up > > containing several garbage files since 'make clean' does not clean > > everything. > > > > Cleaning the tree every time is annoying since it makes the incremental > > build impossible. It is desirable to create a source tarball without > > cleaning the tree. > > > > In fact, there are some ways to archive this. > > > > The easiest way is 'git archive'. Actually, 'make perf-tar*-src-pkg' > > does this way, but I do not like it because it works only when the source > > tree is managed by git, and all files you want in the tarball must be > > committed in advance. > > > > I want to make it work without relying on git. We can do this. > > > > Files that are not tracked by git are generated files. We can list them > > out by parsing the .gitignore files. Of course, .gitignore does not cover > > all the cases, but it works well enough. > > > > tar(1) claims to support it: > > > > --exclude-vcs-ignores > > > > Exclude files that match patterns read from VCS-specific ignore files. > > Supported files are: .cvsignore, .gitignore, .bzrignore, and .hgignore. > > > > The best scenario would be to use 'tar --exclude-vcs-ignores', but this > > option does not work. --exclude-vcs-ignore does not understand any of > > the negation (!), preceding slash, following slash, etc.. So, this option > > is just useless. > > > > Hence, I wrote this gitignore parser. The previous version [1], written > > in Python, was so slow. This version is implemented in C, so it works > > much faster. > > > > This tool traverses the source tree, parsing the .gitignore files. It > > prints the file paths that are not tracked by git. The output can be > > used for tar's --exclude-from= option. > > > > [How to test this tool] > > > > $ git clean -dfx > > $ make -s -j$(nproc) defconfig all # or allmodconifg or whatever > > $ git archive -o ../linux1.tar --prefix=./ HEAD > > $ tar tf ../linux1.tar | LANG=C sort > ../file-list1 # files emitted by 'git archive' > > $ make scripts_exclude > > HOSTCC scripts/gen-exclude > > $ scripts/gen-exclude --prefix=./ -o ../exclude-list > > $ tar cf ../linux2.tar --exclude-from=../exclude-list . > > $ tar tf ../linux2.tar | LANG=C sort > ../file-list2 # files emitted by 'tar' > > $ diff ../file-list1 ../file-list2 | grep -E '^(<|>)' > > < ./Documentation/devicetree/bindings/.yamllint > > < ./drivers/clk/.kunitconfig > > < ./drivers/gpu/drm/tests/.kunitconfig > > < ./drivers/gpu/drm/vc4/tests/.kunitconfig > > < ./drivers/hid/.kunitconfig > > < ./fs/ext4/.kunitconfig > > < ./fs/fat/.kunitconfig > > < ./kernel/kcsan/.kunitconfig > > < ./lib/kunit/.kunitconfig > > < ./mm/kfence/.kunitconfig > > < ./net/sunrpc/.kunitconfig > > < ./tools/testing/selftests/arm64/tags/ > > < ./tools/testing/selftests/arm64/tags/.gitignore > > < ./tools/testing/selftests/arm64/tags/Makefile > > < ./tools/testing/selftests/arm64/tags/run_tags_test.sh > > < ./tools/testing/selftests/arm64/tags/tags_test.c > > < ./tools/testing/selftests/kvm/.gitignore > > < ./tools/testing/selftests/kvm/Makefile > > < ./tools/testing/selftests/kvm/config > > < ./tools/testing/selftests/kvm/settings > > > > The source tarball contains most of files that are tracked by git. You > > see some diffs, but it is just because some .gitignore files are wrong. > > > > $ git ls-files -i -c --exclude-per-directory=.gitignore > > Documentation/devicetree/bindings/.yamllint > > drivers/clk/.kunitconfig > > drivers/gpu/drm/tests/.kunitconfig > > drivers/hid/.kunitconfig > > fs/ext4/.kunitconfig > > fs/fat/.kunitconfig > > kernel/kcsan/.kunitconfig > > lib/kunit/.kunitconfig > > mm/kfence/.kunitconfig > > tools/testing/selftests/arm64/tags/.gitignore > > tools/testing/selftests/arm64/tags/Makefile > > tools/testing/selftests/arm64/tags/run_tags_test.sh > > tools/testing/selftests/arm64/tags/tags_test.c > > tools/testing/selftests/kvm/.gitignore > > tools/testing/selftests/kvm/Makefile > > tools/testing/selftests/kvm/config > > tools/testing/selftests/kvm/settings > > > > [1]: https://lore.kernel.org/all/20230128173843.765212-1-masahiroy@xxxxxxxxxx/ > > > > Signed-off-by: Masahiro Yamada <masahiroy@xxxxxxxxxx> > > --- > > > > (no changes since v3) > > > > Changes in v3: > > - Various code refactoring: remove struct gitignore, remove next: label etc. > > - Support --extra-pattern option > > > > Changes in v2: > > - Reimplement in C > > > > Makefile | 4 + > > scripts/.gitignore | 1 + > > scripts/Makefile | 2 +- > > scripts/gen-exclude.c | 623 ++++++++++++++++++++++++++++++++++++++++++ > > 4 files changed, 629 insertions(+), 1 deletion(-) > > create mode 100644 scripts/gen-exclude.c > > > > diff --git a/Makefile b/Makefile > > index 2faf872b6808..35b294cc6f32 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -1652,6 +1652,10 @@ distclean: mrproper > > %pkg: include/config/kernel.release FORCE > > $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.package $@ > > > > +PHONY += scripts_exclude > > +scripts_exclude: scripts_basic > > + $(Q)$(MAKE) $(build)=scripts scripts/gen-exclude > > + > > # Brief documentation of the typical targets used > > # --------------------------------------------------------------------------- > > > > diff --git a/scripts/.gitignore b/scripts/.gitignore > > index 6e9ce6720a05..7f433bc1461c 100644 > > --- a/scripts/.gitignore > > +++ b/scripts/.gitignore > > @@ -1,5 +1,6 @@ > > # SPDX-License-Identifier: GPL-2.0-only > > /asn1_compiler > > +/gen-exclude > > /generate_rust_target > > /insert-sys-cert > > /kallsyms > > diff --git a/scripts/Makefile b/scripts/Makefile > > index 32b6ba722728..5dcd7f57607f 100644 > > --- a/scripts/Makefile > > +++ b/scripts/Makefile > > @@ -38,7 +38,7 @@ HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED > > endif > > > > # The following programs are only built on demand > > -hostprogs += unifdef > > +hostprogs += gen-exclude unifdef > > > > # The module linker script is preprocessed on demand > > targets += module.lds > > diff --git a/scripts/gen-exclude.c b/scripts/gen-exclude.c > > new file mode 100644 > > index 000000000000..5c4ecd902290 > > --- /dev/null > > +++ b/scripts/gen-exclude.c > > @@ -0,0 +1,623 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > +// > > +// Traverse the source tree, parsing all .gitignore files, and print file paths > > +// that are not tracked by git. > > +// The output is suitable to the --exclude-from option of tar. > > +// This is useful until the --exclude-vcs-ignores option gets working correctly. > > +// > > +// Copyright (C) 2023 Masahiro Yamada <masahiroy@xxxxxxxxxx> > > + > > +#include <dirent.h> > > +#include <errno.h> > > +#include <fcntl.h> > > +#include <fnmatch.h> > > +#include <getopt.h> > > +#include <stdarg.h> > > +#include <stdbool.h> > > +#include <stdio.h> > > +#include <stdlib.h> > > +#include <string.h> > > +#include <sys/stat.h> > > +#include <sys/types.h> > > +#include <unistd.h> > > + > > +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) > > + > > +// struct pattern - represent an ignore pattern (a line in .gitignroe) > > +// @negate: negate the pattern (prefixing '!') > > +// @dir_only: only matches directories (trailing '/') > > +// @path_match: true if the glob pattern is a path instead of a file name > > +// @double_asterisk: true if the glob pattern contains double asterisks ('**') > > +// @glob: glob pattern > > +struct pattern { > > + bool negate; > > + bool dir_only; > > + bool path_match; > > + bool double_asterisk; > > + char glob[]; > > +}; > > + > > +struct pattern **patterns; > > Is there a reason, why patterns is not static? (sparse asked) No reason - I just forgot to run sparse. Thanks for catching it. > > + q = modified_pattern; > > + for (p = pattern; *p; p++) { > > + if (!strncmp(p, "**/", 3)) { > > + // "**/" means zero of more sequences of '*/". > > + // "foo**/bar" matches "foobar", "foo*/bar", > > + // "foo*/*/bar", etc. > > + while (slash_diff-- > 0) { > > + *q++ = '*'; > > + *q++ = '/'; > > + } > > + > > + if (slash_diff == 0) { > > + *q++ = '*'; > > + *q++ = '/'; > > + } > > + > > + if (slash_diff < 0) > > + slash_diff++; > > + > > + p += 2; > > + } else if (!strcmp(p, "/**")) { > > + // A trailing "/**" matches everything inside. > > In v2 you also checked against "(*p + 3) == '\0'". Is the explicit check > against end-of-string really not needed here? (pattern = "whatever/**/*.tmp"?) This detects a trailing "/**". See this documentation: https://github.com/git/git/blob/v2.39.1/Documentation/gitignore.txt#L123 "whatever/**/*.tmp" is detected by the previous if (!strncmp(p, "**/", 3)) strcmp(p, "/**") only matches the pattern at the end, while strncmp(p, "**/", 3) matches the pattern anywhere. Anyway, I will throw away this code in v5. > > +} > > + > > +// Return the length of the initial segment of the string that does not contain > > +// the unquoted sequence of the given character. Similar to strcspn() in libc. > > I struggled across that comment and it took me quite some time to match it to > strcspn_trailers() behaviour. I expect it to strip all unescaped occurrences > of c at the end of str and return the resulting strlen. After reading it > several times, I can get a match. I _think_ main confusion came from my (quite > imperfect) English: > > "one two " > ^^^ initial segment of string not containing unquoted c ?? > > ^^^^^^^ substr that is considered by strcspn_trailer > > But this is just about a comment and I'm sure I understand what is intended. > No action required. I am not good at English. Indeed, this comment is really confusing. Something like the following would have been clearer. // This function strips the unescaped sequence of the given char from the end // of the string, and returns the length of the resulting substring. > > > +static size_t strcspn_trailer(const char *str, char c) > > +{ > > + bool quoted = false; > > + size_t len = strlen(str); > > + size_t spn = len; > > + const char *s; > > + > > + for (s = str; *s; s++) { > > + if (!quoted && *s == c) { > > + if (s - str < spn) > > + spn = s - str; > > + } else { > > + spn = len; > > Is this really intended? Or 'spn = str - s + 1'? I think you meant, 'spn = s - str + 1' My code works, but I think yours is cleaner because it does not require 'len'. BTW, I read the source code of GIT. GIT's implementation is here: https://github.com/git/git/blob/v2.39.1/dir.c#L934 > > > + > > + if (!quoted && *s == '\\') > > + quoted = true; > > + else > > + quoted = false; > > + } > > + } > > + > > + return spn; > > +} > > + > > +// Add an gitignore pattern. > > +static void add_pattern(char *s, const char *dirpath) > > +{ > > + bool negate = false; > > + bool dir_only = false; > > + bool path_match = false; > > + bool double_asterisk = false; > > + char *e = s + strlen(s); > > + struct pattern *p; > > + size_t len; > > + > > + // Skip comments > > + if (*s == '#') > > + return; > > + > > + // Trailing spaces are ignored unless they are quoted with backslash. > > + e = s + strcspn_trailer(s, ' '); > > + *e = '\0'; > > + > > + // The prefix '!' negates the pattern > > + if (*s == '!') { > > + s++; > > + negate = true; > > + } > > + > > + // If there is slash(es) that is not escaped at the end of the pattern, > > + // it matches only directories. > > Are escaped slashes allowed in file names in git? I think use of original > strcspn() would have been enough. Perhaps, I had some reason to implement it like this, but I cannot recall it. Anyway, GIT's implementation is very simple: https://github.com/git/git/blob/v2.39.1/dir.c#L634 I will follow that. > > > + > > + if (path_match) { > > + strcat(p->glob, dirpath); > > + strcat(p->glob, "/"); > > + } > > + > > + strcat(p->glob, s); > > + > > + debug("Add pattern: %s%s%s\n", negate ? "!" : "", p->glob, > > + dir_only ? "/" : ""); > > + > > + if (nr_patterns >= alloced_patterns) { > > + alloced_patterns += 128; > > + patterns = xrealloc(patterns, > > + sizeof(*patterns) * alloced_patterns); > > + } > > + > > + patterns[nr_patterns++] = p; > > +} > > + > > +static void *load_gitignore(const char *dirpath) > > +{ > > + struct stat st; > > + char path[PATH_MAX], *buf; > > + int fd, ret; > > + > > + ret = snprintf(path, sizeof(path), "%s/.gitignore", dirpath); > > + if (ret >= sizeof(path)) > > + error_exit("%s: too long path was truncated\n", path); > > + > > + // If .gitignore does not exist in this directory, open() fails. > > + // It is ok, just skip it. > > + fd = open(path, O_RDONLY); > > + if (fd < 0) > > + return NULL; > > Why don't you check against errno == 2 (ENOENT)? I assume, no other > errno value is expected, but for me it feels a bit odd to not check it > and exit loudly if something (unlikely) like EMFILE causes open() to > fail. Good suggestion. I will fix it. GIT also checks this: https://github.com/git/git/blob/v2.39.1/wrapper.c#L399 > > > + > > + if (fstat(fd, &st) < 0) > > + perror_exit(path); > > + > > + buf = xmalloc(st.st_size + 1); > > + if (read(fd, buf, st.st_size) != st.st_size) > > + perror_exit(path); > > + > > + buf[st.st_size] = '\0'; > > + if (close(fd)) > > + perror_exit(path); > > + > > + return buf; > > +} > > + > > +// Parse '.gitignore' in the given directory. > > +static void parse_gitignore(const char *dirpath) > > +{ > > + char *buf, *s, *next; > > + > > + buf = load_gitignore(dirpath); > > + if (!buf) > > + return; > > + > > + debug("Parse %s/.gitignore\n", dirpath); > > + > > + for (s = buf; *s; s = next) { > > + next = s; > > + > > + while (*next != '\0' && *next != '\n') > > Not relevant for in-tree use: git does not complain about '\0' in a .gitignore > but also handles the remaining part of the file. > You are right. I confirmed it from the source code: https://github.com/git/git/blob/v2.39.1/dir.c#L1141 I will follow that. > > Testing with some strange patterns seems to reveal some missing points. It > should not be problematic, as nobody wants to write such .gitignore patterns, > but for completeness: > > $ mkdir -p test/foo/bar > $ touch test/foo/bar/baz.tmp > $ cat <<-eof >test/.gitignore > **/*.tmp > **/baz.tmp > foo/**/*.tmp > **/bar/baz.tmp > /**/*.tmp > eof > $ cd test > $ ../scripts/gen-exclude --debug > [DEBUG]Add pattern: .git/ > [DEBUG]Enter[0]: . > [DEBUG] ./test: no match > [DEBUG] Enter[1]: ./test > [DEBUG] Parse ./test/.gitignore > [DEBUG] Add pattern: ./test/**/*.tmp > [DEBUG] Add pattern: ./test/**/baz.tmp > [DEBUG] Add pattern: ./test/foo/**/*.tmp > [DEBUG] Add pattern: ./test/**/bar/baz.tmp > [DEBUG] Add pattern: ./test/**/*.tmp > [DEBUG] ./test/.gitignore: no match > [DEBUG] ./test/foo: no match > [DEBUG] Enter[2]: ./test/foo > [DEBUG] ./test/foo/bar: no match > [DEBUG] Enter[3]: ./test/foo/bar > [DEBUG] ./test/foo/bar/baz.tmp: no match > [DEBUG] Leave[3]: ./test/foo/bar > [DEBUG] Leave[2]: ./test/foo > [DEBUG] Leave[1]: ./test > [DEBUG]Leave[0]: . > > Thus, no match. Everything else I tested, did what I expected. You are right. test/foo/bar/baz.tmp must be ignored. I read the code because I was curious how GIT does this. GIT has its own fnmatch() that supports double asterisks too. https://github.com/git/git/blob/v2.39.1/wildmatch.c#L55 I cannot write such clever code, so I will import the matching code in v5. V5 is almost ready for submission. The code grew up to 1000 lines, but I can live with that. In my local test, v5 worked correctly. [DEBUG] Add pattern: .git/ [DEBUG] Enter[0]: . [DEBUG] ./test: no match [DEBUG] Enter[1]: ./test [DEBUG] Parse ./test/.gitignore [DEBUG] Add pattern: **/*.tmp [DEBUG] Add pattern: **/baz.tmp [DEBUG] Add pattern: foo/**/*.tmp [DEBUG] Add pattern: **/bar/baz.tmp [DEBUG] Add pattern: /**/*.tmp [DEBUG] ./test/foo: no match [DEBUG] Enter[2]: ./test/foo [DEBUG] ./test/foo/bar: no match [DEBUG] Enter[3]: ./test/foo/bar [DEBUG] ./test/foo/bar/baz.tmp: matches /**/*.tmp (./test/.gitignore) [DEBUG] Ignore: ./test/foo/bar/baz.tmp test/foo/bar/baz.tmp [DEBUG] Ignore: ./test/foo/bar (due to all files inside ignored) test/foo/bar [DEBUG] Leave[3]: ./test/foo/bar [DEBUG] Ignore: ./test/foo (due to all files inside ignored) test/foo [DEBUG] Leave[2]: ./test/foo [DEBUG] ./test/.gitignore: no match [DEBUG] Leave[1]: ./test [DEBUG] Leave[0]: . > > Reviewed-by: Nicolas Schier <nicolas@xxxxxxxxx> > Tested-by: Nicolas Schier <nicolas@xxxxxxxxx> Thanks for your close review, as always. > > Kind regards, > Nicolas -- Best Regards Masahiro Yamada