On Fri, Jan 21, 2022 at 10:41:48 +0100, Tim Wiederhake wrote: > This is a wrapper for codespell [1], a spell checker for source code. > Codespell does not compare words to a dictionary, but rather works by > checking words against a list of common typos, making it produce fewer > false positives than other solutions. > > The script in this patch works around the lack of per-directory ignore > lists and some oddities regarding capitalization in ignore lists. > The ".codespellrc" file is used to coarsly filter out translation and > git files, as scanning those makes up for roughly 50% of the run time > otherwise. > > [1] (https://github.com/codespell-project/codespell/) > > Signed-off-by: Tim Wiederhake <twiederh@xxxxxxxxxx> > --- > .codespellrc | 2 + > scripts/check-spelling.py | 135 ++++++++++++++++++++++++++++++++++++++ > 2 files changed, 137 insertions(+) > create mode 100644 .codespellrc > create mode 100755 scripts/check-spelling.py > > diff --git a/.codespellrc b/.codespellrc > new file mode 100644 > index 0000000000..0c45be445b > --- /dev/null > +++ b/.codespellrc > @@ -0,0 +1,2 @@ > +[codespell] > +skip = .git/,*.po This doesn't seem to work for me, after patch 2/3 I'm getting the following: >>> MALLOC_PERTURB_=230 /bin/python3 /home/pipo/libvirt/scripts/check-spelling.py --ignore-untracked ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― stdout: (".git/logs/HEAD", "lenght"), # line 187, "length"? (".git/logs/HEAD", "programatic"), # line 424, "programmatic"? (".git/logs/HEAD", "programatic"), # line 570, "programmatic"? (".git/logs/HEAD", "programatic"), # line 583, "programmatic"? (".git/logs/HEAD", "programatic"), # line 714, "programmatic"? (".git/logs/HEAD", "programatic"), # line 727, "programmatic"? (".git/logs/HEAD", "programatic"), # line 802, "programmatic"? (".git/logs/HEAD", "programatic"), # line 810, "programmatic"? (".git/logs/HEAD", "programatic"), # line 816, "programmatic"? (".git/logs/HEAD", "programatic"), # line 935, "programmatic"? (".git/logs/HEAD", "programatic"), # line 944, "programmatic"? (".git/logs/HEAD", "programatic"), # line 1005, "programmatic"? (".git/logs/HEAD", "programatic"), # line 1007, "programmatic"? (".git/logs/HEAD", "programatic"), # line 1008, "programmatic"? (".git/logs/HEAD", "programatic"), # line 1009, "programmatic"? (".git/logs/HEAD", "programatic"), # line 1014, "programmatic"? ... (I'll be elaborating a bit more on the exclude pattern later after playing with codespell for a bit). > diff --git a/scripts/check-spelling.py b/scripts/check-spelling.py > new file mode 100755 > index 0000000000..ce3e7d89f0 > --- /dev/null > +++ b/scripts/check-spelling.py > @@ -0,0 +1,135 @@ > +#!/usr/bin/env python3 > + > +import argparse > +import re > +import subprocess > +import os > + > + > +IGNORE_LIST = [ > + # ignore this script > + ("scripts/check-spelling.py", []), > + > + # 3rd-party: keycodemapdb > + ("src/keycodemapdb/", []), > + > + # 3rd-party: VirtualBox SDK > + ("src/vbox/vbox_CAPI", []), > + > + # 3rd-party: qemu > + ("tests/qemucapabilitiesdata/caps_", []), > + > + # other > + ("", ["msdos", "MSDOS", "wan", "WAN", "hda", "HDA", "inout"]), These can be replaced by --ignore-words-list msdos,wan,hda,inout > + ("NEWS.rst", "crashers"), IMO this can be reworded. > + ("docs/gitdm/companies/others", "Archiv"), > + ("docs/glib-adoption.rst", "preferrable"), This too. > + ("docs/js/main.js", "whats"), Skip all of docs/js? > + ("examples/polkit/libvirt-acl.rules", ["userA", "userB", "userC"]), > + ("src/libvirt-domain.c", "PTD"), > + ("src/libxl/libxl_logger.c", "purposedly"), > + ("src/nwfilter/nwfilter_dhcpsnoop.c", "ether"), > + ("src/nwfilter/nwfilter_ebiptables_driver.c", "parm"), > + ("src/nwfilter/nwfilter_learnipaddr.c", "ether"), > + ("src/qemu/qemu_agent.c", "crypted"), > + ("src/qemu/qemu_agent.h", "crypted"), > + ("src/qemu/qemu_process.c", "wee"), This too. > + ("src/security/apparmor/libvirt-lxc", "devic"), > + ("src/security/apparmor/libvirt-qemu", "readby"), > + ("src/storage_file/storage_file_probe.c", "conectix"), > + ("src/util/virnetdevmacvlan.c", "calld"), > + ("src/util/virtpm.c", "parm"), > + ("tests/qemuagenttest.c", "IST"), > + ("tests/storagepoolxml2xml", "cant"), > + ("tests/sysinfodata/", "sie"), This is technically 3rd party. > + ("tests/testutils.c", "nIn"), > + ("tests/vircgroupdata/ovirt-node-6.6.mounts", "hald"), This is also 3rd party ` > + ("tests/virhostcpudata/", "sie"), This too. > + ("tools/virt-host-validate-common.c", "sie"), > +] > + > + > +def ignore(filename, linenumber, word, suggestion): > + if len(word) <= 2: > + return True > + > + for f, w in IGNORE_LIST: > + if not filename.startswith(f): > + continue > + if word in w or not w: > + return True > + return False > + > + > +def main(): > + line_pattern = re.compile("^(.*):(.*): (.*) ==> (.*)$") > + output_template = "(\"{0}\", \"{2}\"),\t# line {1}, \"{3}\"?" > + > + parser = argparse.ArgumentParser(description="Check spelling") > + parser.add_argument( > + "dir", > + help="Path to source directory. " > + "Defaults to parent directory of this script", > + type=os.path.realpath, > + nargs='?') > + parser.add_argument( > + "-i", > + "--ignore", > + help="File to ignore. Can be specified more than once", > + metavar="FILE", > + default=list(), > + action="append") This is IMO pointless to have as an option, as we have an internal list of ignore files and user it's not used with > + parser.add_argument( > + "--ignore-untracked", > + help="Ignore all files not tracked by git", > + action="store_true") IMO this option should not exist at all, there's no good reason to check files which are not tracked by git. > + args = parser.parse_args() > + > + if not args.dir: > + args.dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__))) > + > + if args.ignore_untracked: > + args.ignore.extend(subprocess.check_output( > + ["git", "-C", args.dir, "ls-files", "--others"], > + universal_newlines=True).split("\n")) > + > + try: > + process = subprocess.run( > + [ > + "codespell", > + args.dir, > + "--config", > + os.path.join(args.dir, ".codespellrc")], > + stdout=subprocess.PIPE, > + stderr=subprocess.PIPE, > + universal_newlines=True) > + except FileNotFoundError: > + exit("error: codespell not found") > + if process.returncode not in (0, 65): > + exit("error: unexpected returncode %s" % process.returncode) > + > + if process.stderr: > + exit("error: unexpected output to stderr: \"%s\"" % process.stderr) > + > + findings = 0 > + for line in process.stdout.split("\n"): > + line = line.strip().replace(args.dir, "").lstrip("/") > + if not line: > + continue > + > + match = line_pattern.match(line) > + if not match: > + exit("error: unexpected line: \"%s\"" % line) > + > + if match.group(1) in args.ignore or ignore(*match.groups()): > + continue > + > + print(output_template.format(*match.groups())) > + findings += 1 > + > + if findings: > + exit("error: %s spelling errors" % findings) > + I have a rather substantial problem with runtime of this contraption on my box a pure 'ninja test' takes a bit below 4 seconds ... real 0m3.604s user 0m34.187s sys 0m10.662s Now a pure codespell invocation with the arguments used here takes more than 4,2 seconds. In case it's run as part of the test suite it doubles the execution time. That is IMO not acceptable. We should be using the 'skip' pattern a bit more, either through generating a config file or commandline arguments. This will also allow to ignore the untracked files directly, by putting them into the ignore pattern, rather than filtering them out. I also propose to ignore all of 'tests/' there's simply too much noise and this would waste too many cpu cycles. With the following ignore pattern: codespell -L msdos,inout,hda,wan -S .git,po,tests,vbox_CAPI*,*.js,polkit,*.Dockerfile,keycodemapdb,cpu_map The analysis is now down to 1.4 seconds from 4.2 and also the invocation via ninja test is now bearable again. IMO it's not worth to fiddle with tests/ to try to include the source files and the time saving is immense. The above exclude pattern also fixes what I've complained above. codespell is matching either full absolute paths or a subdirectory name but it can't contain a '/' if it's relative.