Hi!

To get the work going I set up git repositories with tools and results.
See the document below.

As a follow up I'm going to post the first few patches from step2 (GPL
boiler plate replacement) so you get an idea of what this looks like and
we can discuss how we proceed with review etc.

Thanks,

        tglx

Machine assisted license cleanup
--------------------------------

1. Tools for reproduction:

 1.1 scancode toolkit

     A license scanner tool which can be run from the command line and
     provides excellent parallelisation. While fast, it's recommended to
     run it on a machine with tons of CPUs and tons of memory. A run with
     128 parallel scan threads takes about 15 minutes. Go figure how long
     it will take on your laptop :)

     https://github.com/nexB/scancode-toolkit

 1.2 spdx helper scripts

     A bunch of horrible python scripts with even more horrible shell
     glue.

     git://git.kernel.org/pub/scm/utils/spdx/spdx-utils

     gitweb URL:
     https://git.kernel.org/pub/scm/utils/spdx/spdx-utils.git

     The main workhorse is lcheck.py. I wrote it initially to gather
     statistics and other information, but over time it evolved into a
     swiss army knife. lcheck.py --help gives you the gory details, no
     manpage sorry.

 1.3 git

     The git tools must be available.

     A clean linux tree must be cloned. Ensure that there are no
     artifacts from editing, patch directories etc.

To reproduce the setup (in case you have a big enough machine or lots of
time for thumb twiddling):

  - Install scancode and git. If you need help with scancode talk to
    Philipe.

  - Clone the linux kernel

  - Clone the spdx scripts

  - cd into the spdx scripts directory

  - Invoke the runscript with:

      ./runall.sh path/to/linux/kernel

    The path can be relative or absolute

  - Wait ....

  - Check the results in the stepX directories

  - Check the results in the kernel directory (each step creates a
    branch).

For your convenience:

The spdx-utils repository contains, besides the master branch, a branch
linux-5.1.
It contains:

  - the scancode json files for each step
  - the stats.txt file for each step
  - the rules which are handled in each step
  - the resulting patches

The resulting kernel tree is pushed to:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-spdx.git

Branches step1, step2, step3 contain the steps documented below.

gitweb URL:
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-spdx.git

2. Approach

 The Documentation directory is ignored for now. That needs some extra
 care.

 2.1 Files with no license

     These files have not been touched during the first large sweep.

     2.1.1 Build files

           Make/Kconfig files without license information

     2.1.2 Source files which have only MODULE_LICENSE("GPL") and/or
           EXPORT_SYMBOL_GPL()

     Now that MODULE_LICENSE is clarified this can be tackled.

     The scripts identify these files in the scanner result and add the
     proper license identifier (GPL-2.0-only).

     The scripts generate patches which can be applied with quilt or
     imported into git with 'git quiltimport'.

     SPDX count goes from 22574 to 25712 (44.9%)

 2.2 Files with a single license: GPL-2.0-only or GPL-2.0-or-later

     The scripts handle the following tasks:

     - Find the affected files in the scanner output

     - Generate a list of match rules which represent a unique pattern.

       This is achieved by normalizing the texts (removing formatting,
       white space damage, uppercase / lowercase and punctuation damage).

     - Add the appropriate license header and remove the boiler plate
       text or the license reference.

     - Create a patch series. Each patch contains only the modifications
       for a single match rule. The rule (and any variants of it) is
       saved in the change log of each patch to ease review.

     - Once a reference dataset (compliance data provided by Siemens) is
       available the scripts will also check for conflicts with that
       data set.

     This results in 515 patches at the moment.
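
     To illustrate the normalization idea, here is a minimal sketch. It
     is NOT the code from lcheck.py; the function names normalize() and
     group_by_rule() are made up for this example, and the real scripts
     work on the scancode output rather than on raw strings. It only
     shows how stripping comment decoration, case, punctuation and white
     space lets differently formatted notices collapse into one rule:

```python
import re
import string
from collections import defaultdict

# Hypothetical helper: comment decoration at the start of a line,
# e.g. '/*', ' * ', ' */', '//' or '#'.
_DECORATION = re.compile(r'^\s*(/\*+|\*+/?|//+|#+)\s?')

# Map every ASCII punctuation character to a space.
_PUNCT = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def normalize(text):
    """Reduce a license notice to a canonical string for comparison."""
    lines = [_DECORATION.sub('', line) for line in text.splitlines()]
    flat = ' '.join(lines).lower().translate(_PUNCT)
    # Collapse all runs of white space into single spaces.
    return ' '.join(flat.split())


def group_by_rule(notices):
    """Group file names by normalized notice text.

    'notices' maps a file name to the raw notice found in that file.
    Each key of the result stands for one match rule, i.e. one patch.
    """
    rules = defaultdict(list)
    for fname, text in notices.items():
        rules[normalize(text)].append(fname)
    return dict(rules)
```

     With such a grouping, every key corresponds to one unique pattern
     and the list behind it to the files a single patch would touch.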

     The scripts generate patches which can be applied with quilt or
     imported into git with 'git quiltimport'.

     SPDX count goes from 25712 to 46368 (80.7%)

 2.3 Files with GPL-2.0-only/or-later and Linux-OpenIB

     Basically the same as above just with dual licensing.

     SPDX count goes from 46368 to 46865 (81.9%)

 2.4 More fun later :)

     I have quite a bunch of steps in preparation but let's get the above
     agreed on and reviewed first.