Re: [PATCH v3 4/6] module: script to generate offset ranges for builtin modules

Masahiro Yamada <masahiroy@xxxxxxxxxx> · Tue, 21 May 2024 01:53:27 +0900

On Fri, May 17, 2024 at 1:31 PM Kris Van Hees <kris.van.hees@xxxxxxxxxx> wrote:
>
> The offset range data for builtin modules is generated using:
>  - modules.builtin.modinfo: associates object files with module names
>  - vmlinux.map: provides load order of sections and offset of first member
>     per section
>  - vmlinux.o.map: provides offset of object file content per section
>  - .*.cmd: build cmd file with KBUILD_MODFILE and KBUILD_MODNAME
>
> The generated data will look like:
>
> .text 00000000-00000000 = _text
> .text 0000baf0-0000cb10 amd_uncore
> .text 0009bd10-0009c8e0 iosf_mbi
> ...
> .text 008e6660-008e9630 snd_soc_wcd_mbhc
> .text 008e9630-008ea610 snd_soc_wcd9335 snd_soc_wcd934x snd_soc_wcd938x
> .text 008ea610-008ea780 snd_soc_wcd9335
> ...
> .data 00000000-00000000 = _sdata
> .data 0000f020-0000f680 amd_uncore
>
> For each ELF section, it lists the offset of the first symbol.  This can
> be used to determine the base address of the section at runtime.
>
> Next, it lists (in strict ascending order) offset ranges in that section
> that cover the symbols of one or more builtin modules.  Multiple ranges
> can apply to a single module, and ranges can be shared between modules.
>
> Signed-off-by: Kris Van Hees <kris.van.hees@xxxxxxxxxx>
> Reviewed-by: Nick Alcock <nick.alcock@xxxxxxxxxx>
> ---
> Changes since v2:
>  - 1st arg to generate_builtin_ranges.awk is now modules.builtin.modinfo
>  - Switched from using modules.builtin.objs to parsing .*.cmd files
>  - Parse data from .*.cmd in generate_builtin_ranges.awk
> ---
>  scripts/generate_builtin_ranges.awk | 232 ++++++++++++++++++++++++++++
>  1 file changed, 232 insertions(+)
>  create mode 100755 scripts/generate_builtin_ranges.awk
>
> diff --git a/scripts/generate_builtin_ranges.awk b/scripts/generate_builtin_ranges.awk
> new file mode 100755
> index 0000000000000..6975a9c7266d9
> --- /dev/null
> +++ b/scripts/generate_builtin_ranges.awk
> @@ -0,0 +1,232 @@
> +#!/usr/bin/gawk -f
> +# SPDX-License-Identifier: GPL-2.0
> +# generate_builtin_ranges.awk: Generate address range data for builtin modules
> +# Written by Kris Van Hees <kris.van.hees@xxxxxxxxxx>
> +#
> +# Usage: generate_builtin_ranges.awk modules.builtin.modinfo vmlinux.map \
> +#              vmlinux.o.map > modules.builtin.ranges
> +#
> +
> +BEGIN {
> +       # modules.builtin.modinfo uses \0 as record separator
> +       # All other files use \n.
> +       RS = "[\n\0]";
> +}

Why do you want to parse modules.builtin.modinfo
instead of modules.builtin?

modules.builtin uses \n separator.

> +
> +# Return the module name(s) (if any) associated with the given object.
> +#
> +# If we have seen this object before, return information from the cache.
> +# Otherwise, retrieve it from the corresponding .cmd file.
> +#
> +function get_module_info(fn, mod, obj, mfn, s) {

There are 5 arguments, while the caller passes only 1 argument
( get_module_info($4) )

> +       if (fn in omod)
> +               return omod[fn];
> +
> +       if (match(fn, /\/[^/]+$/) == 0)
> +               return "";
> +
> +       obj = fn;
> +       mod = "";
> +       mfn = "";
> +       fn = substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd";
> +       if (getline s <fn == 1) {
> +               if (match(s, /DKBUILD_MODNAME=[^ ]+/) > 0) {
> +                       mod = substr(s, RSTART + 17, RLENGTH - 17);
> +                       gsub(/['"]/, "", mod);
> +                       gsub(/:/, " ", mod);
> +               }
> +
> +               if (match(s, /DKBUILD_MODFILE=[^ ]+/) > 0) {
> +                       mfn = substr(s, RSTART + 17, RLENGTH - 17);
> +                       gsub(/['"]/, "", mfn);
> +                       gsub(/:/, " ", mfn);
> +               }
> +       }
> +       close(fn);
> +
> +# tmp = $0;
> +# $0 = s;
> +# print mod " " mfn " " obj " " $NF;
> +# $0 = tmp;
> +
> +       # A single module (common case) also reflects objects that are not part
> +       # of a module.  Some of those objects have names that are also a module
> +       # name (e.g. core).  We check the associated module file name, and if
> +       # they do not match, the object is not part of a module.

You do not need to use KBUILD_MODNAME.

Just use KBUILD_MODFILE only.
If the same path is found in modules.builtin,
it is a built-in module.

Its basename is modname.

One more question in a corner case.

How does your code work when an object is shared
by multiple modules?

For example, set
  CONFIG_EDAC_SKX=y
  CONFIG_EDAC_I10NM=y

How is the address range of drivers/edac/skx_common.o handled?

There are 4 possibilities.

 - included in skx_edac
 - included in i10nm_edac
 - included in both of them
 - not included in any of them

The correct behavior should be "included in both of them".

How does your code work?

> +       if (mod !~ / /) {
> +               if (!(mod in mods))
> +                       return "";
> +               if (mods[mod] != mfn)
> +                       return "";
> +       }
> +
> +       # At this point, mod is a single (valid) module name, or a list of
> +       # module names (that do not need validation).
> +       omod[obj] = mod;
> +       close(fn);
> +
> +       return mod;
> +}
> +
> +FNR == 1 {
> +       FC++;
> +}
> +
> +# (1-old) Build a mapping to associate object files with built-in module names.
> +#
> +# The first file argument is used as input (modules.builtin.objs).
> +#
> +FC == 1 && old_behaviour {
> +       sub(/:/, "");
> +       mod = $1;
> +       sub(/([^/]*\/)+/, "", mod);
> +       sub(/\.o$/, "", mod);
> +       gsub(/-/, "_", mod);
> +
> +       if (NF > 1) {
> +               for (i = 2; i <= NF; i++) {
> +                       if ($i in mods)
> +                               mods[$i] = mods[$i] " " mod;
> +                       else
> +                               mods[$i] = mod;
> +               }
> +       } else
> +               mods[$1] = mod;
> +
> +       next;
> +}

Please remove the old code.

> +# (1) Build a lookup map of built-in module names.
> +#
> +# The first file argument is used as input (modules.builtin.modinfo).
> +#
> +# We are interested in lines that follow the format
> +#     <modname>.file=<path>
> +# and use them to record <modname>
> +#
> +FC == 1 && /^[^\.]+.file=/ {
> +       gsub(/[\.=]/, " ");
> +# print $1 " -> " $3;
> +       mods[$1] = $3;
> +       next;
> +}

I guess parsing module.builtin will be simpler.

> +
> +# (2) Determine the load address for each section.
> +#
> +# The second file argument is used as input (vmlinux.map).
> +#
> +# Since some AWK implementations cannot handle large integers, we strip of the
> +# first 4 hex digits from the address.  This is safe because the kernel space
> +# is not large enough for addresses to extend into those digits.
> +#
> +FC == 2 && /^\./ && NF > 2 {
> +       if (type)
> +               delete sect_addend[type];
> +
> +       if ($1 ~ /percpu/)
> +               next;
> +
> +       raw_addr = $2;
> +       addr_prefix = "^" substr($2, 1, 6);
> +       sub(addr_prefix, "0x", $2);
> +       base = strtonum($2);
> +       type = $1;
> +       anchor = 0;
> +       sect_base[type] = base;
> +
> +       next;
> +}
> +
> +!type {
> +       next;
> +}
> +
> +# (3) We need to determine the base address of the section so that ranges can
> +# be expressed based on offsets from the base address.  This accommodates the
> +# kernel sections getting loaded at different addresses than what is recorded
> +# in vmlinux.map.
> +#
> +# At runtime, we will need to determine the base address of each section we are
> +# interested in.  We do that by recording the offset of the first symbol in the
> +# section.  Once we know the address of this symbol in the running kernel, we
> +# can calculate the base address of the section.
> +#
> +# If possible, we use an explicit anchor symbol (sym = .) listed at the base
> +# address (offset 0).
> +#
> +# If there is no such symbol, we record the first symbol in the section along
> +# with its offset.
> +#
> +# We also determine the offset of the first member in the section in case the
> +# final linking inserts some content between the start of the section and the
> +# first member.  I.e. in that case, vmlinux.map will list the first member at
> +# a non-zero offset whereas vmlinux.o.map will list it at offset 0.  We record
> +# the addend so we can apply it when processing vmlinux.o.map (next).
> +#
> +FC == 2 && !anchor && raw_addr == $1 && $3 == "=" && $4 == "." {
> +       anchor = sprintf("%s %08x-%08x = %s", type, 0, 0, $2);
> +       sect_anchor[type] = anchor;
> +
> +       next;
> +}
> +
> +FC == 2 && !anchor && $1 ~ /^0x/ && $2 !~ /^0x/ && NF <= 4 {
> +       sub(addr_prefix, "0x", $1);
> +       addr = strtonum($1) - base;
> +       anchor = sprintf("%s %08x-%08x = %s", type, addr, addr, $2);
> +       sect_anchor[type] = anchor;
> +
> +       next;
> +}
> +
> +FC == 2 && base && /^ \./ && $1 == type && NF == 4 {
> +       sub(addr_prefix, "0x", $2);
> +       addr = strtonum($2);
> +       sect_addend[type] = addr - base;
> +
> +       if (anchor) {
> +               base = 0;
> +               type = 0;
> +       }
> +
> +       next;
> +}
> +
> +# (4) Collect offset ranges (relative to the section base address) for built-in
> +# modules.
> +#
> +FC == 3 && /^ \./ && NF == 4 && $3 != "0x0" {
> +       type = $1;
> +       if (!(type in sect_addend))
> +               next;

This assumes sections are 1:1 mapping
between vmlinux.o and vmlinux.

How far does this assumption work?

CONFIG_LD_DEAD_CODE_DATA_ELIMINATION will not work
at least.

As I said in the previous review,
gawk is not documented in Documentation/process/changes.rst

Please add it if you go with gawk.

> +
> +       sub(addr_prefix, "0x", $2);
> +       addr = strtonum($2) + sect_addend[type];
> +
> +       mod = get_module_info($4);
> +# printf "[%s, %08x] %s [%s] %08x\n", mod_name, mod_start, $4, mod, addr;
> +       if (mod == mod_name)
> +               next;
> +
> +       if (mod_name) {
> +               idx = mod_start + sect_base[type] + sect_addend[type];
> +               entries[idx] = sprintf("%s %08x-%08x %s", type, mod_start, addr, mod_name);
> +               count[type]++;
> +       }
> +# if (mod == "")
> +# printf "ENTRY WITHOUT MOD - MODULE MAY END AT %08x\n", addr
> +
> +       mod_name = mod;
> +       mod_start = addr;
> +}
> +
> +END {
> +       for (type in count) {
> +               if (type in sect_anchor)
> +                       entries[sect_base[type]] = sect_anchor[type];
> +       }
> +
> +       n = asorti(entries, indices);
> +       for (i = 1; i <= n; i++)
> +               print entries[indices[i]];
> +}
> --
> 2.43.0
>

--
Best Regards

Masahiro Yamada