On 1/27/22 7:10 AM, Shung-Hsi Yu wrote:
Hi,

We recently ran into a module load failure related to split BTF on openSUSE Tumbleweed[1], which I believe may also happen on other rolling distros.

The error looks like the following (though the failure is not limited to ipheth):

    BPF:[103111] STRUCT
    BPF:size=152 vlen=2
    BPF:
    BPF:Invalid name
    BPF:
    failed to validate module [ipheth] BTF: -22

The error comes down to trying to load the BTF of *kernel modules from a different build* than the running kernel (built from the same source), where the base BTF of the two builds differs.

While it may be a stretch to call this a bug, solving it might make BTF adoption easier. I'd naively think that we could further split the base BTF into two parts to avoid this issue, where .BTF contains only exported types, and the other part (still residing in vmlinux) holds the unexported types.
What are the exported types? The types used by exported symbols? This will certainly increase BTF handling complexity.
Does that sound like something reasonable to work on?

## Root cause (in case anyone is interested in a verbose version)

On openSUSE Tumbleweed there can be several builds of the same source. Since the source is the same, the binaries are simply replaced when a package with a larger build number is installed during an upgrade.

In our case, a rebuild was triggered[2] and resulted in changes to the base BTF. More precisely, the BTF_KIND_FUNC{,_PROTO} entries for i2c_smbus_check_pec(u8 cpec, struct i2c_msg *msg) and inet_lhash2_bucket_sk(struct inet_hashinfo *h, struct sock *sk) were added to the base BTF of 5.15.12-1.3. Those functions were previously missing from the base BTF of 5.15.12-1.1.
As stated in [2] below, I think we should understand why the rebuild is triggered. If a rebuild of vmlinux is triggered, why can't the modules be rebuilt at the same time?
The added entries in the BTF type and string tables shift the type IDs and string offsets in the base BTF. As a result, the same type ID may refer to a totally different type, and the same goes for the name_off of types.

When users on build#1 (i.e. 5.15.12-1.1) install build#3 (i.e. 5.15.12-1.3) and then try to load a kernel module, they will be loading a build#3 module on a build#1 kernel. With the base BTF of the two builds being different, the name_off of some types ends up pointing at an invalid string, and the kernel bails out.

Best,
Shung-Hsi Yu

1: https://bugzilla.opensuse.org/show_bug.cgi?id=1194501
2: my guess is that the rebuild was triggered by a compiler toolchain update, but I wasn't able to pin down exactly what changed
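
To make the offset shift described above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the string tables are tiny made-up stand-ins (not the real 5.15.12 data), the helper names `build_strtab` and `resolve` are hypothetical, and the lookup merely imitates the idea that a split-BTF name_off indexes into the base string section followed by the module's own strings. It is not the kernel's validation code.

```python
def build_strtab(names, leading_nul=True):
    """Build a NUL-separated string section and record each name's offset."""
    blob = bytearray(b"\0" if leading_nul else b"")
    offsets = {}
    for name in names:
        offsets[name] = len(blob)
        blob += name.encode("ascii") + b"\0"
    return bytes(blob), offsets


def resolve(base_blob, module_blob, name_off):
    """Look up name_off in base strings followed by module strings (toy model)."""
    combined = base_blob + module_blob
    if name_off >= len(combined):
        return None  # offset points past the end of the string data
    end = combined.index(b"\0", name_off)
    return combined[name_off:end].decode("ascii")


# Hypothetical base string tables: build#3 gained one extra function name.
base1, _ = build_strtab(["sock", "i2c_msg"])
base3, base3_off = build_strtab(["sock", "i2c_smbus_check_pec", "i2c_msg"])

# Hypothetical module strings; the module was built against build#3.
module, module_off = build_strtab(["ipheth_device"], leading_nul=False)
off_base_name = base3_off["i2c_msg"]                        # refers to a base string
off_module_name = len(base3) + module_off["ipheth_device"]  # refers to a module string

# Matching base: both offsets resolve to the intended names.
on_build3 = (resolve(base3, module, off_base_name),
             resolve(base3, module, off_module_name))

# Mismatched base: the same offsets land on unrelated bytes or out of range.
on_build1 = (resolve(base1, module, off_base_name),
             resolve(base1, module, off_module_name))

print("build#3 kernel:", on_build3)
print("build#1 kernel:", on_build1)
```

In this toy setup the build#3 lookups return the intended names, while on the build#1 base the first offset lands in the middle of an unrelated string and the second falls outside the string data entirely. Type IDs shift in the same way when entries are inserted into the base type table.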