On 11/10/18 1:03 AM, Juergen Gross wrote: > > How would that help? The garabge data written could have the correct > terminal sentinel value by chance. > > That's why I re-used an existing field in setup_header (the version) to > let grub tell the kernel which part of setup_header was written by grub. > > That's the only way I could find to let the kernel distinguish between > garbage and actual data. There is plenty of space *before* the setup_header part of struct boot_params too -- look a the various __pad fields, especially (in your case), __pad3[16] and __pad4[116] would suit the bill just fine. >> It would be enormously helpful if you could find out any more details about >> exactly what they are doing to break things. > > That's easy: > > The memory layout is: > > 0x1f1 bytes of data, including the sentinel, the setup_header, and then > more data. > > grub did read the kernel's setup_header in the correct size into its > buffer (which contains random garbage before that), intializes the first > 0x1f1 including the sentinel byte, and then writes back the buffer, but > using a too large length for that. Are you kidding me... it really overwrites it with completely random data, and not simply overspilling contents of the file? In that case it might not be possible (or desirable) to use those N bytes following the setup_heaader, or we need to a bigger sentinel than one byte (probability being what it is, 256^n gets to be a pretty big number for any n, very quickly drowning in the noise compared to other potential sources of boot failures, and most likely less fatal than most.) How big is this garbage dump? I'm going to brave a guess it is 512 bytes. In that case this is hardly a big deal: the E820 map begins at 0x290, and the setup_header maximum goes to 0x280, so it is only 15 bytes lost. If it is worse than that, we would risk losing __pad8[48] and __pad9[276], and especially the latter would be painful. In those case perhaps we should use 0x281..0x290 as a 15-byte sentinel; that is going to be virtually foolproof. I'm also thinking that it might be desirable to add a flags field (__pad2 would be ideal) to struct boot_params; it would let us recycle some of the obsolete fields (hd0_info, hd1_info, sys_desc_table, olpc_ofw_header, ...) and perhaps be able to add some more robustness against these sort of things. This would be the right way to do what your version feedback mechanism would do. The reason why the feedback mechanism is fundamentally broken is that it only gives the boot loader a way to assert that it supports a certain version of the protocol, but it doesn't say *which* bootloader does such an assert -- and therefore it is still wide open to implementation error. We do, in fact, already have a feedback mechanism: the bootloader ID and bootloader version. One way we could deal with this problem is to bump the bootloader version reported by Grub, and add a quirk to the kernel that if the bootloader ID is Grub (7) and the version is less than a certain number, zero those fields. In fact, the more I think about it, this is what we should do. That being said, Grub really needs to be kept honest. They keep making both severe design (like refusing to use the BIOS and UEFI entry points provided by the kernel by default) and implementation errors, apparently without meaningful oversight. I really ask that the people more closely involved with Grub try to keep a closer eye on their code as it applies to Linux. -hpa