On Wed, Oct 23, 2019 at 7:16 PM Pali Rohár <pali.rohar@xxxxxxxxx> wrote:
> On Wednesday 23 October 2019 16:21:19 Chris Murphy wrote:
> > I don't know either or how to confirm it.
>
> Somebody who is watching linux-fsdevel and has deep knowledge in this
> area... could provide more information.

Maybe dm-log-writes can do this? Just log all the writes, and
hopefully it's straightforward to match the 'mv' rename command with
the resulting writes.

> > Nice in theory, but in practice the user simply reboots, and screams
> > WTF out loud if the system face plants. And people wonder why things
> > are still broken 20 years later with all the same kinds of problems
> > and prescriptions to boot off some rescue media instead of it being
> > fail safe by design. It's definitely not fail safe to have a kernel
> > update that could possibly result in an unbootable system. I can't
> > think of any ordinary server, cloud, desktop, mobile user who wants to
> > have to boot from rescue media to do a simple repair. Of course they
> > all just want to reboot and have the right thing always happen no
> > matter what, otherwise they get so nervous about doing updates that
> > they postpone them longer than they should.
>
> Still, in any time when you improperly unmount filesystem you should
> check for error, if you do not want to lose your data.

Perhaps, but it's archaic. The user usually has no idea what went
wrong, and all kinds of factors strongly disincentivize doing an
offline fsck, and incentivize just rebooting and seeing what happens.
If they get past the bootloader, systemd/init is going to run an fsck
on all volumes that need it, or kernel code does log replay to make
them up to date.

> And critical area should have some "recovery" mechanism to repair broken
> bootloader / kernel image.
>
> Anyway, chance that kernel crashes at step when replacing old kernel
> disk image by new one is low. So it should not be such big issue to need
> to do external recovery.

'strace -D -ff -o' on grub2-mkconfig causes over 1800 PID files to be
generated. Filtering for lines containing grub.cfg...

# grep grub.cfg *
grub.12167:execve("/usr/sbin/grub2-mkconfig", ["grub2-mkconfig", "-o", "/boot/efi/EFI/fedora/grub.cfg"], 0x7ffc68054470 /* 24 vars */) = 0
grub.12167:read(3, "/boot/efi/EFI/fedora/grub.cfg\n", 128) = 30
grub.12167:openat(AT_FDCWD, "/boot/efi/EFI/fedora/grub.cfg.new", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
grub.12167:read(255, "\nif test \"x${grub_cfg}\" != \"x\" ;"..., 8192) = 567
grub.12174:write(1, "/boot/efi/EFI/fedora/grub.cfg\n", 30) = 30
grub.12349:execve("/usr/bin/rm", ["rm", "-f", "/boot/efi/EFI/fedora/grub.cfg.ne"...], 0x55c599fde980 /* 48 vars */) = 0
grub.12349:newfstatat(AT_FDCWD, "/boot/efi/EFI/fedora/grub.cfg.new", 0x556be17d9758, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
grub.12349:unlinkat(AT_FDCWD, "/boot/efi/EFI/fedora/grub.cfg.new", 0) = -1 ENOENT (No such file or directory)
grub.14064:execve("/usr/bin/grub2-script-check", ["/usr/bin/grub2-script-check", "/boot/efi/EFI/fedora/grub.cfg.ne"...], 0x55c599fde980 /* 48 vars */) = 0
grub.14064:openat(AT_FDCWD, "/boot/efi/EFI/fedora/grub.cfg.new", O_RDONLY) = 3
grub.14065:openat(AT_FDCWD, "/boot/efi/EFI/fedora/grub.cfg", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
grub.14065:execve("/usr/bin/cat", ["cat", "/boot/efi/EFI/fedora/grub.cfg.ne"...], 0x55c599fde980 /* 48 vars */) = 0
grub.14065:openat(AT_FDCWD, "/boot/efi/EFI/fedora/grub.cfg.new", O_RDONLY) = 3
grub.14066:execve("/usr/bin/rm", ["rm", "-f", "/boot/efi/EFI/fedora/grub.cfg.ne"...], 0x55c599fde980 /* 48 vars */) = 0
grub.14066:newfstatat(AT_FDCWD, "/boot/efi/EFI/fedora/grub.cfg.new", {st_mode=S_IFREG|0700, st_size=6080, ...}, AT_SYMLINK_NOFOLLOW) = 0
grub.14066:unlinkat(AT_FDCWD, "/boot/efi/EFI/fedora/grub.cfg.new", 0) = 0

I'm not able to parse this. My best guess is it's writing out an all
new file, grub.cfg.new, and then doesn't rename it.
Instead it uses cat to copy the contents of the new file and
overwrites the old one? Yeah, the inode stays the same, as does access
time. Is this fragile?

Android and ChromeOS and some others have A and B kernel partitions
which are just blobs. They use some other form of hint to indicate
which partition is actually used at one time, meaning they can
reliably ensure a failsafe update of the other partition, and sanity
test it, before committing the switch. Crude but effective. Apple
goes so far as to give all of their product firmware the ability to
natively read APFS, which contains the kernel and early boot files. I
have no idea how Windows does kernel or bootloader updates, except
they don't keep the EFI system partition persistently mounted all day
long, the way virtually all Linux distributions do today at /boot/efi.
That does seem guaranteed to result in many dirty flag FAT file system
cleanups. I know I've seen such fix ups in my journal files.

> > > > I'm not sure how to test the following: write kernel and initramfs to
> > > > final locations. And bootloader configuration is written to a temp
> > > > path. Then at the decision moment, rename it so that it goes from temp
> > > > path to final path doing at most 1 sector change. 1 512 byte sector
> > > > is a reasonable number to assume can be completely atomic for a
> > > > system. I have no idea if FAT can do such a 'mv' event with only one
> > > > sector change
> > >
> > > Theoretically it could be possible to implement it for FAT (with more
> > > restrictions), but I doubt that general purpose implementation of any
> > > filesystem in kernel can do such thing. So no practically.
> >
> > Now I'm wondering what the UEFI spec says about this, and whether this
> > problem was anticipated, and how surprised I should be if it wasn't
> > anticipated.
>
> I know that UEFI spec has reference for FAT filesystems to MS
> specification (fatgen103.doc).
> I do not know if it says anything about
> filesystem details, but I guess it specify requirements, that
> implementations must be compatible with FAT12, FAT16 and FAT32 according
> to specification.

My understanding of the UEFI spec is the file system is called the
'EFI file system' and was intended to be predicated on FAT12, FAT16,
FAT32 at a specific moment in time, bugs and warts and all. By now
probably around 20 years ago. And then not ever changed.

In practice it seems there is no such separate thing as the EFI file
system. No separate mkfs flag, or mount options, to make sure this is
*the* canonical EFI file system, rather than just today's latest bug
fixed and feature enhanced FAT file system as supported by Linux. So
god only knows what bugs might arise from that discrepancy one day.

> Also UEFI allows you to write our own UEFI filesystem drivers which
> other UEFI programs and bootloaders can use.

I'm not finding it this second but someone basically did this work
already, by wrapping existing GRUB file system modules into EFI file
system drivers.

OK so plausibly on UEFI, it could be handed a better FAT driver very
soon after POST to avoid firmware FAT bugs. Or for that matter, create
"A" and "B" EFI system partitions, containing identical static boot
data, that merely points to a purpose built $BOOT volume that can host
early boot files and supports atomic updates.

That'd be clever, but also not generic. It's UEFI specific. It'd be
neat to have a superset implementation that can work anywhere. But
then allow for optimizations.

But the problem with the generic solution? Who will follow it? The
Bootloaderspec pretty much fell on deaf ears. The GRUB folks don't
care to upstream it, nor syslinux, nor uboot near as I can tell.
Simple 1 page spec. Fedora's GRUB carries patches for it, and now uses
them by default.
So hilariously Fedora is maybe the first distribution to actively
support three substantially different bootloader update mechanisms:
grub-mkconfig, grubby, and bootloaderspec.

--
Chris Murphy
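
P.S. For contrast with the cat-over-the-old-file behaviour in the
strace output, here is a minimal sketch of the usual "write to a temp
file, then rename" pattern. It is only an illustration and not what
grub2-mkconfig does. The function name replace_atomically and the
".tmp-" prefix are made up for this example. It also assumes a file
system where rename() over an existing file is atomic; whether vfat
can do that with a single sector change is exactly the open question
in this thread, so treat it as the POSIX idea rather than a fix for
/boot/efi.

```python
import os
import tempfile


def replace_atomically(final_path: str, data: bytes) -> None:
    """Write data to a temp file next to final_path, then rename over it.

    Hypothetical helper for illustration. Atomicity of the final step
    depends on the underlying file system's rename() behaviour.
    """
    directory = os.path.dirname(os.path.abspath(final_path))

    # The temp file must live in the same directory (same file system),
    # otherwise the rename cannot be a single metadata operation.
    # Note: mkstemp creates the file with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Push the new contents to stable storage before the rename,
            # so a crash cannot expose a renamed but empty file.
            os.fsync(tmp.fileno())
        # The decision moment: old contents before, new contents after.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # Flush the directory entry change itself.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Unlike the cat approach, this gives the file a new inode, and at no
point is the final path truncated or half written, at least as far as
the file system keeps its rename promise.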