On Sun, Dec 06, 2020 at 10:15:27AM -0500, Theodore Y. Ts'o wrote:
> On Sun, Dec 06, 2020 at 02:44:16PM +0000, Colin Watson wrote:
> > Now that I look at it more closely, some of the changes to
> > clean_grub_dir_real look suspicious:
> >
> > +         char *srcf = grub_util_path_concat (2, di, de->d_name);
> > +
> > +         if (mode == CREATE_BACKUP)
> > +           {
> > +             char *dstf = grub_util_path_concat_ext (2, di, de->d_name, "-");
> > +             if (grub_util_rename (srcf, dstf) < 0)
> > +               grub_util_error (_("cannot backup `%s': %s"), srcf,
> > +                                grub_util_fd_strerror ());
> > +             free (dstf);
> > +           }
>
> ... however, if I'm understanding the code correctly, this is the
> codepath used to create the backup file (e.g., the previous version of
> boot.img).  So shouldn't there be a "boot.img" file in
> /boot/grub/i386-pc which would be the newly installed version of that
> file, and so the system would actually be booting correctly?

Not quite.  What's described here as the "backup/restore" thing is used
as follows:

 * rename old modules aside as a backup
 * do the rest of the installation (writing to the MBR or similar, as
   well as copying in new modules)
 * if installation succeeds, remove the backup files
 * if installation fails, then:
   * remove the newly-created modules
   * move the backup files back into place

But if the restored file names are computed wrongly, then this leaves
the system in a bad state as Paul described.

I don't know why Dimitri chose to explicitly remove the new files first
rather than just renaming over the top and then removing any leftovers
at the end; that seems unnecessarily risky.  Though this is code that's
apparently supposed to work on Windows as well, and the MoveFile
function that's used to implement grub_util_rename there requires the
destination file not to exist (sigh), so maybe it had something to do
with that.

> Essentially, there are three possibilities:
>
> 1) A hardware corruption which corrupted the directory.
>
> 2) A kernel bug which corrupted the directory.
>
> 3) The file system isn't actually corrupted, but the filename with the
>    random garbage in the filename was created because a userspace
>    application so requested it.
>
> The fact that all of the filenames have a similar pattern of
> corruption to them would tend to rule out #1.  And the fact that
> e2fsck didn't notice any other corruptions would tend to argue against
> #1 and #2.  So #3 does seem to be the most likely.

Yep.

-- 
Colin Watson (he/him)                              [cjwatson@xxxxxxxxxx]
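
For illustration only, the backup/restore sequence listed in the reply
above could be sketched roughly like this.  This is not GRUB's actual
code: backup_name, restored_name, create_backup and restore_backup are
hypothetical names, and the standard C rename()/remove() stand in for
grub_util_rename and friends.  The only detail taken from the quoted
patch is the "-" suffix on backup files.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: "boot.img" -> "boot.img-".  Caller frees. */
char *backup_name (const char *name)
{
  size_t len = strlen (name);
  char *out = malloc (len + 2);        /* room for '-' and NUL */
  if (!out)
    return NULL;
  memcpy (out, name, len);
  out[len] = '-';
  out[len + 1] = '\0';
  return out;
}

/* Hypothetical inverse: "boot.img-" -> "boot.img".  Returns NULL if
   the argument does not look like a backup name.  Caller frees. */
char *restored_name (const char *backup)
{
  size_t len = strlen (backup);
  char *out;
  if (len < 2 || backup[len - 1] != '-')
    return NULL;
  out = malloc (len);                  /* len - 1 characters plus NUL */
  if (!out)
    return NULL;
  memcpy (out, backup, len - 1);
  out[len - 1] = '\0';
  return out;
}

/* Step 1: rename an old module aside as a backup. */
int create_backup (const char *path)
{
  char *dst = backup_name (path);
  int rc;
  if (!dst)
    return -1;
  rc = rename (path, dst);
  free (dst);
  return rc;
}

/* Failure path: remove the newly-created file, then move the backup
   back into place.  Removing first mirrors the ordering described
   above; it also avoids relying on rename() replacing an existing
   destination, which is not guaranteed on every platform. */
int restore_backup (const char *backup_path)
{
  char *dst = restored_name (backup_path);
  int rc;
  if (!dst)
    return -1;
  remove (dst);                        /* ignore errors: may not exist */
  rc = rename (backup_path, dst);
  free (dst);
  return rc;
}
```

The point of the sketch is that restored_name has to be the exact
inverse of backup_name; if it isn't, the failure path moves the backups
to the wrong names and the original modules are effectively lost.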