(Removing linux-fsdevel from the cc list since this is an ext4
specific issue.)

On Mon, Mar 28, 2022 at 09:38:18PM +0530, Fariya F wrote:
> Hi Ted,
>
> Thanks for the response. Really appreciate it. Some questions:
>
> a) This issue is observed on one of the customer board and hence a fix
> is a must for us or at least I will need to do a work-around so other
> customer boards do not face this issue. As I mentioned my script
> relies on df -h output of used percentage. In the case of the board
> reporting 16Z of used space and size, the available space is somehow
> reported correctly. Should my script rely on available space and not
> on the used space% output of df. Will that be a reliable work-around?
> Do you see any issue in using the partition from then or some where
> down the line the overhead blocks number would create a problem and my
> partition would end up misbehaving or any sort of data loss could
> occur? Data loss would be a concern for us. Please guide.

I'm guessing that the problem was caused by a bit-flip in the
superblock, so it was just a matter of hardware error.  What version
of e2fsprogs are you using, and did you have the metadata checksum
(metadata_csum) feature enabled?  Depending on where the bit-flip
happened --- e.g., whether it was in memory and then the superblock
was written out, or on the eMMC or other storage device --- the
metadata checksum feature might have detected the superblock
corruption.  While that would have required a manual fsck to fix, at
that point fsck would have fallen back to using the backup superblock.

> b) Any other suggestions of a work-around so even if the overhead
> blocks reports more blocks than actual blocks on the partition, i am
> able to use the partition reliably or do you think it would be a
> better suggestion to wait for the fix in e2fsprogs?
>
> I think apart from the fix in e2fsprogs tool, a kernel fix is also
> required, wherein it performs check that the overhead blocks should
> not be greater than the actual blocks on the partition.

Yes, we can certainly have the kernel check to see if the overhead
value is completely insane, and if so, recalculate it (even though it
would slow down the mount).

Another thing we could do is to always recalculate the overhead amount
if the file system is smaller than some arbitrary size, on the theory
that (a) for small file systems, the increased time to mount the file
system will not be noticeable, and (b) embedded and mobile devices are
where "cost optimized" components (my polite way of saying crappy
quality to save a penny or two in Bill of Materials costs) are most
likely to be found, and so those are where bit flips are more likely.

Cheers,

					- Ted
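
For illustration only, the two mount-time checks discussed above could be sketched roughly as follows.  This is not actual ext4 or e2fsprogs code: the function names and the SMALL_FS_BLOCKS threshold are hypothetical, and the sketch assumes the reported file system size is the block count minus the overhead, computed in unsigned 64-bit arithmetic, which would be one way a corrupted overhead value turns into an absurd size such as 16Z.

```python
# Hypothetical sketch of the sanity checks described in the message.
# None of these names come from the kernel; they are for illustration.

U64_MASK = (1 << 64) - 1

# Assumed cutoff below which recomputing the overhead at mount time is
# cheap enough to always do. The real value would be a policy decision.
SMALL_FS_BLOCKS = 1 << 20


def reported_blocks(blocks_count: int, overhead: int) -> int:
    """Size as it would be reported if computed as an unsigned 64-bit
    subtraction; wraps to a huge value when overhead > blocks_count."""
    return (blocks_count - overhead) & U64_MASK


def overhead_is_sane(blocks_count: int, overhead: int) -> bool:
    """The overhead can never meet or exceed the total number of blocks."""
    return 0 <= overhead < blocks_count


def should_recalculate(blocks_count: int, overhead: int,
                       small_fs_blocks: int = SMALL_FS_BLOCKS) -> bool:
    """Recalculate if the stored overhead is insane, or if the file
    system is small enough that recalculating costs almost nothing."""
    if not overhead_is_sane(blocks_count, overhead):
        return True
    return blocks_count < small_fs_blocks


if __name__ == "__main__":
    # A stored overhead larger than the file system itself (e.g. after
    # a bit flip) wraps the reported size around.
    print(reported_blocks(1000, 5000))
    print(should_recalculate(1000, 5000))
```

With the stored value only trusted when it passes overhead_is_sane(), a flipped bit in the overhead field would cost a slower mount rather than a nonsensical size in df.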