[cc:ing honeycomb-users, didn't think of that earlier]

On Mon, Feb 10, 2020 at 5:16 PM Russell King - ARM Linux admin
<linux@xxxxxxxxxxxxxxx> wrote:
>
> On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote:
> > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
> > <linux@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@xxxxxxx> wrote:
> > > > >
> > > > > Hi Olof,
> > > > >
> > > > > Thanks a lot for your comments!
> > > > > And sorry for my delay respond!
> > > >
> > > > Actually, they apply with only minor conflicts on top of current -next.
> > > >
> > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > enable full use of a promising ARM developer system, the SolidRun
> > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > use with mainline or -next without any additional patches applied --
> > > > which this patchset achieves.
> > > >
> > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > to you and others to determine if that can be done with incremental
> > > > patches on top, or if it should be fixed before the initial patchset
> > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > others of a very interesting platform -- I'm looking to replace my
> > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > before I do.
> > >
> > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > version; I've already had one instance where ext4 failed to mount
> > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > Serror'd, and then I powered it down after a few hours before later
> > > booting it back up.
> > >
> > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > EXT4-fs (nvme0n1p2): error loading journal
> >
> > Hmm, using btrfs on mine, not sure if the exposure is similar or not.
>
> As I understand the problem, it isn't a filesystem issue. It's a data
> integrity issue with the NVMe over power fail, how they cache the data,
> and ultimately write it to the nand flash.
>
> Have a read of:
>
> https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection
>
> As NVMe and SSD are basically the same underlying technology (the host
> interface is different) and the issues I've heard, and now experienced
> with my NVMe, I think the above is a good pointer to the problems of
> flash mass storage.
>
> As I understand it, the problem occurs when the mapping table has not
> been written back to flash, power is lost without the Standby Immediate
> command being sent, and there is no way for the firmware to quickly
> save the table. On subsequent power up, the firmware has to
> reconstruct the mapping table, and depending on how that is done,
> incorrect (old?) data may be returned for some blocks.
>
> That can happen to any blocks on the drive, which means any data can
> be at risk from a power loss event, whether that is a power failure
> or after a crash.

Makes me wonder whether there's some board-level power/reset sequencing
issue, or a problem with one card going down and disabling others. I
haven't read the specs closely enough to know what the expected behavior
is, but I've seen similar issues on other platforms, so take it with a
grain of salt.

> > Do you know if the SErr was due to a known issue and/or if it's
> > something that's fixed in production silicon?
>
> The SError is triggered by something on the PCIe side of things; if I
> leave the Mellanox PCIe card out, then I don't get them.
> The errata patches I have merged into my tree help a bit, turning the
> code from being unable to boot without a SError with the card plugged
> in, to being able to boot and last a while - but the SErrors still
> eventually come, maybe taking a few days... and that's without the
> Mellanox ethernet interface being up.
>
> > (I still can't enable SMMU since across a warm reboot it fails
> > *completely*, with nothing coming up and working. NXP folks, you
> > listening? :)
>
> Is it just a warm reboot? I thought I saw SMMU activity on a cold
> boot as well, implying that there were devices active that Linux
> did not know about.

Yeah, 100% reproducible on warm reboot -- every single time. Not on
cold boot though (100% success rate as far as I remember).

I boot with kernel on NVMe on PCIe, native 1GbE for networking. u-boot
from SD card. This is with the SolidRun u-boot from GitHub.

-Olof