On 8 December 2017 at 09:15, Andrew Worsley <amworsley@xxxxxxxxx> wrote: > Thanks. I regularly see the warning my patch prints out on boot up as then > there is a big pull of random data from the TPM. Just an update on this bug - I did a i2c capture with pulseview on a boot that included this issue. > On 8 Dec 2017 12:30 am, "Harini Katakam" <harinikatakamlinux@xxxxxxxxx> > wrote: >> >> Hi Wolfram and Andrew, >> >> On Thu, Dec 7, 2017 at 4:19 PM, Wolfram Sang <wsa@xxxxxxxxxxxxx> wrote: >> > >> >> > I attach a patch of the changes to the i2c driver showing my debug in >> >> > this driver. >> >> > As I said I am happy to send more verbose bug output - I have about >> >> > 36 >> >> > odd runs with 9 occurences of the bug. >> >> > Once it happen twice in the one run >> >> > >> >> > I include the crashing line from the runs where the bug occured (I >> >> > added more diagnostics as the runs progressed): >> >> >> >> I'll check with out HW team as well and check your flow to >> >> see if we can reproduce it; will also check if the existing errata >> >> has larger scope. >> >> I'll let you know if we require further assistance with debug >> >> next week. >> > >> > Any news here? >> >> Sorry for the delay. >> I've checked the known errata and confirmed that it wont >> affect this use case. And we haven't been able to hit this >> issue on our HW (with eeprom slave devices). >> I will add my colleague looking into this further to debug. >> >> Regards, >> Harini I suspect a well behaved i2c slave may never show the issue. But the i2c cadence doesn't handle bad slaves safely so I suggest that length check is needed to avoid buffer kernel crashes due to mis-behaving i2c hardware. I got a board that this issue was *very* bad to the point that the TPM_RANDOM requests repeatedly returned the same data and so were failing the sanity tests. So I captured the i2c transactions using pulseview and very bad behaviour from the TPM. The problem disappears at 100KHz and very reliably occurs at 400k on this individual board so I believe this TPM is failing to keep up with the i2c clock and screwying everything up. What I am concerned about is that the i2c-cadence driver is almost always carrying on and returning data with out errors. In particular I see repeated starts and long periods where the data line is held low and NO Bus Arbitration or Time out errors. In particular I can get repeated TPM_RANDOM requests to the TPM returning the exact same data until another i2c operation to a different chip (RTC) which works fine at this 400kHz. So to summarize most (all?) other boards work at this 400kHz speed so I definitely have a marginal TPM chip on this board. But why doesn't the i2c-cadence HW pick up these faults? The i2c transactions decoded by pulseview show a a real mess but i2c-cadence is bravely pulling data out of it. I will attempt to append a few images of i2c traces - to illustrate the mess that happens but that the i2c cadence hardware is very rarely reporting any problems - but returns invalid data. In about 15+ boots I got one Bus arbitration. After looking at the i2c traces it is not surprising that the cadence i2c hardware can become confused and return way more data than was requested. As the buffer overrun can cause a crash in a different routine (after the i2c driver has finished) I think it is prudent to keep some sort of check on the data size returned. Andrew read-overview image is the large TPM read I suspect where things go wrong showing lots of i2c bursts on one transactions with the clock+data held low between them and actually ends with the data held low. It is not fixed until *much* later way past the end of that image. The read-start image is the very start of this long transaction showing the read to TPM at address 0x29 - but the pulseview decode is correctly reporting repeated starts although it looks like the master isn't detecting it - there are many many occurrences of these sorts of problems
Attachment:
tpm-random-read-overview.png
Description: PNG image
Attachment:
tpm-random-read-start.png
Description: PNG image