Re: i2c-tools - at24 vs eeprom - decode-dimms fails with the at24 module

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi James,

On Sun, 15 Mar 2020 16:49:05 -0600, James Feeney wrote:
> 1) Simply saying that "the legacy eeprom driver (poorly) auto-detected SPD EEPROMs" is *not* communicating an explanation as to *how* the at24 driver is "better" than the eeprom driver.  So, I'm at a loss there, noting, in contrast, that the eeprom driver "just works".  The at24 driver clearly did not "fix" this "poor auto-detection" behavior.

If you were one of these users who got their Thinkpad laptop killed by
the eeprom driver, you would certainly consider the at24 driver to be
an improvement ;-)

The at24 driver indeed did not fix the auto-detection, instead the
auto-detection was removed from the driver completely by design. In the
Linux device driver model, it is not the driver's job to determine
whether the device is present. Instead, an external source (PCI device
enumeration, OF, user-space...) will create the device, and that's what
will cause the relevant driver to be loaded and have a chance to bind
to the device. The at24 driver fits in that model.

The change is not specific to the eeprom driver. Most drivers inherited
from the lm-sensors project have been converted to support the clean
device driver model. Many still optionally support device
auto-detection, but in some cases that was dropped completely, for
safety or reliability reasons.

Also note that the eeprom driver does not support DDR4 memory modules,
which use a different EEPROM standard compared to previous generations.
So everyone will have to move away from the eeprom driver eventually.

Anyway I agree that the migration from the eeprom driver to the at24
driver is not easy for the users, and I understand that having to move
from "just modprobe" to "obscure machine-specific sysfs magic" can been
seen as a regression from a usability perspective. This is the reason
for my project to re-introduce auto-detection at a different point in
the chain, where it actually belongs. But it will take some time.

I have resubmitted the patches this morning.

> 2) Where i2cdetect says "WARNING! This program can confuse your I2C bus, cause data loss and worse!", and where the document "instantiating-devices" says "Given the huge number of mainboards out there, it is next to impossible to build an exhaustive list of the hardware monitoring chips being used", and "Explicit device instantiation (methods 1 and 2) is much preferred for it is safer", fails to disclose the actual "measure" of "confuse", "unsafe", and "worse".  Again, I am at a loss, since, together, this merely suggests that i2c device auto-detection and enumeration is a "hard" and "unsolved" problem.  Is it really an "unsolved" problem?  i2cdetect seems to "work for me", and I've never seen a problem arise from running sensors-detect.

These warnings are there for as long as I can remember, I believe they
are unnecessarily alarmist nowadays. The reason for their presence is
that we used to have several cases where people have actually killed
pricey hardware by running i2cdetect or sensors-detect. Meanwhile, we
have added workarounds for the known faulty devices, and these machines
are probably no longer used today anyway, but this clearly invited the
developers to be very cautious and to make it clear to the user that
running these tools is not 100% safe and can't be made 100% safe,
unlike "lspci" for example.

Our i2c documentation is being updated at this moment, both format and
contents, so hopefully it will be better soon. If issues remain after
that, we'll look into them.

To answer your immediate questions, I2C only specifies how bits fly
over the wires. It does not specify the meaning of these bits. Our
tools (specifically i2cdetect) is based on the most common use of I2C
transactions, but there is no guarantee that your system does not
include some weird I2C device which interprets the transactions
completely differently. This can go as far as what we consider to be a
read transaction being interpreted as a write transaction by the device.

Sensors-detect runs a huge number of I2C transactions on a large number
of slave addresses. It is by far the most intrusive and dangerous tool
in this respect. Over time I (and maybe others) have added many quirks
to mitigate the risk, but it's still not 100% safe and I have seen
SMBus lock-ups or even instant reboots as a result of running
sensors-detect on some machines even recently. Thankfully no permanent
damage though.

> 3) Noting the disclaimers with "[PATCH 3/4] i2c: smbus: Add a way to instantiate SPD EEPROMs automatically", which has "Only works if all filled slots have the same memory type" and "Only works on systems with 1 to 4 memory slots", I can observe that this patch would not even help, in my case, where the board has 6 memory slots.

This is still work in progress and we will go step by step. The last
thing I want is that this patch causes trouble to dozens of users, gets
reverted, and the whole idea rejected. I prefer to cover half of the
users properly first, so as to prove the viability of the idea. What is
needed to properly support the remaining users can then be investigated.

If the you want to try a version of the patch that would support your
system, I can provide that. I can even provide an out-of-tree version of
the i2c-i801 driver for you to build and test, if you are interested.

> What did you mean by "make it transparent to the users"?  What is "it"?  Are you describing "auto-detection"?  Or, the difficulty of "auto-detection" and the need for manual enumeration?

"it" in that sentence designated the transition from the eeprom driver
to the at24 driver.

> Given the circumstances, the sort of resolutions that come to mind include:
> 
> 1) Provide more "intrusive" documentation for the at24 driver, so that that users will be informed about the need for manual explicit device enumeration with this driver, for instance, presenting the information you provided to me in your email.  How, in the boot process, will you recommend that this manual explicit device enumeration be automated?

I'd rather document it in decode-dimms, because the at24 driver is only
one of 2 drivers that you may have to use (ee1004 for DDR4 being the
other one).

Automation at boot is tricky because it depends on the distribution. I
can't write documentation that will cover all distributions out there,
and more importantly I don't want to have to maintain such
documentation. I believe the user should look for documentation on the
distribution side, as running commands at boot time is a generic need
that most distributions out there already document one way or another.

> 2) Or, incorporate the functionality of i2cdetect into the at24 driver itself, perhaps with a parameter to enable or disable this "feature", *and* explain in detail the actual nature and "measure" of possible corruption from auto-detection - "worse", "unsafe", "confuse".

The functionality of i2cdetect is not enough. i2cdetect tells you which
I2C addresses are responsive. It does not tell you what is the slave
device which responded. This is the core of the problem. If you see a
slave answer at I2C address 0x54, is it a DDR2 or DDR3 SPD EEPROM (that
would need the at24 driver)? A DDR4 SPD EEPROM (that would need the
ee1004 driver)? An AT24RF08 device that risk to get corrupted if you
access it because of a bug in its state machine implementation?

There's a reason for the device driver model and there's a reason why
the at24 driver sticks to that model. Adding auto-detection to the at24
driver itself would be a step back, and clearly in the wrong direction.
Not happening, sorry.

> Not articulating and enumerating instances of "worse", "unsafe", and "confuse", is not a "solution" to the auto-detection problem, but is just avoiding the problem.  I would be better to help the user deal with the issue.

I don't disagree. My take at the problem is to try hard to make it work
out of the box (even better than before, as the proper drivers will get
loaded automatically, unlike "eeprom"). I don't want to waste too much
time documenting a temporary situation.

> Of course, auto-detection sounds hard, and I'm being naive, so forgive me.

You underestimate the horror. It is hard, unsafe, and broken by design.

What should really happen is that the firmware enumerates the memory
slots with all the details we need, so that no detection is needed by
the OS at all. DMI provides some of the information, but not all. ACPI
could (should) do it as well, but doesn't. I'm trying to do the best we
can with the little information we have.

> I would argue that it is better to provide automatic detection and a warning to "check" that the auto-detection "worked" without causing problems, than to provide *no* auto-detection and remain silent.

And you would be wrong, sorry. SPD support at the OS level is nice to
have but certainly not mandatory. So if blindly automating it would
result in even just 0.01 % of systems failing to work properly then
that's not acceptable.

> In some way, proper use of the at24 driver must be easily "discoverable" by the user.  Better to provide a path to follow, than just failing silently, though maybe it is less challenging to just provide better documentation for the at24 driver.

The fact that nothing happens when you load the at24 driver is by
design, that's not going to change.

Better documentation would clearly help, yes. I guess the problem is
that I'm too ashamed by how uneasy it is, to write it black on white.

> To the issue of manual instantiation, I found, after some reading and practice:
> 
> $ sudo i2cdetect -l
> i2c-3   i2c             Radeon i2c bit bus 0x93                 I2C adapter
> i2c-10  i2c             NVIDIA i2c adapter 0 at 3:00.0          I2C adapter
> i2c-1   i2c             Radeon i2c bit bus 0x91                 I2C adapter
> i2c-8   i2c             card0-DP-1                              I2C adapter
> i2c-6   i2c             Radeon i2c bit bus 0x96                 I2C adapter
> i2c-4   i2c             Radeon i2c bit bus 0x94                 I2C adapter
> i2c-11  i2c             NVIDIA i2c adapter 6 at 3:00.0          I2C adapter
> i2c-2   i2c             Radeon i2c bit bus 0x92                 I2C adapter
> i2c-0   i2c             Radeon i2c bit bus 0x90                 I2C adapter
> i2c-9   smbus           SMBus I801 adapter at 0400              SMBus adapter
> i2c-7   i2c             Radeon i2c bit bus 0x97                 I2C adapter
> i2c-5   i2c             Radeon i2c bit bus 0x95                 I2C adapter
> i2c-12  i2c             NVIDIA i2c adapter 10 at 3:00.0         I2C adapter
> 
> $ sudo i2cdetect 9
> WARNING! This program can confuse your I2C bus, cause data loss and worse!
> I will probe file /dev/i2c-9.
> I will probe address range 0x08-0x77.
> Continue? [Y/n]
>      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
> 00:                         08 -- -- -- -- -- -- --
> 10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- 1e --
> 20: 20 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> 30: 30 31 32 -- -- -- -- 37 -- -- -- -- -- -- -- --
> 40: -- -- -- -- 44 -- -- -- -- -- -- -- -- -- -- --
> 50: 50 51 52 53 54 55 -- -- -- -- -- -- -- -- -- --
> 60: -- -- -- -- -- -- -- -- -- 69 -- -- -- -- -- --
> 70: -- -- -- -- -- -- -- --
> 
> $ sudo sh -c 'for i in 0x5{0..5};do echo spd $i > /sys/bus/i2c/devices/i2c-9/new_device;done'
> 
> $ decode-dimms
> ...
> Number of SDRAM DIMMs detected and decoded: 6

Looks good.

> So, that works, but it seems tedious.  Perhaps the same thing could be done by dynamically loading Device Tree Overlays, but that also seems tedious to set up.  I suppose that I would use:
> 
> install at24 /usr/bin/modprobe -r eeprom;/usr/bin/sleep 6;for i in 5{0..5};do [ ! -d /sys/bus/i2c/devices/i2c-9/9-00$i ] && echo spd 0x$i > /sys/bus/i2c/devices/i2c-9/new_device;done;/usr/bin/modprobe --ignore-install at24 $CMDLINE_OPTS
> 
> and
> 
> remove at24 for i in 5{0..5};do [ -d /sys/bus/i2c/devices/i2c-9/9-00$i -a `cat /sys/bus/i2c/devices/i2c-9/9-00$i/name` = spd ] && echo 0x$i > /sys/bus/i2c/devices/i2c-9/delete_device;done; /usr/bin/modprobe -r --ignore-remove at24 $CMDLINE_OPT

Please note that there is no guarantee that i2c-9 will always be the
SMBus I801 adapter. In most cases the I2C bus numbering is stable, but
that's not guaranteed, and a number of things could cause it to change
(driver loading order for example, device hot-plugging...)

> But then, we see at man 8 modprobe.d, regarding "install modulename command...":
> 
>       The long term future of this command as a solution to the problem of providing additional module
>       dependencies is not assured and it is intended to replace this command with a warning about its
>       eventual removal or deprecation at some point in a future release. Its use complicates the automated
>       determination of module dependencies by distribution utilities, such as mkinitrd (because these now
>       need to somehow interpret what the install commands might be doing. In a perfect world, modules
>       would provide all dependency information without the use of this command and work is underway to
>       implement soft dependency support within the Linux kernel.
>            ...
>  COMPATIBILITY
>       A future version of kmod will come with a strong warning to avoid use of the install as explained
>       above. This will happen once support for soft dependencies in the kernel is complete. That support
>       will complement the existing softdep support within this utility by providing such dependencies
>       directly within the modules.
> 
> This leads me to the impression that the at24 driver is being considered "faulty", simply because it fails to do its own configuration.  What are the "rules" about "proper" module and driver functionality?

The problem is that you are trying hard to have the at24 driver mimic
the eeprom driver's behavior. There's really no reason to do that.
There is no reason to put these sysfs attribute writes into the
modprobe rule. They are simple shell commands that should be included
in whatever boot-time init script your distribution implements. That
will even load the at24 driver automatically for you.

If anything, the at24 driver itself is the only sane thing in the whole
mess.

> But, you say the long-term plan is already to automate this, using the i2c-i801 SMBus controller driver?  So, modprobe is just a temporary work-around - hopefully.

Actually, writing to the sysfs attribute "new_device" is the temporary
workaround.

> The kernel *did* autoload the at24 driver at "echo spd 0x50 > /sys/bus/i2c/devices/i2c-9/new_device", but the at24 module must be explicitly loaded anyway, to actually invoke the modprobe command.

It must not. You decided to do it this way, and it looks convoluted to
me.

> Oddly, with the eeprom driver loaded, it is not possible to write to "delete_device", which instead responds with "sh: line 0: echo: write error: No such file or directory".  Removing eeprom will delete its devices, but removing at24 will not.

As I recall, delete_device can only delete devices which were created
by new_device, by design. Either you instantiate from user-space, or
from kernel-space, you can't mix.

> The documentation is contradictory about correct behavior.

Which documentation specifically?

> But then, if the at24 devices are configured, and the eeprom driver is loaded, the eeprom driver will not delete the at24 devices, and will fail to configure its own devices when loaded.

Correct, again on purpose. The eeprom driver can only bind to addresses
that haven't been claimed by anyone yet. By declaring "spd" devices
through sysfs, these addresses are no longer free to claim. Only a
driver which declared itself as supporting the "spd" devices will be
allowed to bind to them. That's the at24 driver.

> The at24 driver also cannot delete the eeprom devices, when both drivers are loaded.  Of course loading both drivers is not "normal".

The core of the problem is that these 2 drivers are using different
device driver models. Both are "supported" technically, although
clearly the at24 one is recommended while the eeprom one is deprecated.
They do not mix well. That being said, loading 2 drivers for the same
device never works well anyway.

> We see at Documentation/i2c/instantiating-devices:
> 
>  Method 2: Instantiate the devices explicitly
>  ...
>  The driver which instantiated the I2C device is responsible for destroying it on cleanup.

Typical case would be some platform specific driver instantiating
known-to-be-present I2C devices. The i2c-i801 driver already did that
for a limited number of I2C devices, and with my proposed patches that
would be extended to SPD EEPROMs.

> but then:
> 
>  Method 4: Instantiate from user-space
>  ...
>  You can also instantiate the device before the driver is loaded or even
>  available, and you don't need to know what driver the device needs.
> 
> Those "recommendations" or "rules" appear to be contradictory, and there is no author attribution with the Documentation to determine to whom to appeal for clarification.

"git blame" will always tell you, but I'll save you the effort, I'm the
one to appeal to for clarification ;-)

I'm sorry but I can't see how these recommendations are contradictory.
These are alternative ways to achieve the same thing, the user
shall pick one or the other depending on the situation. Right now the
at24 driver needs method 4 (on x86 PC, that is), and I'm trying to move
it to method 2 for the majority of users.

Method 4 differs from all others in that it happens in user-space, so
it in inherently very different from the others.

> As an aside, I notice that when the eeprom driver is removed at the same time that the at24 driver is being loaded, using the modprobe command shown above, but without the sleep, there are lots of errors showing with dmesg, whether or not the eeprom module was actually already loaded:
> 
> i2c i2c-9: new_device: Instantiated device spd at 0x50
> ...
> at24 9-0050: 256 byte spd EEPROM, read-only
> ...
> i2c i2c-9: Failed to register i2c client spd at 0x50 (-16)
> ...
> i2c i2c-9: Failed to register i2c client spd at 0x50 (-16)
> ...
> 
> but the result seems to be correct, with the at24 devices replacing the eeprom devices.  I don't know why these error messages would be generated, unless there is some race condition or something blocking. 

This is news to me, thanks for reporting. I suppose that part of the
removal process is asynchronous, so "modprobe -r" returns before
everything is actually cleaned up. I guess things are done this way for
performance reasons, but it can clearly cause confusion. Furthermore, I
would expect a subsequent "modprobe" to be serialized to prevent any
kind of race. Apparently that's not happening, but that would be a
module core issue, not specific to the eeprom and at24 drivers. Again
the root of the problem is having 2 drivers for the same device - that
never works well.

> Not running "modprobe -r eeprom" immediately before, and presumably overlapping with, "modprobe at24" avoids the error messages.  It's just confusing, since the message states that something failed, when in fact, everything worked.  Perhaps those misleading error messages arise from some odd behavior with "modprobe --remove".  Avoiding the error messages requires a substantial delay between the modprobe remove and install commands, at least 4 seconds, and sometimes 5 or more seconds.  And, as mentioned, removing the eeprom driver *after* the at24 driver is loaded does not actually create at24 spd devices.

My suggestion would be to simply not load the eeprom driver in the
first place. There's really no need to create more problems than you
already have.

> BTW, at https://git.kernel.org/pub/scm/utils/i2c-tools/i2c-tools.git/about/ we still see:
> 
> eeprom
>   Perl scripts for decoding different types of EEPROMs (SPD, EDID...) These
>   scripts rely on the "eeprom" kernel driver.
> 
> even though the eeprom driver is now deprecated.

Thanks for pointing it out, I'll fix that. I thought I had done that
already but that must have been somewhere else.

> I hope that these naive outsider observations will be helpful or useful in some way.  Of course, I appreciate the work you do, to make it all possible.

Yes, feedback is always appreciated. If nothing else it helps me know
where my time is best spent to make things better.

-- 
Jean Delvare
SUSE L3 Support



[Index of Archives]     [Linux GPIO]     [Linux SPI]     [Linux Hardward Monitoring]     [LM Sensors]     [Linux USB Devel]     [Linux Media]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux