[Bug 1814682] Review Request: rshim - rshim driver for Mellanox BlueField SoC

https://bugzilla.redhat.com/show_bug.cgi?id=1814682



--- Comment #36 from Alaa Hleihel (Mellanox) <ahleihel@xxxxxxxxxx> ---
Hi,

I logged in to the system and found the following issues:

################################################################

1. rshim service start fails:

Apr 12 02:39:06 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
rshim[4799]: Probing pcie-01:00.2
Apr 12 02:39:06 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
rshim[4799]: create rshim pcie-01:00.2
Apr 12 02:39:06 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
rshim[4799]: Failed to map RShim registers

[root@qualcomm-amberwing-rep2-01 ~]# rshim -f
modprobe: FATAL: Module cuse not found in directory /lib/modules/4.18.0-147.el8.aarch64
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Probing pcie-01:00.2
create rshim pcie-01:00.2
Failed to map RShim registers


The reason is that a required module (cuse) is not installed on the system:
[root@qualcomm-amberwing-rep2-01 ~]# modinfo cuse
modinfo: ERROR: Module cuse not found.


The fix is: 
# dnf install -y kernel-modules-extra

Then the module will be available:
[root@qualcomm-amberwing-rep2-01 ~]# modinfo cuse
filename:       /lib/modules/4.18.0-147.el8.aarch64/kernel/fs/fuse/cuse.ko.xz
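
A pre-check along these lines could catch the missing module before the service is started. This is only a sketch: the helper name has_kernel_module is made up for illustration (it is not part of rshim), and it simply looks for a <module>.ko* file under a modules directory, so it will not notice modules that are built into the kernel.

```shell
#!/bin/sh
# Hypothetical helper (not part of rshim): succeed if a file named
# <module>.ko, optionally compressed (.ko.xz, .ko.gz, ...), exists somewhere
# under the given modules directory. Defaults to the running kernel's tree.
has_kernel_module() {
    module=$1
    moddir=${2:-/lib/modules/$(uname -r)}
    [ -d "$moddir" ] || return 1
    # Keep only the first match; an empty result means the module is absent.
    found=$(find "$moddir" -type f -name "${module}.ko*" 2>/dev/null | head -n 1)
    [ -n "$found" ]
}

# Example use:
#   has_kernel_module cuse || echo "cuse missing: dnf install -y kernel-modules-extra"
```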


################################################################

2. rshim service stop fails:
Apr 12 02:35:57 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: Stopping rshim driver for BlueField SoC...
Apr 12 02:35:57 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[4383]: rshim.service: Failed to execute command: No such file or
directory
Apr 12 02:35:57 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[4383]: rshim.service: Failed at step EXEC spawning /usr/bin/killall: No such file or directory
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 12 02:35:57 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: rshim.service: Control process exited, code=exited status=203
Apr 12 02:36:55 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
sshd[4384]: Connection closed by 10.35.206.44 port 60160 [preauth]
Apr 12 02:36:59 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
sshd[4386]: Accepted password for root from 10.35.206.44 port 60162 ssh2
Apr 12 02:36:59 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd-logind[1469]: New session 5 of user root.
Apr 12 02:36:59 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: Started Session 5 of user root.
Apr 12 02:36:59 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
sshd[4386]: pam_unix(sshd:session): session opened for user root by (uid=0)
Apr 12 02:37:27 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: rshim.service: State 'stop-sigterm' timed out. Killing.
Apr 12 02:37:27 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: rshim.service: Killing process 4363 (rshim) with signal SIGKILL.
Apr 12 02:37:27 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: rshim.service: Failed with result 'exit-code'.
Apr 12 02:37:27 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: Stopped rshim driver for BlueField SoC.


The fix is: 
# dnf install -y psmisc
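
Since systemd fails here while spawning an absolute path, a small check for the executables the service depends on may help spot this earlier. The sketch below is an illustration only: missing_executables is a hypothetical helper, and the list of paths to pass in is an assumption (the log above only confirms /usr/bin/killall).

```shell
#!/bin/sh
# Hypothetical helper: print every given path that is not an executable file,
# one per line. Prints nothing when all of them are present.
missing_executables() {
    for path in "$@"; do
        [ -x "$path" ] || printf '%s\n' "$path"
    done
}

# Example use:
#   missing=$(missing_executables /usr/bin/killall)
#   [ -z "$missing" ] || echo "missing: $missing (try: dnf install -y psmisc)"
```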

################################################################

3. Even after fixing the above, we still fail to load everything:

[root@qualcomm-amberwing-rep2-01 ~]# rshim  -f
Probing pcie-01:00.2
create rshim pcie-01:00.2
Failed to map RShim registers


From strace on "rshim -f":

write(1, "Probing pcie-01:00.2\n", 21Probing pcie-01:00.2
)  = 21
write(1, "create rshim pcie-01:00.2\n", 26create rshim pcie-01:00.2
) = 26
openat(AT_FDCWD, "/dev/mem", O_RDWR|O_SYNC) = -1 ENOENT (No such file or directory)
                 ^^^^^^^^^^                   ^^^^^^^^^^^^^^
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_SHARED, -1, 0x80100300000) = -1
EBADF (Bad file descriptor)
write(1, "Failed to map RShim registers\n", 30Failed to map RShim registers
) = 30

That's because CONFIG_DEVMEM is not enabled in the kernel:

[root@qualcomm-amberwing-rep2-01 ~]# grep CONFIG_DEVMEM /boot/config-4.18.0-147.el8.aarch64
# CONFIG_DEVMEM is not set

--> Note: I see that this config is disabled only on aarch64 in RHEL-8.
I built a kernel with CONFIG_DEVMEM enabled, and then it worked:

[root@qualcomm-amberwing-rep2-01 ~]# ls -l /dev/mem
crw-r-----. 1 root kmem 1, 1 Apr 12  2020 /dev/mem
[root@qualcomm-amberwing-rep2-01 ~]# systemctl start rshim
[root@qualcomm-amberwing-rep2-01 ~]# systemctl status rshim
● rshim.service - rshim driver for BlueField SoC
   Loaded: loaded (/usr/lib/systemd/system/rshim.service; disabled; vendor
preset: disabled)
   Active: active (running) since Sun 2020-04-12 05:36:57 EDT; 4s ago
     Docs: man:rshim(8)
  Process: 5783 ExecStart=/usr/sbin/rshim $OPTIONS (code=exited,
status=0/SUCCESS)
 Main PID: 5784 (rshim)
    Tasks: 6 (limit: 37682)
   Memory: 32.5M
   CGroup: /system.slice/rshim.service
           └─5784 /usr/sbin/rshim

Apr 12 05:36:57 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: Starting rshim driver for BlueField SoC...
Apr 12 05:36:57 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
systemd[1]: Started rshim driver for BlueField SoC.
Apr 12 05:36:57 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
rshim[5784]: Probing pcie-01:00.2
Apr 12 05:36:57 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
rshim[5784]: create rshim pcie-01:00.2
Apr 12 05:36:58 qualcomm-amberwing-rep2-01.khw4.lab.eng.bos.redhat.com
rshim[5784]: rshim0 attached
[root@qualcomm-amberwing-rep2-01 ~]# ls -l /dev/rshim*
total 0
crw-------. 1 root root 241, 0 Apr 12 05:36 boot
crw-------. 1 root root 240, 0 Apr 12 05:36 console
crw-------. 1 root root 239, 0 Apr 12 05:36 misc
crw-------. 1 root root 242, 0 Apr 12 05:36 rshim
[root@qualcomm-amberwing-rep2-01 ~]# 
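
The CONFIG_DEVMEM check above can be scripted so it runs before the service is started. A minimal sketch, assuming the kernel config is readable under /boot as on this system; kconfig_enabled is a hypothetical helper name, and it only treats "=y" as enabled:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the option is set to "y" in the given
# kernel config file. Returns 2 if the config file cannot be read.
kconfig_enabled() {
    option=$1
    config=${2:-/boot/config-$(uname -r)}
    [ -r "$config" ] || return 2
    # A disabled option appears as "# CONFIG_FOO is not set" and will not match.
    grep -q "^${option}=y\$" "$config"
}

# Example use:
#   kconfig_enabled CONFIG_DEVMEM || echo "no /dev/mem: rshim cannot map its registers"
```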


################################################################

4. Even after fixing all previous issues, accessing the device always hangs.
E.g., either of these will hang:
# cat  /dev/rshim0/misc
# sudo minicom --color on --baudrate 115200 --device /dev/rshim0/console

And dmesg will show something like this:

Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: INFO: task cat:6591 blocked
for more than 60 seconds.
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel:      Not tainted 4.18.0 #1
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: cat             D    0  6591
  6316 0x00000201
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: Call trace:
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: __switch_to+0x6c/0x90
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: __schedule+0x270/0x8a8
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: schedule+0x30/0x78
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel:
request_wait_answer+0x144/0x260 [fuse]
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel:
__fuse_request_send+0xac/0xd0 [fuse]
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: fuse_request_send+0x58/0x68
[fuse]
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: fuse_direct_io+0x358/0x5a0
[fuse]
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: cuse_read_iter+0x78/0xa0
[cuse]
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: new_sync_read+0x108/0x158
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: __vfs_read+0x74/0x90
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: vfs_read+0x98/0x150
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: ksys_read+0x6c/0xd0
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: __arm64_sys_read+0x24/0x30
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: el0_svc_handler+0xa0/0x128
Apr 12 05:46:41 qualcomm-amberwing-rep2-01 kernel: el0_svc+0x8/0xc



The BlueField version in use was:

Mellanox BlueField A0 BL1 V1.0
NOTICE:  BL2: v1.5(release):BL2.2
NOTICE:  BL2: Built : 15:58:07, Jul 25 2019
NOTICE:  BL2 built for hw (ver 0)
NOTICE:  Running as MBF1M332A-AS system
NOTICE:  Initializing DDR at mss[0]=0x18000000
NOTICE:  No SPD detected on MSS0 DIMM0
NOTICE:  No SPD detected on MSS0 DIMM1
NOTICE:  No memory present on MSS 0
NOTICE:  Doing MSS idle operations on MSS 0
NOTICE:  Initializing DDR at mss[1]=0x20000000
NOTICE:  No SPD detected on MSS1 DIMM0
NOTICE:  No SPD detected on MSS1 DIMM1
NOTICE:  Finished initializing DDR on MSS 1!
NOTICE:  DDR POST passed.
NOTICE:  BL1: Booting BL31
NOTICE:  BL31: v1.5(release):BL2.2
NOTICE:  BL31: Built : 15:58:07, Jul 25 2019
NOTICE:  BL31 built for hw (ver 0)
UEFI firmware (version BlueField:2.0-f399628 built at 15:59:48 on Jul 25 2019)


I've updated it to the latest BlueField-2.5.1.11213 (using kernel module rshim
version rshim-1.18-0.gb99e894_4.18.0.aarch64 from the BlueField-2.5.1.11213
bundle):

Mellanox BlueField A0 BL1 V1.0
NOTICE:  Enabled watchdog (120 sec delay)
NOTICE:  Next boot will be in swap_emmc mode
NOTICE:  BL2: v1.5(release):2.5.1-0-gbe0dd6b
NOTICE:  BL2: Built : 23:42:29, Apr  2 2020
NOTICE:  BL2 built for hw (ver 0)
NOTICE:  Running as MBF1M332A-AS system
NOTICE:  Initializing DDR at mss[0]=0x18000000
NOTICE:  No SPD detected on MSS0 DIMM0
NOTICE:  No SPD detected on MSS0 DIMM1
NOTICE:  No memory present on MSS 0
NOTICE:  Doing MSS idle operations on MSS 0
NOTICE:  Initializing DDR at mss[1]=0x20000000
NOTICE:  No SPD detected on MSS1 DIMM0
NOTICE:  No SPD detected on MSS1 DIMM1
NOTICE:  Finished initializing DDR on MSS 1!
NOTICE:  DDR POST passed.
NOTICE:  BL1: Booting BL31
NOTICE:  BL31: v1.5(release):2.5.1-0-gbe0dd6b
NOTICE:  BL31: Built : 23:42:29, Apr  2 2020
NOTICE:  BL31 built for hw (ver 0)
UEFI firmware (version BlueField:2.5.1-0-ga9be8ec built at 23:43:44 on Apr  2
2020)


But it still hangs; I will continue checking...

