Hi Alexander,

On Wed, Feb 07, 2024 at 04:27:48PM -0500, Alexander Aring wrote:
> Hi,
>
> On Wed, Feb 7, 2024 at 1:33 PM Jordan Rife <jrife@xxxxxxxxxx> wrote:
> >
> > On Wed, Feb 7, 2024 at 2:39 AM Salvatore Bonaccorso <carnil@xxxxxxxxxx> wrote:
> > >
> > > Hi Valentin, hi all
> > >
> > > [This is about a regression reported in Debian for 6.1.67]
> > >
> > > On Tue, Feb 06, 2024 at 01:00:11PM +0100, Valentin Kleibel wrote:
> > > > Package: linux-image-amd64
> > > > Version: 6.1.76+1
> > > > Source: linux
> > > > Source-Version: 6.1.76+1
> > > > Severity: important
> > > > Control: notfound -1 6.6.15-2
> > > >
> > > > Dear Maintainers,
> > > >
> > > > We discovered a bug affecting dlm that prevents any tcp communications by
> > > > dlm when booted with debian kernel 6.1.76-1.
> > > >
> > > > Dlm startup works (corosync-cpgtool shows the dlm:controld group with all
> > > > expected nodes) but as soon as we try to add a lockspace dmesg shows:
> > > > ```
> > > > dlm: Using TCP for communications
> > > > dlm: cannot start dlm midcomms -97
> > > > ```
> > > >
> > > > It seems that commit "dlm: use kernel_connect() and kernel_bind()"
> > > > (e9cdebbe) was merged to 6.1.
> > > >
> > > > Checking the code it seems that the changed function dlm_tcp_listen_bind()
> > > > fails with exit code 97 (EAFNOSUPPORT)
> > > > It is called from
> > > >
> > > > dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
> > > > dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
> > > > dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
> > > > dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()
> > > >
> > > > The error code is returned all the way to threads_start() where the error
> > > > message is emitted.
> > > >
> > > > Booting with the unsigned kernel from testing (6.6.15-2), which also
> > > > contains this commit, works without issues.
> > > >
> > > > I'm not sure what additional changes are required to get this working or if
> > > > rolling back this change is an option.
> > > >
> > > > We'd be happy to test patches that might fix this issue.
> > >
> > > Thanks for your report. So we have a 6.1.76 specific regression for
> > > the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
> > > kernel_bind()") .
> > >
> > > Let's loop in the upstream regression list for tracking and people
> > > involved for the subsystem to see if the issue can be identified. As
> > > it is working for 6.6.15 which includes the commit backport as well it
> > > might be very well that a prerequisite is missing.
> > >
> > > # annotate regression with 6.1.y specific commit
> > > #regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
> > > #regzbot link: https://bugs.debian.org/1063338
> > >
> > > Any ideas?
> > >
> > > Regards,
> > > Salvatore
> >
> >
> > Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
> > and 6.6 stable branches,
> > it looks like there is a mismatch here with the dlm_local_addr[0] parameter.
> >
> > 6.1
> > ----
> >
> > static int dlm_tcp_listen_bind(struct socket *sock)
> > {
> > 	int addr_len;
> >
> > 	/* Bind to our port */
> > 	make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> > 	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> > 			   addr_len);
> > }
> >
> > 6.6
> > ----
> > static int dlm_tcp_listen_bind(struct socket *sock)
> > {
> > 	int addr_len;
> >
> > 	/* Bind to our port */
> > 	make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> > 	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> > 			   addr_len);
> > }
> >
> > 6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
> > changed
> >
> > static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
> >
> > to
> >
> > static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
> >
> > It looks like kernel_bind() in 6.1 needs to be modified to match.
> >
>
> makes sense. I tried to cherry-pick e9cdebbe23f1 ("dlm: use
> kernel_connect() and kernel_bind()") on v6.1.67 as I don't see it
> there. It failed and does not apply cleanly.
>
> Are we talking here about a debian kernel specific backport? If so,
> maybe somebody missed to modify those parts you mentioned.

Thanks all for looking into it.

No it's not a Debian specific backport, e9cdebbe23f1 ("dlm: use
kernel_connect() and kernel_bind()") got in fact backported upstream
in 6.1.76, 6.6.15 and 6.7.3. The respective commits are:

v6.1.76: e11dea8f503341507018b60906c4a9e7332f3663 dlm: use kernel_connect() and kernel_bind()
v6.6.15: c018ab3e31b16ff97b9b95b69904104c9fcca95b dlm: use kernel_connect() and kernel_bind()
v6.7.3: 4ecf1864f2076872b7aea29d463e785ef6fc9909 dlm: use kernel_connect() and kernel_bind()
v6.8-rc1: e9cdebbe23f1aa9a1caea169862f479ab3fa2773 dlm: use kernel_connect() and kernel_bind()

But for the 6.1.76 case there is the above regression (while it works
for 6.6.15 as confirmed by the reporter).
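
For illustration, here is a small user-space C sketch of the mismatch
Jordan describes. It is not kernel code and not a tested patch:
ADDR_COUNT, backing, addr_ptrs, addr_vals, family_at() and init_addrs()
are hypothetical names, and it assumes Linux, where the address family
is the first field of struct sockaddr_storage. It shows that &array[0]
points at a real address in the 6.6 layout, but at a pointer slot in
the 6.1 layout, so the bytes read as the family are pointer bits. That
would be consistent with the EAFNOSUPPORT (-97) seen on 6.1.76.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define ADDR_COUNT 3 /* hypothetical stand-in for DLM_MAX_ADDR_COUNT */

/* 6.1-style storage: an array of pointers to separately stored addresses. */
static struct sockaddr_storage backing;
static struct sockaddr_storage *addr_ptrs[ADDR_COUNT] = { &backing };

/* 6.6-style storage: the addresses live directly in the array. */
static struct sockaddr_storage addr_vals[ADDR_COUNT];

/*
 * Read the leading family field from whatever memory a bind call would
 * be handed. memcpy keeps the read well defined even when p does not
 * actually point at a socket address.
 */
static sa_family_t family_at(const void *p)
{
	sa_family_t family;

	memcpy(&family, p, sizeof(family));
	return family;
}

/* Store AF_INET as the family in slot 0 of both layouts. */
static void init_addrs(void)
{
	backing.ss_family = AF_INET;
	addr_vals[0].ss_family = AF_INET;
}
```

After init_addrs(), family_at(&addr_vals[0]) and family_at(addr_ptrs[0])
both give AF_INET, while family_at(&addr_ptrs[0]) gives whatever the
leading bytes of the pointer happen to be. If this reading is right, the
6.1.y call would presumably need to pass dlm_local_addr[0] itself rather
than &dlm_local_addr[0], but that is an assumption until someone tests it.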
I'm very sorry, I see where I have caused you confusion: the
regression is in 6.1.*76*, not 6.1.*67*, and I mistyped the version in
two places.

Regards,
Salvatore