On Wed 2022-05-11 16:33:57, Song Liu wrote:
>
>
> > On May 11, 2022, at 2:24 AM, Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Tue 2022-05-10 17:33:31, Josh Poimboeuf wrote:
> >> On Tue, May 10, 2022 at 11:57:04PM +0000, Song Liu wrote:
> >>>> If it's a real bug, we should fix it everywhere, not just for Facebook.
> >>>> Otherwise CONFIG_PREEMPT and/or non-x86 arches become second-class
> >>>> citizens.
> >>>
> >>> I think "is it a real bug?" is the top question for me. So maybe we
> >>> should take a step back.
> >>>
> >>> The behavior we see is: A busy kernel thread blocks klp transition
> >>> for more than a minute. But the transition eventually succeeded after
> >>> < 10 retries on most systems. The kernel thread is well-behaved, as
> >>> it calls cond_resched() at a reasonable frequency, so this is not a
> >>> deadlock.
> >>>
> >>> If I understand Petr correctly, this behavior is expected, and thus
> >>> is not a bug or issue for the livepatch subsystem. This is different
> >>> to our original expectation, but if this is what we agree on, we
> >>> will look into ways to incorporate long wait time for patch
> >>> transition in our automations.
> >>
> >> That's how we've traditionally looked at it, though apparently Red Hat
> >> and SUSE have implemented different ideas of what a long wait time is.
> >>
> >> In practice, one minute has always been enough for all of kpatch's users
> >> -- AFAIK, everybody except SUSE -- up until now.
> >
> > I am actually surprised that nobody met the problem yet. There are
> > "only" 60 attempts to transition the pending tasks.
>
> Maybe we should consider increase the frequency we try? Say to 10 times
> per second? I guess this will solve most of the failures we are seeing
> in current case.

My concern is that klp_try_complete_transition() checks all processes
under read_lock(&tasklist_lock). It might create some contention
on this lock. I am not sure if this lock is fair. It might slow down
block writers (creating/deleting tasks).

Best Regards,
Petr
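
To put the contention concern in rough numbers, here is a back-of-envelope
sketch, not a measurement. It assumes each transition attempt holds
read_lock(&tasklist_lock) for a fixed time while it walks the task list.
The 2 ms scan cost and the helper name read_lock_duty_cycle() are
hypothetical and chosen only for illustration. The model also ignores lock
fairness and how long a writer actually waits, which is the open question.

```python
def read_lock_duty_cycle(scan_seconds: float, attempts_per_second: float) -> float:
    """Fraction of wall-clock time the read lock is held by the retry loop.

    scan_seconds: assumed time one attempt holds the read lock (hypothetical).
    attempts_per_second: how often the transition is retried.
    The result is capped at 1.0 (the lock cannot be held more than always).
    """
    if scan_seconds < 0 or attempts_per_second < 0:
        raise ValueError("inputs must be non-negative")
    return min(1.0, scan_seconds * attempts_per_second)


if __name__ == "__main__":
    SCAN_SECONDS = 0.002  # hypothetical cost of one task-list walk
    # 1 attempt/s is the current rate; 10 attempts/s is the proposed rate.
    for rate in (1, 10):
        share = read_lock_duty_cycle(SCAN_SECONDS, rate)
        print(f"{rate:>2} attempts/s -> read lock held {share:.1%} of the time")
```

Under this simple model the time spent holding the read lock grows linearly
with the retry frequency, so going from 1 to 10 attempts per second makes it
ten times larger. Whether that matters in practice depends on the real scan
cost and on how writers are treated by the lock.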