RE: [EXT] Re: Suspend/resume Issue on pcm_dmix.c in alsa-lib

Chancel Liu <chancel.liu@xxxxxxx> · Thu, 5 Sep 2024 11:01:11 +0000

> > > > Hi Takashi,
> > > >
> > > > Thanks for your reply and suggestions. We have finally found the root
> > > > cause. It seems to be related to both the driver and alsa-lib.
> > > >
> > > > When two dmix clients run in parallel we get two direct dmix instances.
> > > > 1st dmix instance:
> > > > snd_pcm_dmix_open()
> > > >       snd_pcm_direct_initialize_slave()
> > > >               save_slave_setting()
> > > > Since the driver we are using sets the SND_PCM_INFO_RESUME flag,
> > > > dmix->spcm->info has this flag. The flag is then cleared in
> > > > dmix->shmptr->s.info.
> > > >
> > > > 2nd dmix instance:
> > > > snd_pcm_dmix_open()
> > > >       snd_pcm_direct_open_secondary_client()
> > > >               copy_slave_setting()
> > > > The 2nd dmix->spcm->info is copied from dmix->shmptr->s.info, so it
> > > > doesn't have this flag.
> > > >
> > > > If the 1st dmix instance resumes first, it should perform the recovery
> > > > of the slave pcm in snd_pcm_direct_slave_recover(). Because the 1st
> > > > dmix->spcm->info has SND_PCM_INFO_RESUME,
> > > > snd_pcm_resume(direct->spcm) can be called correctly to resume the
> > > > slave pcm.
> > >
> > > ... and immediately stop the stream, then prepare and restart as a usual
> > > restart.
> > >
> > > > However, if the 2nd dmix instance resumes first,
> > > > snd_pcm_resume(direct->spcm) will not be called because its
> > > > spcm->info doesn't have the SND_PCM_INFO_RESUME flag. The 1st dmix
> > > > instance assumes someone else already did the recovery, so
> > > > snd_pcm_resume(direct->spcm) won't be called either. As a result the
> > > > slave pcm fails to resume.
> > >
> > > Something wrong happening here, then.
> > >
> > > In dmix, there is no hardware resume at all, but it's always a restart of
> > > the stream.  The call of snd_pcm_resume() is only temporary, for
> > > inconsistencies that can be a problem on some drivers (IIRC dmaengine
> > > stuff).  That said, dmix does a kind of fake resume, stops and restarts
> > > the stream cleanly on the first instance.  On the second instance, it's
> > > already recovered, hence it bails out.
> > >
> > > If poll() hangs on the second instance, there can be some other problem.
> > > Maybe the resume -> stop -> restart sequence doesn't work well with your
> > > driver?
> > >
> >
> > Our dma driver does PAUSE in system suspend and requires RESUME in
> > system resume. The current problem is that snd_pcm_resume() is called by
> > neither the 1st instance nor the 2nd instance.
> 
> That's weird.  Are you really testing with the latest alsa-lib code?
> 
> If application doesn't call snd_pcm_resume(), it means that the PCM
> state isn't set to SUSPENDED, so it pretends as if still running.
> 
> Or if you mean that snd_pcm_resume() to the slave PCM isn't called
> (even though snd_pcm_resume() is called for the dmix PCM), check
> whether snd_pcm_direct_slave_recover() gets called, especially at the
> point:
> 
>         /* some buggy drivers require the device resumed before prepared;
>          * when a device has RESUME flag and is in SUSPENDED state, resume
>          * here but immediately drop to bring it to a sane active state.
>          */
>         if (state == SND_PCM_STATE_SUSPENDED &&
>             (direct->spcm->info & SND_PCM_INFO_RESUME)) {
>                 snd_pcm_resume(direct->spcm);
>                 snd_pcm_drop(direct->spcm);
>                 snd_pcm_direct_timer_stop(direct);
>                 snd_pcm_direct_clear_timer_queue(direct);
>         }
> 
> Try to put debug prints or catch via breakpoint whether this code path
> is executed.
> 
> Also, does the issue happen with the latest 6.11-rc kernel, too?
> If yes, what if you drop SNDRV_PCM_INFO_RESUME bit flag in the driver
> side?  Does the problem persist, or it works?
> 

I'm working with kernel 6.6 and alsa-lib v1.2.11. I don't think that is very
outdated, but I will try switching to the latest versions.

I did do some debugging on this part. Please see my comments inline.

int snd_pcm_direct_slave_recover(snd_pcm_direct_t *direct)
{
	...

	/* [Chancel]
	 * When two dmix clients run in parallel we get two direct dmix instances.
	 * The 1st dmix->spcm->info has the SND_PCM_INFO_RESUME flag but the 2nd doesn't.
	 * Let's name the 1st opened dmix "dmix1" and the 2nd "dmix2".
	 * After resume, both dmix1 and dmix2 enter snd_pcm_direct_slave_recover().
	 * Here we assume dmix2 is the instance that gets here first.
	 * dmix2 successfully takes the semaphore lock and dmix1 waits for it.
	 */

	semerr = snd_pcm_direct_semaphore_down(direct,
					   DIRECT_IPC_SEM_CLIENT);
	...
	state = snd_pcm_state(direct->spcm);
	if (state != SND_PCM_STATE_XRUN && state != SND_PCM_STATE_SUSPENDED) {

	/* [Chancel]
	 * dmix2 finds the spcm state is SUSPENDED so it will not enter here.
	 * However, when dmix1 later gets the lock and enters here, the spcm state
	 * has already been changed to RUNNING by dmix2.
	 * As a result dmix1 assumes some other instance has done the recovery, so
	 * dmix1 returns directly.
	 * snd_pcm_resume() is not called by dmix1.
	 */

		/* ignore... someone else already did recovery */
		semerr = snd_pcm_direct_semaphore_up(direct,
						     DIRECT_IPC_SEM_CLIENT);
		if (semerr < 0) {
			SNDERR("SEMUP FAILED with err %d", semerr);
			return semerr;
		}

		return 0;
	}
	...

	if (state == SND_PCM_STATE_SUSPENDED &&
	    (direct->spcm->info & SND_PCM_INFO_RESUME)) {

	/* [Chancel]
	 * dmix2->spcm->info doesn't have SND_PCM_INFO_RESUME flag. So this condition is not met.
	 * snd_pcm_resume() is not called by dmix2.
	 */

		snd_pcm_resume(direct->spcm);
		snd_pcm_drop(direct->spcm);
		snd_pcm_direct_timer_stop(direct);
		snd_pcm_direct_clear_timer_queue(direct);
	}
	...
	ret = snd_pcm_prepare(direct->spcm);
	...

	/* [Chancel]
	 * dmix2 calls snd_pcm_start to set spcm state to RUNNING.
	 */

	ret = snd_pcm_start(direct->spcm);
	...
}

The dma driver I'm using supports pause/resume, so I don't think dropping
SNDRV_PCM_INFO_RESUME is a good fix for this issue. Besides this driver, I also
validated on another driver whose dma doesn't set this flag. There the issue is
gone and both instances work well with suspend/resume.
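
For illustration, the ordering problem annotated above can be reduced to a small
standalone C model. This is only a sketch under assumptions: every model_* name
is hypothetical and is not alsa-lib API, the XRUN case and the semaphore are
left out, and the two instances are simply run one after the other in the order
the lock would be granted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for SND_PCM_INFO_RESUME; not the real value. */
#define MODEL_INFO_RESUME 0x1u

enum model_state { MODEL_SUSPENDED, MODEL_RUNNING };

/* Shared slave pcm as seen by every dmix instance. */
struct model_slave {
	enum model_state state;
	int resume_calls;	/* how often the resume step was reached */
};

/* Per-instance view; spcm_info models each instance's spcm->info copy. */
struct model_client {
	unsigned int spcm_info;
};

/* Simplified model of the recovery flow quoted above. */
static void model_slave_recover(const struct model_client *c,
				struct model_slave *s)
{
	if (s->state != MODEL_SUSPENDED)
		return;	/* "someone else already did recovery" */

	if (c->spcm_info & MODEL_INFO_RESUME)
		s->resume_calls++;	/* stands in for snd_pcm_resume() */

	/* stands in for snd_pcm_prepare() + snd_pcm_start() */
	s->state = MODEL_RUNNING;
}

/* Runs both instances in the given order; returns the resume call count. */
static int model_resume_calls(bool secondary_first)
{
	struct model_slave slave = { MODEL_SUSPENDED, 0 };
	struct model_client dmix1 = { MODEL_INFO_RESUME };	/* 1st opened */
	struct model_client dmix2 = { 0 };	/* flag cleared via shm copy */

	if (secondary_first) {
		model_slave_recover(&dmix2, &slave);
		model_slave_recover(&dmix1, &slave);
	} else {
		model_slave_recover(&dmix1, &slave);
		model_slave_recover(&dmix2, &slave);
	}

	assert(slave.state == MODEL_RUNNING);
	return slave.resume_calls;
}
```

In this model, model_resume_calls(false) returns 1 and model_resume_calls(true)
returns 0: the slave ends up RUNNING in both orders, but the resume step is only
reached when the first opened instance recovers first.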

Regards, 
Chancel Liu

> > > > The SND_PCM_INFO_RESUME flag has an impact on the flow of dmix resume.
> > > > In my opinion the first resumed dmix instance should make sure the
> > > > slave pcm can be recovered properly, no matter whether it's the first
> > > > opened instance or a secondary opened instance.
> > >
> > > The snd_pcm_resume() gets called no matter which instance, just the first
> > > one who tries to recover the suspended state.  (And it's called internally
> > > at updating the various state, not necessarily an explicit recovery call.)
> > >
> >
> > Unfortunately, if the secondary opened instance resumes first, it doesn't
> > have SND_PCM_INFO_RESUME, which causes snd_pcm_resume() to never be called.
> 
> No, it's a misunderstanding.  SND_PCM_INFO_RESUME isn't exposed to the
> application in the case of dmix at all; i.e. dmix doesn't support the
> full resume, per se. That's the design.  So it doesn't matter which
> instance gets resumed at first.
> 
> > > > Do you know why the secondary opened instance clears the
> > > > SND_PCM_INFO_RESUME flag? Can we do the following modification?
> > > >
> > > > diff --git a/src/pcm/pcm_direct.c b/src/pcm/pcm_direct.c
> > > > @@ -1183,8 +1226,6 @@ static void save_slave_setting(snd_pcm_direct_t *dmix, snd_pcm_t *spcm)
> > > >         COPY_SLAVE(buffer_time);
> > > >         COPY_SLAVE(sample_bits);
> > > >         COPY_SLAVE(frame_bits);
> > > > -
> > > > -       dmix->shmptr->s.info &= ~SND_PCM_INFO_RESUME;
> > >
> > > I don't think so.  The clearance of the RESUME flag here is correct.
> > > dmix doesn't support the hardware resume feature.  It does its own.
> > > (And this flag is merely info for apps, which isn't really evaluated
> > > except for the code in the dmix workaround there.)
> > >
> > >
> > > Takashi
> > >
> >
> > I think dmix should know what state the real driver is in. If the driver
> > requires that the app call snd_pcm_resume(), how can dmix get this information?
> 
> The dmix already knows.  But the PCM state exposed to applications
> isn't always tied as 1:1.
> 
> 
> Takashi