Re: [EXT] Re: Suspend/resume Issue on pcm_dmix.c in alsa-lib

Takashi Iwai <tiwai@xxxxxxx> · Fri, 06 Sep 2024 08:31:06 +0200

On Fri, 06 Sep 2024 08:22:23 +0200,
Chancel Liu wrote:
> 
> > > > > > > Hi Takashi,
> > > > > > >
> > > > > > > Thanks for your reply and suggestions. We have finally found the
> > > > > > > root cause. It seems to be related to both the drivers and alsa-lib.
> > > > > > >
> > > > > > > When two dmix clients run in parallel, we get two direct dmix instances.
> > > > > > > 1st dmix instance:
> > > > > > > snd_pcm_dmix_open()
> > > > > > >       snd_pcm_direct_initialize_slave()
> > > > > > >               save_slave_setting()
> > > > > > > Since the driver we are using has the SND_PCM_INFO_RESUME flag,
> > > > > > > dmix->spcm->info has this flag. This flag is then cleared in
> > > > > > > dmix->shmptr->s.info.
> > > > > > >
> > > > > > > 2nd dmix instance:
> > > > > > > snd_pcm_dmix_open()
> > > > > > >       snd_pcm_direct_open_secondary_client()
> > > > > > >               copy_slave_setting()
> > > > > > > The 2nd dmix->spcm->info is copied from dmix->shmptr->s.info, so it
> > > > > > > doesn't have this flag.
> > > > > > >
> > > > > > > If the 1st dmix instance resumes first, it should perform the
> > > > > > > recovery of the slave pcm in snd_pcm_direct_slave_recover().
> > > > > > > Because the 1st dmix->spcm->info has SND_PCM_INFO_RESUME,
> > > > > > > snd_pcm_resume(direct->spcm) can be called correctly to resume
> > > > > > > the slave pcm.
> > > > > >
> > > > > > ... and immediately stop the stream, then prepare and restart as
> > > > > > a usual restart.
> > > > > >
> > > > > > > However, if the 2nd dmix instance resumes first,
> > > > > > > snd_pcm_resume(direct->spcm) will not be called because its
> > > > > > > spcm->info doesn't have the SND_PCM_INFO_RESUME flag. The 1st dmix
> > > > > > > instance assumes someone else already did the recovery, so
> > > > > > > snd_pcm_resume(direct->spcm) won't be called either. As a
> > > > > > > result, the slave pcm fails to resume.
> > > > > >
> > > > > > Something wrong is happening here, then.
> > > > > >
> > > > > > In dmix, there is no hardware resume at all; it's always a
> > > > > > restart of the stream.  The call of snd_pcm_resume() is only there
> > > > > > temporarily, for inconsistencies that can be a problem on some
> > > > > > drivers (IIRC dmaengine stuff).  That said, dmix does a kind of
> > > > > > fake resume: it stops and restarts the stream cleanly on the first
> > > > > > instance.  On the second instance, it's already recovered, hence it
> > > > > > bails out.
> > > > > >
> > > > > > If poll() hangs on the second instance, there can be some other problem.
> > > > > > Maybe the resume -> stop -> restart sequence doesn't work well with
> > > > > > your driver?
> > > > > >
> > > > >
> > > > > Our DMA driver does PAUSE in system suspend and requires doing RESUME
> > > > > in system resume. The current problem is that snd_pcm_resume() is not
> > > > > called by either the 1st instance or the 2nd instance.
> > > >
> > > > That's weird.  Are you really testing with the latest alsa-lib code?
> > > >
> > > > If application doesn't call snd_pcm_resume(), it means that the PCM
> > > > state isn't set to SUSPENDED, so it pretends as if still running.
> > > >
> > > > Or if you mean that snd_pcm_resume() to the slave PCM isn't called
> > > > (even though snd_pcm_resume() is called for the dmix PCM), check
> > > > whether snd_pcm_direct_slave_recover() gets called, especially at
> > > > the point:
> > > >
> > > >         /* some buggy drivers require the device resumed before prepared;
> > > >          * when a device has RESUME flag and is in SUSPENDED state, resume
> > > >          * here but immediately drop to bring it to a sane active state.
> > > >          */
> > > >         if (state == SND_PCM_STATE_SUSPENDED &&
> > > >             (direct->spcm->info & SND_PCM_INFO_RESUME)) {
> > > >                 snd_pcm_resume(direct->spcm);
> > > >                 snd_pcm_drop(direct->spcm);
> > > >                 snd_pcm_direct_timer_stop(direct);
> > > >                 snd_pcm_direct_clear_timer_queue(direct);
> > > >         }
> > > >
> > > > Try to put debug prints or catch via breakpoint whether this code
> > > > path is executed.
> > > >
> > > > Also, does the issue happen with the latest 6.11-rc kernel, too?
> > > > If yes, what if you drop the SNDRV_PCM_INFO_RESUME bit flag on the
> > > > driver side?  Does the problem persist, or does it work?
> > > >
> > >
> > > I'm working on kernel 6.6 and alsa-lib v1.2.11. I don't think that's
> > > very outdated, but I will try switching to the latest version.
> > >
> > > Indeed, I did some debugging on this part. Please see my comments inline.
> > >
> > > int snd_pcm_direct_slave_recover(snd_pcm_direct_t *direct) {
> > >       ...
> > >
> > >       /* [Chancel]
> > >        * When two dmix clients run in parallel we get two direct dmix instances.
> > >        * 1st dmix->spcm->info has SND_PCM_INFO_RESUME flag but 2nd dmix doesn't.
> > 
> > OK, that must be the cause.  It's because the second open copies the saved
> > shmem->s.info into spcm->info at its open time, while we had already dropped
> > the INFO_RESUME bit.  All the rest of the behavior is a side effect of this
> > inconsistency.
> > 
> > I guess dropping the INFO_RESUME bit at hw_params and hw_refine should
> > work instead.  A totally untested fix is below.
> > 
> > (And I believe the drop of INFO_PAUSE should be handled similarly, too,
> > instead of dropping the spcm->info bit there.)
> > 
> > 
> > Takashi
> > 
> > --- a/src/pcm/pcm_direct.c
> > +++ b/src/pcm/pcm_direct.c
> > @@ -1018,6 +1018,7 @@ int snd_pcm_direct_hw_refine(snd_pcm_t *pcm, snd_pcm_hw_params_t *params)
> >         }
> >         dshare->timer_ticks = hw_param_interval(params, SND_PCM_HW_PARAM_PERIOD_SIZE)->max / dshare->slave_period_size;
> >         params->info = dshare->shmptr->s.info;
> > +       params->info &= ~SND_PCM_INFO_RESUME;
> >  #ifdef REFINE_DEBUG
> >         snd_output_puts(log, "DMIX REFINE (end):\n");
> >         snd_pcm_hw_params_dump(params, log);
> > @@ -1031,6 +1032,7 @@ int snd_pcm_direct_hw_params(snd_pcm_t *pcm, snd_pcm_hw_params_t * params)
> >         snd_pcm_direct_t *dmix = pcm->private_data;
> > 
> >         params->info = dmix->shmptr->s.info;
> > +       params->info &= ~SND_PCM_INFO_RESUME;
> >         params->rate_num = dmix->shmptr->s.rate;
> >         params->rate_den = 1;
> >         params->fifo_size = 0;
> > @@ -1183,8 +1185,6 @@ static void save_slave_setting(snd_pcm_direct_t *dmix, snd_pcm_t *spcm)
> >         COPY_SLAVE(buffer_time);
> >         COPY_SLAVE(sample_bits);
> >         COPY_SLAVE(frame_bits);
> > -
> > -       dmix->shmptr->s.info &= ~SND_PCM_INFO_RESUME;
> >  }
> > 
> >  #undef COPY_SLAVE
> 
> Thanks Takashi,
> 
> This patch fixes the issue on my side. In my test, both dmix1->spcm->info and
> dmix2->spcm->info have the SND_PCM_INFO_RESUME flag, and snd_pcm_resume() is
> successfully called by the first resumed instance. I don't understand this patch
> well, though. Do you mean to drop SND_PCM_INFO_RESUME from dmix and keep it in
> the slave pcm?

Yes.  INFO_RESUME is dropped because dmix can't do a full resume due
to the nature of its implementation.  It needs a prepare / restart,
like many other drivers.  So we have to drop the info bit that is
exposed to the outside for apps, while keeping the slave PCM info
intact internally.
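
To make the split concrete, here is a minimal sketch in plain C.  It is
not the alsa-lib code: every demo_* name and the DEMO_INFO_RESUME value
are hypothetical stand-ins, and it only models the info bits, assuming
one shared-memory copy that every secondary client reads at open time.

```c
#include <stdint.h>

/* Hypothetical stand-in for SND_PCM_INFO_RESUME (value is an assumption). */
#define DEMO_INFO_RESUME 0x00040000u

/* Models the settings block shared by all dmix clients. */
struct demo_shm {
	uint32_t info;
};

/* Models one dmix client: its private slave info plus the shared block. */
struct demo_client {
	uint32_t spcm_info;
	struct demo_shm *shm;
};

/* Behavior before the patch: the shared copy loses the RESUME bit. */
static void demo_save_old(struct demo_client *c, uint32_t slave_info)
{
	c->spcm_info = slave_info;
	c->shm->info = slave_info & ~DEMO_INFO_RESUME;
}

/* Behavior after the patch: the shared copy is kept intact. */
static void demo_save_patched(struct demo_client *c, uint32_t slave_info)
{
	c->spcm_info = slave_info;
	c->shm->info = slave_info;
}

/* A secondary client takes its slave info from the shared copy. */
static void demo_copy(struct demo_client *c)
{
	c->spcm_info = c->shm->info;
}

/* What hw_params / hw_refine would report to apps after the patch. */
static uint32_t demo_exposed_info(const struct demo_client *c)
{
	return c->shm->info & ~DEMO_INFO_RESUME;
}

/* The flag test that decides whether recovery resumes the slave PCM. */
static int demo_would_resume_slave(const struct demo_client *c)
{
	return (c->spcm_info & DEMO_INFO_RESUME) != 0;
}
```

With demo_save_old(), a client set up through demo_copy() never passes
the demo_would_resume_slave() check, which is the inconsistency reported
above.  With demo_save_patched(), both clients pass it, while
demo_exposed_info() still hides the bit from apps.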

> BTW, when will this patch be merged to mainline?

Now that the test result is positive, I'm going to submit & merge it later.

thanks,

Takashi