Re: [RFC PATCH] ALSA: compress_offload: introduce passthrough operation mode




On 27. 05. 24 17:35, Pierre-Louis Bossart wrote:
Hi Jaroslav,
did you intend to reply privately?
I don't mind having this thread in public.

I also don't mind, it was just a mistake. Moving to public again. Thanks.


A couple of additional questions below.
Regards,
-Pierre

On 5/27/24 09:57, Jaroslav Kysela wrote:
On 27. 05. 24 16:17, Pierre-Louis Bossart wrote:
Thanks Jaroslav, this is very interesting indeed.
I added a set of comments to clarify the design.

+There is a requirement to expose the audio hardware that accelerates various
+tasks for user space such as sample rate converters, compressed
+stream decoders, etc.

"passthrough" usually means 'no change to data, filter coefficients not
applied' in the audio world.

I am open to any better word here. In this context, "passthrough" means
data passing through the hardware as fast as possible.

I am trying to find a better word.

+This is a description of the API extension for the compress ALSA API which
+is able to handle "tasks" that are not bound to real-time operations
+and allows for the serialization of operations.

not sure what "not bound to real-time operations" means. sample-rate
conversion is probably the most dependent on accurate timing :-)

The meaning is that the data are not queued/processed with real-world
timing constraints like standard audio data for playback or capture; the
throughput is limited just by the hardware speed. I would appreciate a
rewording from a native English speaker, of course.

What happens if one bit of hardware is actually tied to an audio clock?
Would it prevent the use of the API?

No, the conversion will just be slow. In that case, the hardware driver should probably use a standard PCM device for the data output or input instead; it would be more appropriate.

I think the only difference with traditional ALSA is that there's no
concept of xrun or time, but the hardware can implement the processing
however it wants.

There is no such restriction in this design, but I think that we already have interfaces for that case. This is a proposal for an audio acceleration API.

+Requirements
+============
+
+The main requirements are:
+
+- serialization of multiple tasks for user space to allow multiple
+  operations without user space intervention
+
+- separate buffers (input + output) for each operation

I guess we're talking about a "pipeline" where all the buffers are
shared/chained. Mixers and splitters are probably out-of-scope.

But I wonder if the last task could just consume data, or if the first
task could just generate data. The former case would be processing that
analyzes data and generates a 'score' or an event, and the latter would
be some sort of hardware synthesizer.

+
+- expose buffers using mmap to user space

If every buffer is mmap'ed to userspace, what prevents userspace from
interfering?

I think userspace would only be involved at the source and sink of the
processing chain, no?

We are probably talking about the same thing. Each task has its own buffers,
which are exposed to user space. The interference concern is similar to any other API.

What I meant is that if userspace is involved in the middle of a chain,
then it kind of breaks the idea that the processing is propagated as
fast as possible through hardware. I don't really see the point of
having intermediate buffers available to userspace.

There are no intermediate buffers in this API yet, just input and output buffers. You probably have a bigger picture in mind, with stream/data routing through multiple components; that is not the case here (yet). The only advantage of the existing dma-buf interface is that the data can be shared among more drivers and user space simultaneously. The arbitration must be handled elsewhere (it is not set by this simple API).
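To illustrate the dma-buf point with a rough userspace sketch: how the fd is
obtained (from the task ioctls) is an assumption here, but the mmap and
DMA_BUF_IOCTL_SYNC calls below are the standard dma-buf user interface.

#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* map a task buffer exported by the driver as a dma-buf fd */
static void *map_task_buffer(int dmabuf_fd, size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		       dmabuf_fd, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* fill the input buffer, bracketing CPU access for cache maintenance */
static int fill_input(int dmabuf_fd, void *buf, const void *data, size_t len)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
		return -1;
	memcpy(buf, data, len);
	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}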

+The API extension shares device enumeration and parameter handling from
+the main compressed API. All other realtime streaming ioctls are deactivated
+and a new set of task-related ioctls is introduced. The standard
+read/write/mmap I/O operations are not supported in the passthrough device.

The compress API was geared to encoders/decoders. I am not sure how we
would e.g. expose parameters for transcoders (decode-reencode) or even
SRCs?

I expect that the snd_codec structure will be modified (or its usage
will be clarified) for transcoding.

What I meant is that the snd_codec took a lot of time to agree on, and
it was based on standards. I wonder if for a generic API the parameters
can be defined at all. It's probably hardware/vendor specific.

My goal is to start with something simple, and decoding a compressed stream to a PCM stream or the ASRC use case really is simple. Almost all parameters are already available in this structure; it's just a matter of settling the output buffer format.
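As a rough sketch of what I have in mind for the decode case, using only
fields that already exist in struct snd_codec; how the PCM output sample
format is expressed is exactly the open point, so that part is left as a
comment:

#include <string.h>
#include <sound/compress_params.h>

static void setup_mp3_decode(struct snd_codec *codec)
{
	memset(codec, 0, sizeof(*codec));
	codec->id = SND_AUDIOCODEC_MP3;	/* type of the compressed input */
	codec->ch_in = 2;
	codec->ch_out = 2;		/* channels expected on the PCM output */
	codec->sample_rate = 44100;
	codec->bit_rate = 128000;
	/* the PCM output sample format would have to be settled here,
	 * e.g. via codec->format or codec->options */
}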

+CREATE
+------
+Creates a set of input/output buffers. The input buffer size is
+fragment_size. Allocates a unique seqno.
+
+The hardware drivers allocate an internal 'struct dma_buf' for both input
+and output buffers.

for each input and output buffers?

Yes, for one task you need one input and one output buffer. All tasks
are separate (but serialized).
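
Roughly, this is what I imagine CREATE handing back to user space. The
struct and ioctl names below are just placeholders for illustration, not
the proposed UAPI:

#include <linux/types.h>

struct compr_task_example {		/* placeholder name */
	__u64 seqno;			/* unique, assigned by the kernel */
	int input_fd;			/* dma-buf fd, fragment_size bytes */
	int output_fd;			/* dma-buf fd for the result */
};

/*
 * struct compr_task_example t = {0};
 * ioctl(compress_fd, EXAMPLE_TASK_CREATE, &t);
 * in  = mmap(NULL, fragment_size, ..., MAP_SHARED, t.input_fd, 0);
 * out = mmap(NULL, output_size, ..., MAP_SHARED, t.output_fd, 0);
 */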

+STOP
+----
+Stop (dequeues) a task. If seqno is zero, the operation is executed for all
+tasks.

Don't you need a DRAIN?

I would expect that transcoding will be fast, or the application may use
smaller input buffers. We can add a drain operation later, when it is really required.

For a co-processor API, you would want all the input data to be consumed,
and the stop to happen only when all the resulting data has been provided
in the output buffers.

And presumably when the input task is stopped, the state changes are
propagated to the next task by the framework? Or is userspace supposed
to track each and every task and change their state?

I also wonder if the state for a task should reflect that it's waiting
on data on its input, or conversely is blocked because the output
buffers were not consumed? Dealing with SRCs, encoders or decoders means
that the buffers are going to be used at vastly different rates on the
input and output sides.

I would suggest studying the proposed sources. The allocated buffer
sizes may differ from the actually used areas. User space is able to
pass a lower number of filled bytes than allocated. Also, the driver
must ensure that the output buffer is large enough to store the result
for the allocated input buffer size.

The START operation means that user space has filled all input bytes for the
conversion.

I have a doubt here. Is the intent to fill a bunch of buffers, then wait
for the output to be ready?

Or can userspace write additional buffers while the output is being
created (similar to what we do for ALSA today)?

Take the SRC for example: userspace provides 1024 samples and gets the
converted output. Now userspace provides a second 1024-sample buffer.
That second conversion needs to use the history of the first processing.
In other words, when a task is finished, it should not reset its
internal context, there is a history buffer that should only be cleared
if the input is a completely different track or content.

Yes, I would expect that the hardware driver will keep this state until the first STOP command is received. STOP will create a discontinuity. Then the conversion will start again when more buffers are queued. User space can stop all tasks together, too.
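So the flow I have in mind looks roughly like this; the ioctl names are
placeholders again, and the helper functions are made up only to show the
ordering:

/* queue work as long as input is available; the kernel serializes tasks */
while (have_more_input()) {
	struct compr_task_example *t = next_free_task();

	fill_input_buffer(t);				/* via mmap of t->input_fd */
	ioctl(compress_fd, EXAMPLE_TASK_START, t);	/* all input bytes are in */

	t = next_finished_task();			/* e.g. after poll() */
	ioctl(compress_fd, EXAMPLE_TASK_STATUS, t);	/* real output size is set */
	consume_output(t);				/* via mmap of t->output_fd */
}

/* seqno == 0: stop all tasks; this is the discontinuity point where the
 * decoder/SRC history is dropped before different content is queued */
ioctl(compress_fd, EXAMPLE_TASK_STOP, &stop_all);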

+STATUS
+------
+Obtain the task status (active, finished). Also, the driver will set
+the real output data size (valid area in the output buffer).

Is this assuming that the entire input buffer has valid data?
There could be cases where the buffers are made of variable-length
'frames'; it would be interesting to send such partial buffers to
hardware. That's always been a problem with the existing compressed API:
we couldn't deal with buffers that were partially filled.

I expect that the initial handshake (params/metadata) should set all
parameters for the driver to determine the right (maximal) output buffer
size.

I wasn't talking about max sizes, but really that each buffer passed by
userspace might contain a different number of valid bytes. In ALSA, all
the bytes are assumed to follow each other; we can't tell that period N
has 200 valid bytes and period N+1 has 106. That would be a great
addition if we could lose the concept of a 'ring buffer' and instead
deal with independent buffers.

The proposed task works with independent buffers, which may be reused to queue the next data once the output (result) is consumed. The sizes of the used areas in those buffers may be different for each task (variable data chunks).
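In other words, something like this should be possible (field names are
placeholders; the point is only that the valid sizes are per task,
independent of the allocated fragment_size):

task[0].input_size = 200;	/* only 200 bytes of the fragment are valid */
ioctl(compress_fd, EXAMPLE_TASK_START, &task[0]);

task[1].input_size = 106;	/* the next chunk has a different valid length */
ioctl(compress_fd, EXAMPLE_TASK_START, &task[1]);

ioctl(compress_fd, EXAMPLE_TASK_STATUS, &task[0]);
/* task[0].output_size now holds the bytes produced for the first chunk */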

					Jaroslav

--
Jaroslav Kysela <perex@xxxxxxxx>
Linux Sound Maintainer; ALSA Project; Red Hat, Inc.




