Re: [RFCv1 PATCH 00/32] Core and vb2 enhancements

Hans de Goede <hdegoede@xxxxxxxxxx> · Tue, 12 Jun 2012 15:24:18 +0200

Hi,

On 06/12/2012 01:35 PM, Mauro Carvalho Chehab wrote:
Em 10-06-2012 16:27, Hans Verkuil escreveu:
On Sun June 10 2012 19:32:36 Hans Verkuil wrote:
On Sun June 10 2012 18:46:52 Mauro Carvalho Chehab wrote:
3) it would be interesting if you could benchmark the previous code and the new
one, to see what gains this change introduced, in terms of v4l2-core footprint and
performance.

I'll try that, should be interesting. Actually, my prediction is that I won't notice any
difference. Today's CPUs are so fast that the overhead of the switch is probably hard to
measure.

I did some tests, calling various ioctls 100,000,000 times. The actual call into the
driver was disabled so that I only measure the time spent in v4l2-ioctl.c.

I ran the test program with 'time ./t' and measured the sys time.
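
A minimal sketch of what such a test loop might look like; the actual test program is not shown in this thread. The helper names sys_seconds() and bench_ioctl() are hypothetical, and the sketch reads the system time with getrusage() from inside the process instead of running the whole program under 'time'.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Seconds of system (kernel) CPU time consumed by this process so far. */
static double sys_seconds(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);

    assert(rc == 0); /* not expected to fail for RUSAGE_SELF and a valid pointer */
    (void)rc;
    return (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
}

/*
 * Hypothetical helper: issue the same ioctl `iterations` times and return
 * the system time that took, in seconds. The ioctl result is deliberately
 * ignored, since only the time spent in the kernel is of interest here.
 */
static double bench_ioctl(int fd, unsigned long request, void *arg, long iterations)
{
    double start = sys_seconds();

    for (long i = 0; i < iterations; i++)
        (void)ioctl(fd, request, arg);

    return sys_seconds() - start;
}
```

With fd opened on a V4L2 device node and cap declared as a struct v4l2_capability (from <linux/videodev2.h>), one run would be bench_ioctl(fd, VIDIOC_QUERYCAP, &cap, 100000000L).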

For each ioctl I tested 5 times and averaged the results. Times are in seconds.

ioctl                Old      New
QUERYCAP             24.86    24.37
UNSUBSCRIBE_EVENT    23.40    23.10
LOG_STATUS           18.84    18.76
ENUMINPUT            28.82    28.90

Particularly for QUERYCAP and UNSUBSCRIBE_EVENT I found a small but reproducible
improvement in speed. The results for LOG_STATUS and ENUMINPUT are too close to
call.

After looking at the assembly code that the old code produces I suspect (but it
is hard to be sure) that LOG_STATUS and ENUMINPUT are tested quite early on, whereas
QUERYCAP and UNSUBSCRIBE_EVENT are tested quite late. The order in which the compiler
tests definitely has no relationship with the order of the case statements in the
switch.

The ioctls are reordered because gcc optimizes the switch into a tree search and tries to avoid
cache flushes. The worst case is likely converted into 7 CMP instructions (log2(128)).

In your code, gcc may not be able to predict the JMPs, so it may actually have cache flushes,
depending on the cache size, and on whether the caller functions are before or after the video_ioctl2
handler.
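
To make the two dispatch styles under discussion concrete, here is a rough sketch. It assumes the new code looks handlers up in a table indexed by command number, which this thread implies but does not show. The command numbers, handler names and return values are all hypothetical, not the real VIDIOC_* definitions or the code from the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command numbers, for illustration only. */
enum { CMD_QUERYCAP = 0, CMD_ENUMINPUT = 1, CMD_LOG_STATUS = 2, CMD_COUNT = 3 };

/* Hypothetical handlers; each returns a distinct value so calls can be told apart. */
static int do_querycap(void *arg)   { (void)arg; return 10; }
static int do_enuminput(void *arg)  { (void)arg; return 20; }
static int do_log_status(void *arg) { (void)arg; return 30; }

/* Style 1: one large switch. The compiler decides how the cases are tested. */
static int dispatch_switch(unsigned int nr, void *arg)
{
    switch (nr) {
    case CMD_QUERYCAP:   return do_querycap(arg);
    case CMD_ENUMINPUT:  return do_enuminput(arg);
    case CMD_LOG_STATUS: return do_log_status(arg);
    default:             return -1; /* unknown command */
    }
}

/* Style 2: a table indexed by command number, then an indirect call. */
typedef int (*handler_fn)(void *arg);

static const handler_fn handlers[CMD_COUNT] = {
    [CMD_QUERYCAP]   = do_querycap,
    [CMD_ENUMINPUT]  = do_enuminput,
    [CMD_LOG_STATUS] = do_log_status,
};

static int dispatch_table(unsigned int nr, void *arg)
{
    if (nr >= CMD_COUNT || handlers[nr] == NULL)
        return -1; /* unknown command */
    return handlers[nr](arg);
}

/* Sanity check: both styles agree for every command number, known or not. */
static void check_dispatch_agrees(void)
{
    for (unsigned int nr = 0; nr < CMD_COUNT + 2; nr++)
        assert(dispatch_switch(nr, NULL) == dispatch_table(nr, NULL));
}
```

The switch leaves the comparison order to the compiler, while the table version does one bounds check, one load and one indirect call regardless of which command is requested.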

I suspect that, if you compare the code with debug enabled, the new code can actually be worse
than the previous one.

It would be good if you could test what happens with QBUF/DQBUF.

This would certainly explain what I am seeing. I'm actually a bit surprised that
this is measurable at all.

The timing difference is not significant, especially because those ioctls aren't the ones
used inside the streaming loop. The only ioctls that are more time-sensitive are the streaming
ones, especially QBUF/DQBUF.

Even QBUF / DQBUF are called max circa 100 times / second. I think Hans V's patchset should not
be seen from a performance pov (other than that it should not cause performance regressions), but
more as a nice code cleanup / simplification.
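
To put rough numbers on that, here is a back-of-the-envelope sketch using the QUERYCAP figures quoted above. It assumes QBUF / DQBUF cost about as much per call as QUERYCAP did in that test, which nobody measured in this thread; the macro and function names are made up for the example.

```c
#include <assert.h>

/* Figures taken from the measurements quoted in this thread. */
#define BENCH_CALLS    100000000.0 /* ioctl calls per test run */
#define QUERYCAP_OLD_S 24.86       /* sys time in seconds, old code */
#define QUERYCAP_NEW_S 24.37       /* sys time in seconds, new code */
#define CALLS_PER_SEC  100.0       /* "max circa 100 times / second" */

/* Average system time per call, in nanoseconds, for one benchmark run. */
static double ns_per_call(double total_seconds)
{
    assert(total_seconds >= 0.0);
    return total_seconds / BENCH_CALLS * 1e9;
}

/* Fraction of one CPU consumed when calling at the given rate and per-call cost. */
static double cpu_fraction(double ns_each, double calls_per_second)
{
    return ns_each * calls_per_second / 1e9;
}
```

ns_per_call(QUERYCAP_OLD_S) comes to about 248.6 ns and the old/new difference to about 4.9 ns per call. At 100 calls per second, cpu_fraction() puts the whole per-call cost at roughly 0.0025% of one CPU, so a 4.9 ns change per call is far below anything a streaming application would notice.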

It certainly makes things a lot more readable by avoiding a lot of code duplication. Not sure if
in the end it actually saves any lines of code, but readability, and being able to understand the
intent of the code is key here IMHO.

Regards,

Hans (who likes Hans V's patchset :)
