Re: [PATCH] v4l2-dev: Add tracepoints for QBUF and DQBUF

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Hans,

Hans Verkuil wrote:
> On 12/11/13 13:44, Mauro Carvalho Chehab wrote:
>> Em Wed, 11 Dec 2013 12:53:55 +0100
>> Hans Verkuil <hverkuil@xxxxxxxxx> escreveu:
>>
>>> Hi Mauro,
>>>
>>> On 12/11/13 11:44, Mauro Carvalho Chehab wrote:
>>>> Hans,
>>>>
>>>> Em Wed, 11 Dec 2013 08:29:58 +0100
>>>> Hans Verkuil <hverkuil@xxxxxxxxx> escreveu:
>>>>
>>>>> On 12/10/2013 09:53 PM, Mauro Carvalho Chehab wrote:
>>>>>> Em Sat, 23 Nov 2013 17:54:48 +0100
>>>>>> Hans Verkuil <hverkuil@xxxxxxxxx> escreveu:
>>>>>>
>>>>>>> On 11/23/2013 05:30 PM, Sylwester Nawrocki wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 11/23/2013 12:25 PM, Hans Verkuil wrote:
>>>>>>>>> Hi Wade,
>>>>>>>>>
>>>>>>>>> On 11/22/2013 08:48 PM, Wade Farnsworth wrote:
>>>>>>>>>> Add tracepoints to the QBUF and DQBUF ioctls to enable rudimentary
>>>>>>>>>> performance measurements using standard kernel tracers.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Wade Farnsworth<wade_farnsworth@xxxxxxxxxx>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> This is the update to the RFC patch I posted a few weeks back.  I've added
>>>>>>>>>> several bits of metadata to the tracepoint output per Mauro's suggestion.
>>>>>>>>>
>>>>>>>>> I don't like this. All v4l2 ioctls can already be traced by doing e.g.
>>>>>>>>> echo 1 (or echo 2)>/sys/class/video4linux/video0/debug.
>>>>>>>>>
>>>>>>>>> So this code basically duplicates that functionality. It would be nice to be able
>>>>>>>>> to tie in the existing tracing code (v4l2-ioctl.c) into tracepoints.
>>>>>>>>
>>>>>>>> I think it would be really nice to have this kind of support for standard
>>>>>>>> traces at the v4l2 subsystem. Presumably it could even gradually replace
>>>>>>>> the v4l2 custom debug infrastructure.
>>>>>>>>
>>>>>>>> If I understand things correctly, the current tracing/profiling 
>>>>>>>> infrastructure
>>>>>>>> is much less invasive than inserting printks all over, which may cause 
>>>>>>>> changes
>>>>>>>> in control flow. I doubt the system could be reliably profiled by 
>>>>>>>> enabling all
>>>>>>>> those debug prints.
>>>>>>>>
>>>>>>>> So my vote would be to add support for standard tracers, like in other
>>>>>>>> subsystems in the kernel.
>>>>>>>
>>>>>>> The reason for the current system is to trace which ioctls are called in
>>>>>>> what order by a misbehaving application. It's very useful for that,
>>>>>>> especially when trying to debug user problems.
>>>>>>>
>>>>>>> I don't mind switching to tracepoints as long as this functionality is
>>>>>>> kept one way or another.
>>>>>>
>>>>>> I agree with Sylwester: we should move to tracepoints, and this is a good
>>>>>> start.
>>>>>
>>>>> As I mentioned, I don't mind switching to tracepoints, but not in the way the
>>>>> current patch does it. I certainly don't agree with you merging this patch
>>>>> as-is without further discussion.
>>>>
>>>> Let's not mix the subjects here. There are two different demands that can be
>>>> either fulfilled by the same solution or by different ones:
>>>> 	1) To have a way to enable debugging all parameters passed from/to
>>>> userspace;
>>>> 	2) To be able to measure the driver (and core) performance while
>>>> streaming.
>>>>
>>>> This patch's scope is to solve (2), by allowing userspace to hook the two
>>>> ioctls that queues/dequeues streaming buffers.
>>>>
>>>> With that regards, what's wrong on this patch? I couldn't see anything bad
>>>> there. It is not meant to solve (1).
>>>
>>> I have two problems with it: 
>>>
>>> 1) Right now it is just checking for two ioctls, but as soon as we add more
>>> tracepoints you get another switch on the ioctl command, something I've tried
>>> to avoid by creating the table. So a table-oriented implementation would be
>>> much preferred.
>>
>> From performance measurement PoV, I don't see any reason to add it to any
>> other ioctl.
> 
> But perhaps this makes sense as well for read() and write()? Possibly poll()?
> 
>> Of course if we revisit it and other traces become needed,
>> then we should use a table approach instead.
>>
>>>
>>> 2) It duplicates the already existing code to dump the v4l2_buffer struct.
>>> Is there a way to have just one copy of that? I was hoping Wade could look
>>> into that, and if not, then it is something I can explore (although finding
>>> time is an issue).
>>
>> Having implemented tracepoints myself, I'd say that the answer is no.
>>
>> Let me give you an example.
>>
>> The ras:aer_event trace is a simple tracepoint that also uses the 
>> __print_flags() macro. It is implemented at include/trace/events/ras.h
>> as:
>>
>> #define aer_correctable_errors		\
>> 	{BIT(0),	"Receiver Error"},		\
>> 	{BIT(6),	"Bad TLP"},			\
>> 	{BIT(7),	"Bad DLLP"},			\
>> 	{BIT(8),	"RELAY_NUM Rollover"},		\
>> 	{BIT(12),	"Replay Timer Timeout"},	\
>> 	{BIT(13),	"Advisory Non-Fatal"}
>>
>> #define aer_uncorrectable_errors		\
>> 	{BIT(4),	"Data Link Protocol"},		\
>> 	{BIT(12),	"Poisoned TLP"},		\
>> 	{BIT(13),	"Flow Control Protocol"},	\
>> 	{BIT(14),	"Completion Timeout"},		\
>> 	{BIT(15),	"Completer Abort"},		\
>> 	{BIT(16),	"Unexpected Completion"},	\
>> 	{BIT(17),	"Receiver Overflow"},		\
>> 	{BIT(18),	"Malformed TLP"},		\
>> 	{BIT(19),	"ECRC"},			\
>> 	{BIT(20),	"Unsupported Request"}
>>
>> TRACE_EVENT(aer_event,
>> 	TP_PROTO(const char *dev_name,
>> 		 const u32 status,
>> 		 const u8 severity),
>>
>> 	TP_ARGS(dev_name, status, severity),
>>
>> 	TP_STRUCT__entry(
>> 		__string(	dev_name,	dev_name	)
>> 		__field(	u32,		status		)
>> 		__field(	u8,		severity	)
>> 	),
>>
>> 	TP_fast_assign(
>> 		__assign_str(dev_name, dev_name);
>> 		__entry->status		= status;
>> 		__entry->severity	= severity;
>> 	),
>>
>> 	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
>> 		__get_str(dev_name),
>> 		__entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
>> 			__entry->severity == HW_EVENT_ERR_FATAL ?
>> 			"Fatal" : "Uncorrected",
>> 		__entry->severity == HW_EVENT_ERR_CORRECTED ?
>> 		__print_flags(__entry->status, "|", aer_correctable_errors) :
>> 		__print_flags(__entry->status, "|", aer_uncorrectable_errors))
>>
>>
>> Tracepoints work by having a binary ringbuffer where the data at TP_STRUCT
>> is passed. In this case, it passes a string, an u32 and an u8.
>>
>> The TP_printk() macro actually produces a format filenode, describing
>> each field that would be filled in the ringbuffer and how to convert the
>> data structure from binary to text:
>>
>> # cat /sys/kernel/debug/tracing/events/ras/aer_event/format 
>> name: aer_event
>> ID: 877
>> format:
>> 	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
>> 	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
>> 	field:unsigned char common_preempt_count;	offset:3;	size:1;signed:0;
>> 	field:int common_pid;	offset:4;	size:4;	signed:1;
>>
>> 	field:__data_loc char[] dev_name;	offset:8;	size:4;	signed:1;
>> 	field:u32 status;	offset:12;	size:4;	signed:0;
>> 	field:u8 severity;	offset:16;	size:1;	signed:0;
>>
>> print fmt: "%s PCIe Bus Error: severity=%s, %s
>> ", __get_str(dev_name), REC->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" : REC->severity == HW_EVENT_ERR_FATAL ? "Fatal" : "Uncorrected", REC->severity == HW_EVENT_ERR_CORRECTED ? __print_flags(REC->status, "|", {(1UL << (0)), "Receiver Error"}, {(1UL << (6)), "Bad TLP"}, {(1UL << (7)), "Bad DLLP"}, {(1UL << (8)), "RELAY_NUM Rollover"}, {(1UL << (12)), "Replay Timer Timeout"}, {(1UL << (13)), "Advisory Non-Fatal"}) : __print_flags(REC->status, "|", {(1UL << (4)), "Data Link Protocol"}, {(1UL << (12)), "Poisoned TLP"}, {(1UL << (13)), "Flow Control Protocol"}, {(1UL << (14)), "Completion Timeout"}, {(1UL << (15)), "Completer Abort"}, {(1UL << (16)), "Unexpected Completion"}, {(1UL << (17)), "Receiver Overflow"}, {(1UL << (18)), "Malformed TLP"}, {(1UL << (19)), "ECRC"}, {(1UL << (20)), "Unsupported Request"})
>>
>> You can see that _print_flags() actually produced a lot of code inside "print fmt".
>>
>> When someone reads from /sys/kernel/debug/tracing/trace, the print fmt is
>> handled by the Kernel.
>>
>> However, when someone uses a performance tool like perf, the binary blobs
>> are read from the per-CPU raw trace filenodes:
>> 	/sys/kernel/debug/tracing/per_cpu/cpu*/trace_pipe_raw
>>
>> And the "print fmt" line at the format file is actually used as a hint
>> for tracing userspace tool to handle it that can be used or not.
>> For example, at the rasdaemon, I'm overriding all default handlers,
>> using my own handler, as I want to store all events on a sqlite database:
>>
>> 	https://git.fedorahosted.org/cgit/rasdaemon.git/tree/ras-aer-handler.c
> 
> OK, fair enough. So that code can't be shared. Good to know.
> 
>>
>>>
>>> Bottom line as far as I am concerned: it was merged while I have outstanding
>>> questions.
>>
>> Well, only on this email you're letting clear what are your concerns.
>>
>>> If there are good technical reasons why the two problems stated above can't
>>> be easily resolved, then I will reconsider, but for now I think it is too early
>>> to merge. For the record, I don't think I mentioned the first problem in my
>>> original reply, it's something I noticed yesterday while looking at the patch
>>> again.
>>>
>>>>
>>>> Before this patch, all we have is (1), as a non-dynamic printk is too
>>>> intrusive to be used for performance measurement. So, there's no way to
>>>> measure how much time a buffer takes to be filled.
>>>
>>> Having tracepoints here doesn't tell you anything about that. At best it gives
>>> you information about the jitter.
>>>
>>> How are these tracepoints supposed to be used? What information will they
>>> provide? Are tracepoints expected to be limited to QBUF/DQBUF? Or other
>>> ioctls/fops as well? If so, which ones?
>>
>> As shown above, a performance tool can add handlers for tracepoint data,
>> allowing to discover, for example, if dequeue occurred on a different order
>> than queue, associating each dequeued buffer to the corresponding queued
>> one, and to measure the time between each queue/dequeue pairs.
> 
> I'm not really convinced about it. Which is why I would like to know a bit more
> how Wade uses it.

Put simply, we have an internal tool that can do some analysis of LTTng
formatted traces, and we'd like to use it to do some very simple
performance analyses of some V4L2 devices.  Our analyses really only
need to check queue/dequeue events, primarily for the reason Mauro
stated, measuring the time between such pairs.  Therefore, this is in no
way meant to be an exhaustive implementation, but it certainly could be
a starting point for such.

> 
> Also: the tracepoint is placed outside the lock which will make any time
> measurements even less reliable than they already are. Also, putting it
> inside the lock will ensure that the traces are generated in the order
> that they are serialized by the lock. In addition, the error code of the
> ioctl isn't traced either. Perhaps the traces should only be done if
> ret == 0 anyway?

I fully admit that my implementation may not be 100% correct, which is
part of the reason why I posted this as an RFC in the first place.

Moving this code inside the lock and adding the return value to the
trace should not be a problem.  If there are other specific things I
need to change to improve the patch, I would happy to do so.

Thanks for the feedback.

Regards,

Wade
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Input]     [Video for Linux]     [Gstreamer Embedded]     [Mplayer Users]     [Linux USB Devel]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [Yosemite Backpacking]
  Powered by Linux