On 12/11/2013 03:53 PM, Wade Farnsworth wrote:
> Hi Hans,
>
> Hans Verkuil wrote:
>> On 12/11/13 13:44, Mauro Carvalho Chehab wrote:
>>> On Wed, 11 Dec 2013 12:53:55 +0100,
>>> Hans Verkuil <hverkuil@xxxxxxxxx> wrote:
>>>
>>>> Hi Mauro,
>>>>
>>>> On 12/11/13 11:44, Mauro Carvalho Chehab wrote:
>>>>> Hans,
>>>>>
>>>>> On Wed, 11 Dec 2013 08:29:58 +0100,
>>>>> Hans Verkuil <hverkuil@xxxxxxxxx> wrote:
>>>>>
>>>>>> On 12/10/2013 09:53 PM, Mauro Carvalho Chehab wrote:
>>>>>>> On Sat, 23 Nov 2013 17:54:48 +0100,
>>>>>>> Hans Verkuil <hverkuil@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On 11/23/2013 05:30 PM, Sylwester Nawrocki wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 11/23/2013 12:25 PM, Hans Verkuil wrote:
>>>>>>>>>> Hi Wade,
>>>>>>>>>>
>>>>>>>>>> On 11/22/2013 08:48 PM, Wade Farnsworth wrote:
>>>>>>>>>>> Add tracepoints to the QBUF and DQBUF ioctls to enable rudimentary
>>>>>>>>>>> performance measurements using standard kernel tracers.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Wade Farnsworth <wade_farnsworth@xxxxxxxxxx>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> This is the update to the RFC patch I posted a few weeks back. I've
>>>>>>>>>>> added several bits of metadata to the tracepoint output per Mauro's
>>>>>>>>>>> suggestion.
>>>>>>>>>>
>>>>>>>>>> I don't like this. All v4l2 ioctls can already be traced by doing e.g.
>>>>>>>>>> echo 1 (or echo 2) > /sys/class/video4linux/video0/debug.
>>>>>>>>>>
>>>>>>>>>> So this code basically duplicates that functionality. It would be nice
>>>>>>>>>> to be able to tie the existing tracing code (v4l2-ioctl.c) into
>>>>>>>>>> tracepoints.
>>>>>>>>>
>>>>>>>>> I think it would be really nice to have this kind of support for
>>>>>>>>> standard tracing in the v4l2 subsystem. Presumably it could even
>>>>>>>>> gradually replace the custom v4l2 debug infrastructure.
>>>>>>>>>
>>>>>>>>> If I understand things correctly, the current tracing/profiling
>>>>>>>>> infrastructure is much less invasive than inserting printks all over,
>>>>>>>>> which may cause changes in control flow. I doubt the system could be
>>>>>>>>> reliably profiled by enabling all those debug prints.
>>>>>>>>>
>>>>>>>>> So my vote would be to add support for standard tracers, as in other
>>>>>>>>> subsystems in the kernel.
>>>>>>>>
>>>>>>>> The reason for the current system is to trace which ioctls are called in
>>>>>>>> what order by a misbehaving application. It's very useful for that,
>>>>>>>> especially when trying to debug user problems.
>>>>>>>>
>>>>>>>> I don't mind switching to tracepoints as long as this functionality is
>>>>>>>> kept one way or another.
>>>>>>>
>>>>>>> I agree with Sylwester: we should move to tracepoints, and this is a good
>>>>>>> start.
>>>>>>
>>>>>> As I mentioned, I don't mind switching to tracepoints, but not in the way
>>>>>> the current patch does it. I certainly don't agree with you merging this
>>>>>> patch as-is without further discussion.
>>>>>
>>>>> Let's not mix the subjects here. There are two different demands that can
>>>>> be fulfilled either by the same solution or by different ones:
>>>>> 1) To have a way to enable debugging of all parameters passed from/to
>>>>>    userspace;
>>>>> 2) To be able to measure the driver (and core) performance while
>>>>>    streaming.
>>>>>
>>>>> This patch's scope is to solve (2), by allowing userspace to hook the two
>>>>> ioctls that queue/dequeue streaming buffers.
>>>>>
>>>>> In that regard, what's wrong with this patch? I couldn't see anything bad
>>>>> there. It is not meant to solve (1).
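To make (2) concrete: once the queue and dequeue events carry the buffer index
and a timestamp, a consumer can pair them per index and report how long each
buffer was owned by the driver. A minimal user-space sketch with made-up names,
independent of the patch itself, assuming the events have already been parsed
out of the tracer:

/*
 * Hypothetical sketch: pair QBUF/DQBUF events per buffer index and print
 * the time each buffer spent queued.  Timestamps are in nanoseconds.
 */
#include <stdio.h>
#include <stdint.h>

#define MAX_BUFFERS 32

struct vb_event {
        int is_queue;           /* 1 = QBUF, 0 = DQBUF */
        unsigned int index;     /* v4l2_buffer.index */
        uint64_t ts_ns;         /* event timestamp */
};

static void report_latencies(const struct vb_event *ev, int n)
{
        uint64_t queued_at[MAX_BUFFERS] = { 0 };

        for (int i = 0; i < n; i++) {
                if (ev[i].index >= MAX_BUFFERS)
                        continue;
                if (ev[i].is_queue) {
                        queued_at[ev[i].index] = ev[i].ts_ns;
                } else if (queued_at[ev[i].index]) {
                        printf("buffer %u: %llu us between QBUF and DQBUF\n",
                               ev[i].index,
                               (unsigned long long)(ev[i].ts_ns -
                                        queued_at[ev[i].index]) / 1000);
                        queued_at[ev[i].index] = 0;
                }
        }
}

int main(void)
{
        /* Stand-in data; a real tool would read these from the tracer. */
        const struct vb_event ev[] = {
                { 1, 0,  1000000 }, { 1, 1,  1200000 },
                { 0, 0, 34000000 }, { 0, 1, 35100000 },
        };

        report_latencies(ev, 4);
        return 0;
}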
>>>>
>>>> I have two problems with it:
>>>>
>>>> 1) Right now it is just checking for two ioctls, but as soon as we add more
>>>> tracepoints you get another switch on the ioctl command, something I've
>>>> tried to avoid by creating the table. So a table-oriented implementation
>>>> would be much preferred.
>>>
>>> From a performance-measurement PoV, I don't see any reason to add it to any
>>> other ioctl.
>>
>> But perhaps this makes sense as well for read() and write()? Possibly poll()?
>>
>>> Of course, if we revisit it and other traces become needed, then we should
>>> use a table approach instead.
>>>
>>>>
>>>> 2) It duplicates the already existing code to dump the v4l2_buffer struct.
>>>> Is there a way to have just one copy of that? I was hoping Wade could look
>>>> into that, and if not, then it is something I can explore (although finding
>>>> time is an issue).
>>>
>>> Having implemented tracepoints myself, I'd say that the answer is no.
>>>
>>> Let me give you an example.
>>>
>>> The ras:aer_event trace is a simple tracepoint that also uses the
>>> __print_flags() macro. It is implemented at include/trace/events/ras.h as:
>>>
>>> #define aer_correctable_errors                  \
>>>         {BIT(0),  "Receiver Error"},            \
>>>         {BIT(6),  "Bad TLP"},                   \
>>>         {BIT(7),  "Bad DLLP"},                  \
>>>         {BIT(8),  "RELAY_NUM Rollover"},        \
>>>         {BIT(12), "Replay Timer Timeout"},      \
>>>         {BIT(13), "Advisory Non-Fatal"}
>>>
>>> #define aer_uncorrectable_errors                \
>>>         {BIT(4),  "Data Link Protocol"},        \
>>>         {BIT(12), "Poisoned TLP"},              \
>>>         {BIT(13), "Flow Control Protocol"},     \
>>>         {BIT(14), "Completion Timeout"},        \
>>>         {BIT(15), "Completer Abort"},           \
>>>         {BIT(16), "Unexpected Completion"},     \
>>>         {BIT(17), "Receiver Overflow"},         \
>>>         {BIT(18), "Malformed TLP"},             \
>>>         {BIT(19), "ECRC"},                      \
>>>         {BIT(20), "Unsupported Request"}
>>>
>>> TRACE_EVENT(aer_event,
>>>         TP_PROTO(const char *dev_name,
>>>                  const u32 status,
>>>                  const u8 severity),
>>>
>>>         TP_ARGS(dev_name, status, severity),
>>>
>>>         TP_STRUCT__entry(
>>>                 __string(dev_name, dev_name)
>>>                 __field(u32, status)
>>>                 __field(u8, severity)
>>>         ),
>>>
>>>         TP_fast_assign(
>>>                 __assign_str(dev_name, dev_name);
>>>                 __entry->status   = status;
>>>                 __entry->severity = severity;
>>>         ),
>>>
>>>         TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
>>>                 __get_str(dev_name),
>>>                 __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
>>>                         __entry->severity == HW_EVENT_ERR_FATAL ?
>>>                         "Fatal" : "Uncorrected",
>>>                 __entry->severity == HW_EVENT_ERR_CORRECTED ?
>>>                         __print_flags(__entry->status, "|", aer_correctable_errors) :
>>>                         __print_flags(__entry->status, "|", aer_uncorrectable_errors))
>>> );
>>>
>>> Tracepoints work by having a binary ring buffer where the data described at
>>> TP_STRUCT__entry is passed. In this case, it passes a string, a u32 and a u8.
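For comparison, a v4l2 queue/dequeue tracepoint following the same pattern
could share a single event class for QBUF and DQBUF. The following is only a
rough sketch with hypothetical names, not the patch under discussion:

/* Hypothetical include/trace/events/v4l2.h-style sketch. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM v4l2

#if !defined(_TRACE_V4L2_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_V4L2_H

#include <linux/tracepoint.h>
#include <linux/videodev2.h>

DECLARE_EVENT_CLASS(v4l2_buffer_class,
        TP_PROTO(int minor, const struct v4l2_buffer *buf),

        TP_ARGS(minor, buf),

        TP_STRUCT__entry(
                __field(int, minor)
                __field(u32, index)
                __field(u32, type)
                __field(u32, bytesused)
                __field(u32, flags)
                __field(u32, sequence)
        ),

        TP_fast_assign(
                __entry->minor     = minor;
                __entry->index     = buf->index;
                __entry->type      = buf->type;
                __entry->bytesused = buf->bytesused;
                __entry->flags     = buf->flags;
                __entry->sequence  = buf->sequence;
        ),

        TP_printk("minor=%d index=%u type=%u bytesused=%u flags=0x%08x sequence=%u",
                  __entry->minor, __entry->index, __entry->type,
                  __entry->bytesused, __entry->flags, __entry->sequence)
);

DEFINE_EVENT(v4l2_buffer_class, v4l2_qbuf,
        TP_PROTO(int minor, const struct v4l2_buffer *buf),
        TP_ARGS(minor, buf)
);

DEFINE_EVENT(v4l2_buffer_class, v4l2_dqbuf,
        TP_PROTO(int minor, const struct v4l2_buffer *buf),
        TP_ARGS(minor, buf)
);

#endif /* _TRACE_V4L2_H */

/* This part must be outside protection */
#include <trace/define_trace.h>

Sharing one event class keeps the v4l2_buffer dumping logic in a single place
at least on the tracepoint side, even if it cannot reuse the v4l2-ioctl.c code.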
>>>
>>> The TP_printk() macro actually produces a format file, describing each
>>> field that will be filled in the ring buffer and how to convert the data
>>> structure from binary to text:
>>>
>>> # cat /sys/kernel/debug/tracing/events/ras/aer_event/format
>>> name: aer_event
>>> ID: 877
>>> format:
>>>         field:unsigned short common_type;       offset:0;  size:2; signed:0;
>>>         field:unsigned char common_flags;       offset:2;  size:1; signed:0;
>>>         field:unsigned char common_preempt_count;  offset:3;  size:1; signed:0;
>>>         field:int common_pid;                   offset:4;  size:4; signed:1;
>>>
>>>         field:__data_loc char[] dev_name;       offset:8;  size:4; signed:1;
>>>         field:u32 status;                       offset:12; size:4; signed:0;
>>>         field:u8 severity;                      offset:16; size:1; signed:0;
>>>
>>> print fmt: "%s PCIe Bus Error: severity=%s, %s
>>> ", __get_str(dev_name), REC->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" : REC->severity == HW_EVENT_ERR_FATAL ? "Fatal" : "Uncorrected", REC->severity == HW_EVENT_ERR_CORRECTED ? __print_flags(REC->status, "|", {(1UL << (0)), "Receiver Error"}, {(1UL << (6)), "Bad TLP"}, {(1UL << (7)), "Bad DLLP"}, {(1UL << (8)), "RELAY_NUM Rollover"}, {(1UL << (12)), "Replay Timer Timeout"}, {(1UL << (13)), "Advisory Non-Fatal"}) : __print_flags(REC->status, "|", {(1UL << (4)), "Data Link Protocol"}, {(1UL << (12)), "Poisoned TLP"}, {(1UL << (13)), "Flow Control Protocol"}, {(1UL << (14)), "Completion Timeout"}, {(1UL << (15)), "Completer Abort"}, {(1UL << (16)), "Unexpected Completion"}, {(1UL << (17)), "Receiver Overflow"}, {(1UL << (18)), "Malformed TLP"}, {(1UL << (19)), "ECRC"}, {(1UL << (20)), "Unsupported Request"})
>>>
>>> You can see that __print_flags() actually produced a lot of code inside
>>> "print fmt".
>>>
>>> When someone reads from /sys/kernel/debug/tracing/trace, the print fmt is
>>> handled by the kernel.
>>>
>>> However, when someone uses a performance tool like perf, the binary blobs
>>> are read from the per-CPU raw trace files:
>>>         /sys/kernel/debug/tracing/per_cpu/cpu*/trace_pipe_raw
>>>
>>> The "print fmt" line in the format file is then just a hint for the
>>> userspace tracing tool, which may or may not use it. For example, in
>>> rasdaemon I'm overriding all the default handlers with my own, as I want
>>> to store all events in a sqlite database:
>>>
>>> https://git.fedorahosted.org/cgit/rasdaemon.git/tree/ras-aer-handler.c
>>
>> OK, fair enough. So that code can't be shared. Good to know.
>>
>>>
>>>>
>>>> Bottom line as far as I am concerned: it was merged while I have
>>>> outstanding questions.
>>>
>>> Well, only in this email are you making clear what your concerns are.
>>>
>>>> If there are good technical reasons why the two problems stated above
>>>> can't be easily resolved, then I will reconsider, but for now I think it
>>>> is too early to merge. For the record, I don't think I mentioned the first
>>>> problem in my original reply; it's something I noticed yesterday while
>>>> looking at the patch again.
>>>>
>>>>>
>>>>> Before this patch, all we have is (1), as a non-dynamic printk is too
>>>>> intrusive to be used for performance measurement. So there's no way to
>>>>> measure how much time a buffer takes to be filled.
>>>>
>>>> Having tracepoints here doesn't tell you anything about that. At best it
>>>> gives you information about the jitter.
>>>>
>>>> How are these tracepoints supposed to be used? What information will they
>>>> provide? Are tracepoints expected to be limited to QBUF/DQBUF? Or other
>>>> ioctls/fops as well? If so, which ones?
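Regarding the raw per-CPU files mentioned above: to illustrate how a consumer
uses the offsets from the format file, here is a small user-space sketch that
decodes one such record by hand, including the __data_loc encoding (low 16 bits
are the offset of the string within the record, high 16 bits its length). The
offsets are hard-coded from the format dump above purely for illustration; a
real tool such as perf or rasdaemon parses the format file instead:

/* Sketch only: decode one aer_event record as laid out in the format file. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void print_aer_record(const unsigned char *rec, size_t rec_len)
{
        uint32_t dev_name_loc, status;
        uint8_t severity;

        memcpy(&dev_name_loc, rec + 8, 4);   /* __data_loc dev_name; offset:8  */
        memcpy(&status, rec + 12, 4);        /* u32 status;          offset:12 */
        severity = rec[16];                  /* u8 severity;         offset:16 */

        /* __data_loc: offset in the low 16 bits, length in the high 16 bits. */
        uint16_t str_off = dev_name_loc & 0xffff;
        uint16_t str_len = dev_name_loc >> 16;

        if ((size_t)str_off + str_len > rec_len)
                return; /* malformed record */

        printf("%.*s: status=0x%08x severity=%u\n",
               (int)str_len, (const char *)(rec + str_off), status, severity);
}

int main(void)
{
        /* Build a synthetic record just to exercise the decoder. */
        unsigned char rec[36] = { 0 };
        const char name[] = "0000:00:1c.0";
        uint32_t loc = ((uint32_t)sizeof(name) << 16) | 20; /* len<<16 | offset */
        uint32_t status = 0x40;                             /* Bad TLP */

        memcpy(rec + 8, &loc, 4);
        memcpy(rec + 12, &status, 4);
        rec[16] = 0;                                        /* severity */
        memcpy(rec + 20, name, sizeof(name));

        print_aer_record(rec, sizeof(rec));
        return 0;
}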
>>>
>>> As shown above, a performance tool can add handlers for the tracepoint
>>> data, allowing it to discover, for example, whether dequeues occurred in a
>>> different order than queues, to associate each dequeued buffer with the
>>> corresponding queued one, and to measure the time between each
>>> queue/dequeue pair.
>>
>> I'm not really convinced about it, which is why I would like to know a bit
>> more about how Wade uses it.
>
> Put simply, we have an internal tool that can do some analysis of
> LTTng-formatted traces, and we'd like to use it to do some very simple
> performance analyses of some V4L2 devices. Our analyses really only need to
> check queue/dequeue events, primarily for the reason Mauro stated: measuring
> the time between such pairs. Therefore, this is in no way meant to be an
> exhaustive implementation, but it certainly could be a starting point for
> one.
>
>>
>> Also, the tracepoint is placed outside the lock, which will make any time
>> measurements even less reliable than they already are. Putting it inside the
>> lock will also ensure that the traces are generated in the order in which
>> they are serialized by the lock. In addition, the error code of the ioctl
>> isn't traced either. Perhaps the traces should only be done if ret == 0
>> anyway?
>
> I fully admit that my implementation may not be 100% correct, which is part
> of the reason why I posted this as an RFC in the first place.
>
> Moving this code inside the lock and adding the return value to the trace
> should not be a problem. If there are other specific things I need to change
> to improve the patch, I would be happy to do so.

Thanks, Wade! I plan on taking a closer look tomorrow or Monday (probably
tomorrow), and I might write a patch myself moving things around.

One question: would it make sense to only do the trace if the (d)qbuf call was
successful? I don't really see much purpose in doing the trace if an error
occurred, certainly not from a performance PoV. But I don't have a strong
opinion on this, so if you prefer to always trace it, then that's not a
problem.

Regards,

        Hans
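P.S. For concreteness, the placement discussed above would be roughly along
these lines. This is only a sketch with hypothetical names (v4l2_qbuf_locked
and the trace_v4l2_qbuf tracepoint from the earlier sketch), not the actual
v4l2-ioctl.c code:

#include <linux/mutex.h>
#include <linux/videodev2.h>
#include <media/v4l2-dev.h>
#include <media/v4l2-ioctl.h>

/*
 * Sketch: the trace call sits inside the serialization lock, so events are
 * emitted in the order the ioctls are actually serialized, and only
 * successful calls are traced.
 */
static int v4l2_qbuf_locked(struct file *file, struct video_device *vdev,
                            struct v4l2_buffer *buf)
{
        int ret;

        mutex_lock(vdev->lock);
        ret = vdev->ioctl_ops->vidioc_qbuf(file, file->private_data, buf);
        if (!ret)
                trace_v4l2_qbuf(vdev->minor, buf);  /* hypothetical tracepoint */
        mutex_unlock(vdev->lock);

        return ret;
}

Tracing only on success keeps the queue/dequeue pairs clean for latency
measurements; if the error path ever turns out to be interesting, the return
value could be added to the event as an extra field instead.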