On 06.02.2017 11:39, Thierry Reding wrote: > On Fri, Feb 03, 2017 at 09:54:42AM +0100, Andrzej Hajda wrote: >> Hi Thierry, >> >> Finally something technical :) >> >> On 02.02.2017 18:58, Thierry Reding wrote: >>> On Tue, Jan 31, 2017 at 01:05:20PM +0100, Andrzej Hajda wrote: >>>> On 31.01.2017 09:54, Thierry Reding wrote: >>>>> On Tue, Jan 31, 2017 at 09:01:07AM +0900, Inki Dae wrote: >>>>>> 2017년 01월 24일 10:50에 Hoegeun Kwon 이(가) 쓴 글: >>>>>>> Dear Thierry, >>>>>>> >>>>>>> Could you please review this patch? >>>>>> Thierry, I think this patch has been reviewed enough but no comment >>>>>> from you. Seems you are busy. I will pick up this. >>>>> Sorry, but that's not how it works. This patch has gone through 8 >>>>> revisions within 4 weeks, and I tend to ignore patches like that until >>>>> the dust settles. >>>>> >>>>> Other than that, this continues the same madness that I've repeatedly >>>>> complained about in the past. The whole mechanism of running through a >>>>> series of writes and not caring about errors until the very end is >>>>> something we've discussed at length in the past. >>>> Yes, we have discussed the pattern, but without any conclusion. The >>>> pattern is correct, used in different places in kernel (see below for >>>> examples) and significantly decreases source code size. Disallowing it >>>> in panels subsystem just because of personal preferences of the >>>> maintainer does not seem to be proper. >>>> >>>>> It also in large parts >>>>> duplicates a bunch of functions from other Samsung panel drivers that I >>>>> already said should eventually be moved to something saner. >>>> Currently there are two Samsung panel drivers, one is on SPI bus, >>>> another one is on DSI. >>>> I am (co-)author of both drivers, they have some similarities but I did >>>> not see any gain in making additional abstractions over transport layer >>>> just to make one or two small functions common. >>>> Could you be more precise what are you talking about, or could you give >>>> a link to the mail where you said it. Anything I remember was a >>>> discussion about ugly magic initialization sequences due to poor >>>> documentation. >>>> >>>> And below promised examples of the error pattern, it was time consuming >>>> to find them, so please at least read them :) >>> I've done better, below is what I hope a list of good arguments why the >>> pattern is a bad idea, and why in some cases it can be justified. >>> >>>> Almost exactly the same patterns for the same purpose: >>>> >>>> 1. http://lxr.free-electrons.com/source/drivers/net/ieee802154/atusb.c#L63 >>>> Citation from the code: >>>> To reduce the number of error checks in the code, we record the >>>> first error >>>> in atusb->err and reject all subsequent requests until the error is >>>> cleared. >>> That's about the worst example you could've used. Have you even looked >>> at the code that uses this? It's completely crazy. So here we have the >>> atusb_control_msg() function that stores this error, but then also >>> propagates it to its caller. One of these callers is atusb_read_reg(), >>> which also propagates the error or returns the register value if the >>> read was successful. >>> >>> Now the really insane part is how this is then used in something like >>> atusb_write_subreg(): >>> >>> orig = atusb_read_reg(atusb, reg); >>> tmp = orig & ~mask; >>> tmp |= (value << shift) & mask; >>> >>> if (tmp != orig) >>> ret = atusb_write_reg(atusb, reg, tmp); >>> >>> So let's assume that atusb_control_msg() fails in the call to >>> atusb_read_reg(). You'll be returning an error code, mask out some bits, >>> OR in new ones and write the result to the register. So not only does >>> the code not bother to check for errors, but it will also happily go and >>> corrupt registers when an error has occurred while reading. >> Not true, in case of error in atusb_read_reg atusb_write_reg will do >> nothing because atusb->err is set ! >> And this is strong point of the idiom - you will not be able to perform >> transfer until the error is explicitly cleared. > Yes, I see now that there's an additional check before further accesses > can be made. So I have to apologize to the author and it turns out that > the code isn't crazy. I suppose it's me that's too stupid to understand > how this works. > > I wouldn't have been confused if this was using a more idiomatic way of > doing things. The mere fact that code will read/modify/write registers > unconditionally suggests that there's no error checking going on. Even > if the code doesn't actually corrupt registers, it does have the effect > of executing a bunch of instructions completely unnecessarily. If there are really time consuming tasks you can add always error check before, as I showed below. Anyway this is error path so the delay because of the idiom will be usually insignificant comparing to time of hw error handling/recovery/timeouts etc. > > To add to the confusion there are other parts in the driver that don't > seem comfortable using the idiom and you'll actually see things like: > > ret = atusb_write_subreg(...); > if (ret) > return ret; > > If you ask me, that's bound to lead to mistakes at some point. I agree propagating error in two ways can be misleading, however I guess in some cases it can be convenient, anyway this is small detail. I guess it is still easier to forgot about error handling with classical approach, here you add checks only before actions with side effects, in classical approach you are obliged to put checks everywhere, to achieve the same goal. > >> With this idiom it is enough to put guards only in one/two/few places, >> in traditional error handling you just need to put checks after every >> function call and it is quite common situation that developers forgot to >> do that, for example look at mipi calls here [1] :) >> [1]: >> http://lxr.free-electrons.com/source/drivers/gpu/drm/i915/intel_dsi_panel_vbt.c#L160 >> >> So the function is correct, but to avoid fooling unaware readers I would >> add error checking after read: >> >> orig = atusb_read_reg(atusb, reg); >> if (atusb->err) >> return err; > But then what's the point of storing the error somewhere in the first > place? Because you have still multiple places where you do not need to add checks. And placing check here will save few cpu clocks, but of course without it the code is still correct, it is just matter of taste and balance between short code vs quick error paths. > >>>> 2. http://lxr.free-electrons.com/source/drivers/md/dm-thin.c?v=4.4#L917 >>> This is completely different from the panel driver because it's used to >>> store a value from within callbacks that can't return one. >> This is internal framework of the driver, so nothing prevents developer >> from implementing callbacks that return value. >> Apparently storing error in the object(struct dm_thin_new_mapping) is >> good and convenient :) > No, I think this comes from the fact that the device mapper and block > I/O frameworks have infrastructure that doesn't allow the error code to > be directly propagated (see the copy_complete() and overwrite_endio() > callbacks). > >>>> 3. http://lxr.free-electrons.com/source/net/9p/trans_fd.c?v=3.18#L297 >>> Essentially the same thing. >> The same here. > Yes, this seems like it could easily be done using propagated error > codes. > >>>> 4. >>>> http://lxr.free-electrons.com/source/drivers/media/v4l2-core/v4l2-ctrls.c#L2234 >>> Looks like this could be replaced by the more idiomatic use of ERR_PTR() >>> encoding error codes. But the most significant difference here is that >>> each use of the handler_set_err() function is followed by a return. So >>> instead of ignoring errors like you do in the panel drivers, this does >>> recognize them and aborts early, rather than trying to go through the >>> remainder of the code, irrespective of how small the chances are of it >>> succeeding. Or ignoring that /even if/ the remainder didn't fail, the >>> one operation that fail might have been essential to the operation of >>> the device. >> I have an impression that you do not understand the idiom. It does not >> allow to ignore error - as soon as the error is detected, guard is set >> and no further harm can be done. >> Going back to V4L2, look at the usage of the idiom [2]: >> v4l2_ctrl_add_handler(hdl_vid_cap, hdl_user_gen, NULL); >> v4l2_ctrl_add_handler(hdl_vid_cap, hdl_user_vid, NULL); >> v4l2_ctrl_add_handler(hdl_vid_cap, hdl_user_aud, NULL); >> v4l2_ctrl_add_handler(hdl_vid_cap, hdl_streaming, NULL); >> v4l2_ctrl_add_handler(hdl_vid_cap, hdl_sdtv_cap, NULL); >> v4l2_ctrl_add_handler(hdl_vid_cap, hdl_loop_cap, NULL); >> if (hdl_vid_cap->error) >> return hdl_vid_cap->error; >> >> Again, the point is you do not need to use old construct, which enlarges >> the code about three times: >> err = func(obj,...); >> if (err < 0) >> return err; >> >> Instead you just call func(obj,...), and the burden of error checking is >> put into the function, but for this err must be stored in object. >> >> [2]: >> http://lxr.free-electrons.com/source/drivers/media/platform/vivid/vivid-ctrls.c#L1630 > There are a couple of pitfalls here: for example what do you do if the > first two calls fail, but the third succeeds. No way, at the beginning of v4l2_ctrl_add_handler there is check: if (hdl->error <http://lxr.free-electrons.com/ident?i=error>) return hdl->error <http://lxr.free-electrons.com/ident?i=error>; In fact this check is also in other functions: handler_new_ref, v4l2_ctrl_new, handler_new_ref. With this check after first error no further code is executed. > The end result will be > that the whole operation fails. What do you do if one of the operations > has side-effects that you need to undo? If you don't know which one it > was, how will you know? You can always add explicit check after the operation which requires special handling, like in traditional approach. The point is you do not need to add checks which just propagate the error: "if (err < 0) return err;". The last sentence clearly suggests where the idiom is most suitable - the object on which you should perform bunch of operations and you are mainly interested if all operations succeeded. This is for example creation of V4L2 control handler - we are not interested in handler with some controls broken, we want fully working handler with all controls or nothing. Another example are devices attached to some bus which can fail, if we initialize it we want to perform bunch of writes, if any of write fails whole initialization fails also. > > And yes, this does allow you to ignore errors. The function calls before > the actual error handling will be executed regardless of errors that are > stored. While in many cases the code seems to correctly shortcut if the > guard is set, it also encodes many assumptions, not all of which may be > understood by morons like me. > > The end result is that the code becomes more vulnerable to accidental > mistakes. These will be hard to spot during review. Keeping code linear > and simple makes it much more difficult to introduce bugs. This is very disputable, error paths are the most error prone pieces of code :) There are many places where these errors are not propagated, or are propagated incorrectly, I suppose it is mainly due to fact it should be put everywhere. > >>>> 5. http://lxr.free-electrons.com/source/drivers/media/i2c/s5k5baf.c#L451 >>> That's a particularly good example of why you shouldn't be doing this >>> kind of thing. Consider what would happen in these cases if for example >>> there was a problem with the I2C interface. There's a bunch of read and >>> write sequences in that driver that go completely without checking for >>> errors. >> Not true, no further transfer will be performed. > Indeed. Clearly I wasn't looking closely enough. How about knowing > whether or not reads or writes succeed? To do so you would have to > explicitly check the error code again. If you need this information you can check, if don't you don't :) > >>> Imagine how that will look to a user when the communication to a >>> chip doesn't work. They'll get a load of error messages all saying that >>> communication timed out, or that the slave isn't responding or what not. >>> And worse still, for timeouts you're also going to introduce a bunch of >>> error messages interleaved with potentially other useful messages from >>> other drivers or the core. Depending on the number of access you have it >>> might take seconds for all of the error messages to drain before you >>> notice somewhere at the end of the code that something went wrong and >>> decide to abort whatever you were trying to do. >> Again, not true at all. > So after looking at this is more detail, I'll grant you that it gives > you a way to avoid checking for errors at every step of the way. There > is a disadvantage because you keep executing a lot of code > unnecessarily, and you make the code susceptible to mistakes because > it contains a non-obvious assumption that contributors to your code may > not be aware of. > >>>> This is my driver, I mention it here just to show it was not a >>>> problem to merge it to media subsystem. >>> That just shows what everybody knows: that maintainers care about >>> different things at different levels. >>> >>>> Similar patterns: >>>> >>>> 6. http://lxr.free-electrons.com/source/fs/seq_file.c#L398 >>>> Do not process object if buffer is full, it allows to do not check >>>> buffer size after every write, for example: >>>>> seq_printf(m, " hardirq-safe locks: %11lu\n", >>>>> nr_hardirq_safe); >>>>> seq_printf(m, " hardirq-unsafe locks: %11lu\n", >>>>> nr_hardirq_unsafe); >>>>> seq_printf(m, " softirq-safe locks: %11lu\n", >>>>> nr_softirq_safe); >>>>> seq_printf(m, " softirq-unsafe locks: %11lu\n", >>>>> nr_softirq_unsafe); >>>>> seq_printf(m, " irq-safe locks: %11lu\n", >>>>> nr_irq_safe); >>>>> seq_printf(m, " irq-unsafe locks: %11lu\n", >>>>> nr_irq_unsafe); >>>>> >>>>> seq_printf(m, " hardirq-read-safe locks: %11lu\n", >>>>> nr_hardirq_read_safe); >>>>> seq_printf(m, " hardirq-read-unsafe locks: %11lu\n", >>>>> nr_hardirq_read_unsafe); >>>>> seq_printf(m, " softirq-read-safe locks: %11lu\n", >>>>> nr_softirq_read_safe); >>>>> seq_printf(m, " softirq-read-unsafe locks: %11lu\n", >>>>> nr_softirq_read_unsafe); >>>>> seq_printf(m, " irq-read-safe locks: %11lu\n", >>>>> nr_irq_read_safe); >>>>> seq_printf(m, " irq-read-unsafe locks: %11lu\n", >>>>> nr_irq_read_unsafe); >>>>> >>>>> seq_printf(m, " uncategorized locks: %11lu\n", >>>>> nr_uncategorized); >>>>> seq_printf(m, " unused locks: %11lu\n", >>>>> nr_unused); >>>>> seq_printf(m, " max locking depth: %11u\n", >>>>> max_lockdep_depth); >>>> Now try to imagine how it would look like if you add error checking >>>> after each call. >>> I think that's a fairly sane use-case for this kind of pattern. The big >>> difference is that this condition is checked in subsequent accesses to >>> shortcut failure paths. It's also very different in that it performs >>> writes to memory and those aren't going to fail. The root cause of this >>> overflow would be static in nature and likely found doing basic testing >>> and fixed by making the buffer larger. >>> >>> That's contrary to a driver doing I/O (I2C, SPI, DSI, ...) that can fail >>> unexpectedly for any number of reasons. >> The idiom is still the same - storing error state in the object allows >> to move error checking inside called functions, removing much of >> redundant code from caller. > Yes, the idiom is the same, but the use-case is also completely > different, so I don't think you can apply the same rules. If you've got > communication with an external peripheral, like in the case of SPI, I2C > or DSI, I think the best thing to do is abort as early as possible on > failure. The most likely reason for failure would be a persistent cause > on the bus, in which case none of your accesses will succeed. If so I > think it makes sense to let the user know what exactly went wrong and > leave it at that. Continuing to execute code, even if no physical access > to the hardware happens anymore, is a waste of time. > > Failures on such busses can happen for any number of reasons, and they > can happen sporadically, so it's important to catch errors. For seq_buf > in particular you have a static failure case, which is if somebody were > trying to add more characters to the buffer than would fit. But callers > will, most of the time, put exactly the same number of characters into > that buffer. So if there was a failure once they'd notice quickly and > the issue would get fixed, at which point it becomes very unlikely to > ever happen again. > > So seq_buf is not expected to fail, whereas with peripheral busses you > can't make that assumption. The error is logged as soon as it happens, you can easily check that every time error is set, it is also logged. This is another advantage of the idiom - you do not need to analyze, if the error some function has returned was logged already, or you should do it in the caller. With the idiom you can log error in every place the error is set, to be sure that it will not be silently omitted, of course sometimes you can want to log in wider context, the idiom of course allows it. > >>>> 7. http://lxr.free-electrons.com/source/lib/devres.c#L129 >>>> Postponed error check: >>>>> */res = platform_get_resource(pdev, IORESOURCE_MEM, 0);/* >>>>> /*b*/*/ase = devm_ioremap_resource(&pdev->dev, res);/* >>>>> */if (IS_ERR(base))/* >>>>> /**/*/return PTR_ERR(base);/* >>> Heh, I wrote that =) >>> >>> This is also quite different from your usage in the panel driver. We do >>> not simply ignore failure from platform_get_resource(), instead we leave >>> it up to devm_ioremap_resource() to turn it into a meaningful error code >>> and message. The function does this very early on, so the error >>> condition is not ignored, as it is in the panel driver. >> Claims about the panel again not true, reasons described above. >> >> The similarity here is that you do not perform error checking in the >> caller, you just blindly pass the 'object' to the next function, and >> still everything is handled correctly. > But we're not passing any object to the next function. We're passing the > return value from an earlier call, one line above. We're not hiding an > error code in some driver-specific structure. So the code is completely > explicit. > > Maybe I should say a few words about why this is important to me because > you may think this to be somewhat pedantic. This is fundamentally caused > by the fact that once I merge code into a tree that I maintain, then the > responsibility for that code becomes mine. That means I will not only be > held accountable if it doesn't work, breaks or upsets build farms all > over the world. It also means that I am the one that will need to make > changes to it along the way. If the code is difficult for me to > understand because it doesn't behave the way I'm used to, then that > makes my life harder. > > Sometimes authors do stick around and care about their drivers, but > often they'll just disappear, or ignore patches and bug reports about > their drivers. So if worst comes to worst people will look to me to fix > their problems. That's why I want it to be easy to read and modify the > code that I maintain, because that way when users show up and complain > about drivers I don't have to spend hours trying to understand what the > original author was trying to achieve. But this is common problem of all subsystems, there is nothing special about panels. > > The above is a variant of what others have termed "obviously correct". > One good way to make code obviously correct is to make it simple. > I understand what you want to say, but I cannot fully agree with it. With such conservative approach Linux would not evolve at all :) We need some balance between "code exactly as maintainer says, otherwise you will be nacked" and "do whatever you want", but I think in case of panel subsystem requirements are unnecessarily very strict. Ok, this part of the discussion is less technical, I guess it would be good to have opinions of other maintainers about the subject. Regards Andrzej -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html