Re: [PATCH v2 1/1] binder: fix freeze race

Li Li <dualli@xxxxxxxxxxxx> · Mon, 13 Sep 2021 10:18:37 -0700

On Fri, Sep 10, 2021 at 12:15 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Sep 09, 2021 at 11:17:42PM -0700, Li Li wrote:
> > On Thu, Sep 9, 2021 at 10:38 PM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Sep 09, 2021 at 08:53:16PM -0700, Li Li wrote:
> > > >  struct binder_frozen_status_info {
> > > >       __u32            pid;
> > > > +
> > > > +     /* process received sync transactions since last frozen
> > > > +      * bit 0: received sync transaction after being frozen
> > > > +      * bit 1: new pending sync transaction during freezing
> > > > +      */
> > > >       __u32            sync_recv;
> > >
> > > You just changed a user/kernel api here, did you just break existing
> > > userspace applications?  If not, how did that not happen?
> > >
> >
> > That's a good question. This design does keep backward compatibility.
> >
> > The existing userspace applications call ioctl(BINDER_GET_FROZEN_INFO)
> > to check if there's sync or async binder transactions sent to a frozen
> > process.
> >
> > If the existing userspace app runs on a new kernel, a sync binder call
> > still sets bit 1 of sync_recv (as it's a bool variable) so the ioctl
> > will return the expected value (TRUE). The app just doesn't check bit
> > 1 intentionally so it doesn't have the ability to tell if there's a
> > race - this behavior is aligned with what happens on an old kernel as
> > the old kernel doesn't have bit 1 set at all.
> >
> > The bit 1 of sync_recv enables new userspace app the ability to tell
> > 1) there's a sync binder transaction happened when being frozen - same
> > as before; and 2) if that sync binder transaction happens exactly when
> > there's a race - a new information for rollback decision.
>
> Ah, can you add that to the changelog text to make it more obvious?
>
Sure, added that to V3, plus other minor improvements listed in the cover
letter. Please let me know if there's anything else I should continue
to improve.

https://lore.kernel.org/lkml/20210910164210.2282716-1-dualli@xxxxxxxxxxxx/

BTW, I had a stress test running, repeatedly freezing and unfreezing a
couple apps every second, which at the same time initiates new binder
transactions in a loop. The overnight stress test during the past
weekend showed positive results. Without this kernel patch, the reply
transaction will fail in tens of iterations. With this kernel patch
and the corresponding user space fix (rescheduling the freeze op to
next second in case race happens), the stress test runs for 24hrs
without a single failure.

Thanks,
Li
_______________________________________________
devel mailing list
devel@xxxxxxxxxxxxxxxxxxxxxx
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel