Re: Detecting user data on base types

Andrew Murray <andrew.murray@xxxxxxx> · Wed, 5 Jun 2019 09:29:26 +0100




On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote:
> On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote:
> > > Another problem is that Smatch doesn't understand how the sign_extend64()
> > > function works so it doesn't understand the untagged_addr() macro.  :/
> > > I see one bug here and a missing feature...  Right now Smatch thinks
> > > that the return value is totally unknown and not user controlled.  This
> > > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().
> > 
> > Actually the ability to treat the return type as not user controlled makes
> > this a little simpler...
> > 
> > We want to flag a potential issue when a tagged address is used in a function
> > that wants untagged addresses (as marked by the address space annotation).
> > Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
> > this condition.
> > 
> > When a user correctly calls untagged_addr prior to calling these functions
> > the macro helpfully drops the user controlled flag which stops the condition
> > above triggering.
> > 
> > Is there a way of getting smatch to drop the 'user controlled' flag without
> > relying on a bug?
> 
> We know the high byte is either 0 or 0xff so we could do something like:
> 
> 	struct symbol *type;
> 	sval_t invalid = { .type = &ullong_ctype, .vaule = 1ULL << 62 };
> 
> 	type = get_type(expr);
> 	if (!type || type_bits(type) != 64)
> 		return;
> 
> 	if (!get_user_rl(expr, &rl))
> 		return;
> 	if (rl_has_sval(rl, invalid))
> 		sm_warning("potential tagged issue '%s'", name);
> 
> I will fix the untagged_addr() handling very soon.

Thanks for this.

Though actually for MTE [1] the top byte may contain any value. But the above
could easily be updated.

> 
> > Thanks to your help I believe I have this working as follows:
> > 
> > diff --git a/check_list.h b/check_list.h
> > index b1d24c504ba5..f7551e7c5215 100644
> > --- a/check_list.h
> > +++ b/check_list.h
> > @@ -192,6 +192,7 @@ CK(check_nospec_barrier)
> >  CK(check_spectre)
> >  CK(check_spectre_second_half)
> >  CK(check_implicit_dependencies)
> > +CK(check_tagged)
> >  
> >  /* wine specific stuff */
> >  CK(check_wine_filehandles)
> > diff --git a/check_tagged.c b/check_tagged.c
> > new file mode 100644
> > index 000000000000..d14a81a6c33a
> > --- /dev/null
> > +++ b/check_tagged.c
> > @@ -0,0 +1,49 @@
> > +#include "smatch.h"
> > +#include "smatch_extra.h"
> > +
> > +static void untagged_check(struct expression *expr)
> > +{
> > +       char *name;
> > +
> > +       if (parse_error)
> > +               return;
> > +
> > +       if (is_impossible_path())
> > +               return;
> > +
> > +       if (expr->type == EXPR_PREOP)
> > +               return;
> > +
> > +       if (is_user_rl(expr)) {
> > +               if (expr->symbol && expr->symbol->ctype.as == 5) {
> > +                       name = expr_to_str(expr);
> > +                       sm_warning("potential tagged issue '%s'", name);
> > +                       free_string(name);
> > +               }
> > +       }
> > +}
> > +
> > +void check_tagged(int id)
> > +{
> > +       if (option_project != PROJ_KERNEL)
> > +               return;
> > +
> > +       add_hook(&untagged_check, EXPR_HOOK);
> > +}
> >  
> > This correctly identifies functions that have been annotated with the __untagged
> > annotation that use data from userspace. However, it's not actually that
> > useful...
> > 
> > For example, if I annotate the find_vma function, then it will flag this as a
> > potential tagged issue given that its called (from at least some contexts) with
> > userspace data. However find_vma is called all over the kernel, and so it's
> > difficult to figure out which caller called find_vma with user data.
> > 
> > It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
> > 'find_vma' target to look for all the callers where this is a leaf function, but
> > this doesn't track the call parameters to exclude paths where the leaf annotated
> > function contains user data.
> 
> Huh...  I haven't thought about that script in years...  :/
> 
> What I do is I download the latest linux-next ever day and rebuild my DB
> every day.  Each time you rebuild it, the call tree gets filled out
> a bit.  After about a week then the DB is as complete as it is going to
> get.
> 
> Then I use the smatch_data/db/smdb.py script to figure out the warnings.
> I should add an option so that it only shows callers which pass user
> data.  Each call site has a unique caller ID.

Thanks - I hadn't looked at that script, but looks very useful.

By the way with the following hunk you can, for a given function, which call sites
pass user data.

@@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info":
     print_caller_info(filename, func)
 elif sys.argv[1] == "user_data":
     func = sys.argv[2]
+    filename = sys.argv[3]
     print_caller_info(filename, func, "USER_DATA")
 elif sys.argv[1] == "param_value":
     func = sys.argv[2]

Though I've found it's quite useful to manually type commands in the sqlite3 CLI.

> 
> 
> > I haven't fully understood var_user_rl - but it looks like functions such as
> > match_user_assign_function propogate the user_data tag on each run to update
> > state, and thus it's probably not easy to get the original source of user
> > data after is_user_rl determines its user. Is that correct?
> 
> Use the get_user_rl() function instead.  The var_user_rl() is only
> supposed to be used internally, to do math.

OK.

> 
> > 
> > I would be great if there was a way to obtain the function that provided the
> > original source of user data - then it would be possible to write a script that
> > shows the path (call chain) between the two functions - I hoped smatch_scripts/
> > call_tree.pl could do this but it doesn't seem to do anything for me.
> 
> That's the smatch_data/db/smdb.py script.  I have it as an alias in vim
> so I can do CTRL-c to see how a function is called or CTRL-r to see
> what it returns
> 
> # map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> # map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> 

I've spent some time playing with this - thanks.

Through this discussion I'm able to detect when annotated function parameters contain
user provided values. The challenge for me is to detect where that data originated
from (i.e. following the parameter up the call tree) to ease debugging.

My first attempt didn't trace the parameters and just looked at the call tree for any
functions which provided user data, however this resulted in false positives (e.g.
just because a function higher up in the call stack passed user data, it doesn't mean
it was this data that made it to the target function).

I've made some progress by adapting the trace_param feature of smdb.py - I've modified
it such that for a given symbol (e.g. find_vma), it will trace the parameters up the
call tree, if it sees a function where the param is user data (8017) then it prints the
symbol. This seems to work, but it doesn't seem to catch everything it should.

For example, find_vma is called by apply_vma_lock_flags which is called by do_mlock,
probing the database:

sqlite> select * from caller_info where function='find_vma' and (parameter = -1 or parameter = 1) and caller='apply_vma_lock_flags';         
file|caller|function|call_id|static|type|parameter|key|value
mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|0|-1||struct vm_area_struct*(*)(struct mm_struct*, ulong)
mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1004|1|$|1
mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1014|1|$|p 0
mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1001|1|$|0,4096-18446744073709547520

The third entry indicates that find_vma was called by apply_vma_lock_flags and find_vma(arg 1) == apply_vma_lock_flags(arg 0),
this is correct, moving up the call tree:

sqlite> select * from caller_info where function='apply_vma_lock_flags' and (parameter = -1 or parameter = 0) and caller='do_mlock';
file|caller|function|call_id|static|type|parameter|key|value
mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|0|-1||int(*)(ulong, ulong, ulong)
mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1004|0|$|1
mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1001|0|$|0,4096-18446744073709547520

The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of
apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and
our tracing of params stops early. Do you have any clue why this may be?

I've rebuilt the database in a loop with ~/smatch/smatch_scripts/build_kernel_data.sh.

Thanks,

Andrew Murray

[1] https://llvm.org/devmtg/2018-10/slides/Serebryany-Stepanov-Tsyrklevich-Memory-Tagging-Slides-LLVM-2018.pdf

> regards,
> dan carpenter