Re: Detecting user data on base types

Andrew Murray <andrew.murray@xxxxxxx> · Wed, 5 Jun 2019 12:47:38 +0100

On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote:
> On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote:
> > On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote:
> > > > Another problem is that Smatch doesn't understand how the sign_extend64()
> > > > function works so it doesn't understand the untagged_addr() macro.  :/
> > > > I see one bug here and a missing feature...  Right now Smatch thinks
> > > > that the return value is totally unknown and not user controlled.  This
> > > > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().
> > > 
> > > Actually the ability to treat the return type as not user controlled makes
> > > this a little simpler...
> > > 
> > > We want to flag a potential issue when a tagged address is used in a function
> > > that wants untagged addresses (as marked by the address space annotation).
> > > Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
> > > this condition.
> > > 
> > > When a user correctly calls untagged_addr prior to calling these functions
> > > the macro helpfully drops the user controlled flag which stops the condition
> > > above triggering.
> > > 
> > > Is there a way of getting smatch to drop the 'user controlled' flag without
> > > relying on a bug?
> > 
> > We know the high byte is either 0 or 0xff so we could do something like:
> > 
> > 	struct symbol *type;
> > 	sval_t invalid = { .type = &ullong_ctype, .value = 1ULL << 62 };
> > 
> > 	type = get_type(expr);
> > 	if (!type || type_bits(type) != 64)
> > 		return;
> > 
> > 	if (!get_user_rl(expr, &rl))
> > 		return;
> > 	if (rl_has_sval(rl, invalid))
> > 		sm_warning("potential tagged issue '%s'", name);
> > 
> > I will fix the untagged_addr() handling very soon.
> 
> Thanks for this.
> 
> Though actually for MTE [1] the top byte may contain any value. But the above
> could easily be updated.
> 
> > 
> > > Thanks to your help I believe I have this working as follows:
> > > 
> > > diff --git a/check_list.h b/check_list.h
> > > index b1d24c504ba5..f7551e7c5215 100644
> > > --- a/check_list.h
> > > +++ b/check_list.h
> > > @@ -192,6 +192,7 @@ CK(check_nospec_barrier)
> > >  CK(check_spectre)
> > >  CK(check_spectre_second_half)
> > >  CK(check_implicit_dependencies)
> > > +CK(check_tagged)
> > >  
> > >  /* wine specific stuff */
> > >  CK(check_wine_filehandles)
> > > diff --git a/check_tagged.c b/check_tagged.c
> > > new file mode 100644
> > > index 000000000000..d14a81a6c33a
> > > --- /dev/null
> > > +++ b/check_tagged.c
> > > @@ -0,0 +1,49 @@
> > > +#include "smatch.h"
> > > +#include "smatch_extra.h"
> > > +
> > > +static void untagged_check(struct expression *expr)
> > > +{
> > > +       char *name;
> > > +
> > > +       if (parse_error)
> > > +               return;
> > > +
> > > +       if (is_impossible_path())
> > > +               return;
> > > +
> > > +       if (expr->type == EXPR_PREOP)
> > > +               return;
> > > +
> > > +       if (is_user_rl(expr)) {
> > > +               if (expr->symbol && expr->symbol->ctype.as == 5) {
> > > +                       name = expr_to_str(expr);
> > > +                       sm_warning("potential tagged issue '%s'", name);
> > > +                       free_string(name);
> > > +               }
> > > +       }
> > > +}
> > > +
> > > +void check_tagged(int id)
> > > +{
> > > +       if (option_project != PROJ_KERNEL)
> > > +               return;
> > > +
> > > +       add_hook(&untagged_check, EXPR_HOOK);
> > > +}
> > >  
> > > This correctly identifies functions that have been annotated with the __untagged
> > > annotation that use data from userspace. However, it's not actually that
> > > useful...
> > > 
> > > For example, if I annotate the find_vma function, then it will flag this as a
> > > potential tagged issue given that it's called (from at least some contexts) with
> > > userspace data. However find_vma is called all over the kernel, and so it's
> > > difficult to figure out which caller called find_vma with user data.
> > > 
> > > It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
> > > 'find_vma' target to look for all the callers where this is a leaf function, but
> > > this doesn't track the call parameters to exclude paths where the leaf annotated
> > > function contains user data.
> > 
> > Huh...  I haven't thought about that script in years...  :/
> > 
> > What I do is I download the latest linux-next every day and rebuild my DB
> > every day.  Each time you rebuild it, the call tree gets filled out
> > a bit.  After about a week then the DB is as complete as it is going to
> > get.
> > 
> > Then I use the smatch_data/db/smdb.py script to figure out the warnings.
> > I should add an option so that it only shows callers which pass user
> > data.  Each call site has a unique caller ID.
> 
> Thanks - I hadn't looked at that script, but looks very useful.
> 
> By the way, with the following hunk you can see, for a given function, which call sites
> pass user data.
> 
> @@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info":
>      print_caller_info(filename, func)
>  elif sys.argv[1] == "user_data":
>      func = sys.argv[2]
> +    filename = sys.argv[3]
>      print_caller_info(filename, func, "USER_DATA")
>  elif sys.argv[1] == "param_value":
>      func = sys.argv[2]
> 
> Though I've found it's quite useful to manually type commands in the sqlite3 CLI.
> 
> > 
> > 
> > > I haven't fully understood var_user_rl - but it looks like functions such as
> > > match_user_assign_function propagate the user_data tag on each run to update
> > > state, and thus it's probably not easy to get the original source of user
> > > data after is_user_rl determines it's user data. Is that correct?
> > 
> > Use the get_user_rl() function instead.  The var_user_rl() is only
> > supposed to be used internally, to do math.
> 
> OK.
> 
> > 
> > > 
> > > It would be great if there was a way to obtain the function that provided the
> > > original source of user data - then it would be possible to write a script that
> > > shows the path (call chain) between the two functions - I hoped smatch_scripts/
> > > call_tree.pl could do this but it doesn't seem to do anything for me.
> > 
> > That's the smatch_data/db/smdb.py script.  I have it as an alias in vim
> > so I can do CTRL-c to see how a function is called or CTRL-r to see
> > what it returns.
> > 
> > # map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> > # map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> > 
> 
> I've spent some time playing with this - thanks.
> 
> Through this discussion I'm able to detect when annotated function parameters contain
> user provided values. The challenge for me is to detect where that data originated
> from (i.e. following the parameter up the call tree) to ease debugging.
> 
> My first attempt didn't trace the parameters and just looked at the call tree for any
> functions which provided user data, however this resulted in false positives (e.g.
> just because a function higher up in the call stack passed user data, it doesn't mean
> it was this data that made it to the target function).
> 
> I've made some progress by adapting the trace_param feature of smdb.py - I've modified
> it such that for a given symbol (e.g. find_vma), it will trace the parameters up the
> call tree; if it sees a function where the param is user data (8017) then it prints the
> symbol. This seems to work, but it doesn't seem to catch everything it should.
> 
> For example, find_vma is called by apply_vma_lock_flags which is called by do_mlock,
> probing the database:
> 
> sqlite> select * from caller_info where function='find_vma' and (parameter = -1 or parameter = 1) and caller='apply_vma_lock_flags';         
> file|caller|function|call_id|static|type|parameter|key|value
> mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|0|-1||struct vm_area_struct*(*)(struct mm_struct*, ulong)
> mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1004|1|$|1
> mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1014|1|$|p 0
> mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1001|1|$|0,4096-18446744073709547520
> 
> The third entry indicates that find_vma was called by apply_vma_lock_flags and find_vma(arg 1) == apply_vma_lock_flags(arg 0).
> This is correct. Moving up the call tree:
> 
> sqlite> select * from caller_info where function='apply_vma_lock_flags' and (parameter = -1 or parameter = 0) and caller='do_mlock';
> file|caller|function|call_id|static|type|parameter|key|value
> mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|0|-1||int(*)(ulong, ulong, ulong)
> mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1004|0|$|1
> mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1001|0|$|0,4096-18446744073709547520
> 
> The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of
> apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and
> our tracing of params stops early. Do you have any clue why this may be?

It seems that this one is due to the 'param_was_set' check in smatch_data_source.c;
removing this check overcomes this issue. Is this check necessary? Or perhaps this
should have returned true?
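
To make the question concrete, here is a rough, self-contained sketch of the
parameter tracing described above. It is not the smdb.py code: make_db() and
trace_user_data() are hypothetical helpers, the table is an in-memory stand-in
with the caller_info columns shown above, and the 8017 row is made up for
illustration. Only the 1014 'p 0' row is taken from the query output above.

```python
import sqlite3

DATA_SOURCE = 1014  # row whose value "p N" names the caller's parameter N
USER_DATA = 8017    # row marking the parameter as user data at a call site


def make_db(rows):
    """Build an in-memory stand-in for the caller_info table (assumption)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "create table caller_info (file, caller, function, call_id, "
        "static, type, parameter, key, value)"
    )
    db.executemany("insert into caller_info values (?,?,?,?,?,?,?,?,?)", rows)
    return db


def trace_user_data(db, function, param, seen=None):
    """Walk up the call tree from (function, param).

    Returns (caller, function, param) for every call site on the way up
    where the traced parameter is flagged as user data.
    """
    seen = set() if seen is None else seen
    if (function, param) in seen:  # guard against recursive call chains
        return []
    seen.add((function, param))

    hits = []
    rows = db.execute(
        "select caller, type, value from caller_info "
        "where function = ? and parameter = ? and key = '$' "
        "and type in (?, ?) order by call_id, type",
        (function, param, DATA_SOURCE, USER_DATA),
    ).fetchall()
    for caller, row_type, value in rows:
        if row_type == USER_DATA:
            hits.append((caller, function, param))
        elif value.startswith("p "):
            # "p N": the argument came from the caller's parameter N
            hits.extend(trace_user_data(db, caller, int(value.split()[1]), seen))
    return hits


rows = [
    # taken from the query output above
    ("mm/mlock.c", "apply_vma_lock_flags", "find_vma",
     598093, 0, DATA_SOURCE, 1, "$", "p 0"),
    # made-up USER_DATA row, for illustration only
    ("mm/mlock.c", "do_mlock", "apply_vma_lock_flags",
     598106, 1, USER_DATA, 0, "$", "0-u64max"),
]
db = make_db(rows)
print(trace_user_data(db, "find_vma", 1))
```

The walk only continues upwards while a 1014 row is present for the call site,
so a missing data source row (as for do_mlock -> apply_vma_lock_flags above)
ends the trace at that level.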

Also should "get_user" be added to smatch_kernel_user_data.c in addition to
copy_from_user?

Thanks,

Andrew Murray

> 
> I've rebuilt the database in a loop with ~/smatch/smatch_scripts/build_kernel_data.sh.
> 
> Thanks,
> 
> Andrew Murray
> 
> [1] https://llvm.org/devmtg/2018-10/slides/Serebryany-Stepanov-Tsyrklevich-Memory-Tagging-Slides-LLVM-2018.pdf
> 
> > regards,
> > dan carpenter