Powered by Linux
Re: Detecting user data on base types — Semantic Matching Tool

Re: Detecting user data on base types

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jun 05, 2019 at 12:47:38PM +0100, Andrew Murray wrote:
> On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote:
> > On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote:
> > > On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote:
> > > > > Another problem is that Smatch doesn't understand how the sign_extend64()
> > > > > function works so it doesn't understand the untagged_addr() macro.  :/
> > > > > I see one bug here and a missing feature...  Right now Smatch thinks
> > > > > that the return value is totally unknown and not user controlled.  This
> > > > > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().
> > > > 
> > > > Actually the ability to treat the return type as not user controlled makes
> > > > this a little simpler...
> > > > 
> > > > We want to flag a potential issue when a tagged address is used in a function
> > > > that wants untagged addresses (as marked by the address space annotation).
> > > > Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
> > > > this condition.
> > > > 
> > > > When a user correctly calls untagged_addr prior to calling these functions
> > > > the macro helpfully drops the user controlled flag which stops the condition
> > > > above triggering.
> > > > 
> > > > Is there a way of getting smatch to drop the 'user controlled' flag without
> > > > relying on a bug?
> > > 
> > > We know the high byte is either 0 or 0xff so we could do something like:
> > > 
> > > 	struct symbol *type;
> > > 	sval_t invalid = { .type = &ullong_ctype, .vaule = 1ULL << 62 };
> > > 
> > > 	type = get_type(expr);
> > > 	if (!type || type_bits(type) != 64)
> > > 		return;
> > > 
> > > 	if (!get_user_rl(expr, &rl))
> > > 		return;
> > > 	if (rl_has_sval(rl, invalid))
> > > 		sm_warning("potential tagged issue '%s'", name);
> > > 
> > > I will fix the untagged_addr() handling very soon.
> > 
> > Thanks for this.
> > 
> > Though actually for MTE [1] the top byte may contain any value. But the above
> > could easily be updated.
> > 
> > > 
> > > > Thanks to your help I believe I have this working as follows:
> > > > 
> > > > diff --git a/check_list.h b/check_list.h
> > > > index b1d24c504ba5..f7551e7c5215 100644
> > > > --- a/check_list.h
> > > > +++ b/check_list.h
> > > > @@ -192,6 +192,7 @@ CK(check_nospec_barrier)
> > > >  CK(check_spectre)
> > > >  CK(check_spectre_second_half)
> > > >  CK(check_implicit_dependencies)
> > > > +CK(check_tagged)
> > > >  
> > > >  /* wine specific stuff */
> > > >  CK(check_wine_filehandles)
> > > > diff --git a/check_tagged.c b/check_tagged.c
> > > > new file mode 100644
> > > > index 000000000000..d14a81a6c33a
> > > > --- /dev/null
> > > > +++ b/check_tagged.c
> > > > @@ -0,0 +1,49 @@
> > > > +#include "smatch.h"
> > > > +#include "smatch_extra.h"
> > > > +
> > > > +static void untagged_check(struct expression *expr)
> > > > +{
> > > > +       char *name;
> > > > +
> > > > +       if (parse_error)
> > > > +               return;
> > > > +
> > > > +       if (is_impossible_path())
> > > > +               return;
> > > > +
> > > > +       if (expr->type == EXPR_PREOP)
> > > > +               return;
> > > > +
> > > > +       if (is_user_rl(expr)) {
> > > > +               if (expr->symbol && expr->symbol->ctype.as == 5) {
> > > > +                       name = expr_to_str(expr);
> > > > +                       sm_warning("potential tagged issue '%s'", name);
> > > > +                       free_string(name);
> > > > +               }
> > > > +       }
> > > > +}
> > > > +
> > > > +void check_tagged(int id)
> > > > +{
> > > > +       if (option_project != PROJ_KERNEL)
> > > > +               return;
> > > > +
> > > > +       add_hook(&untagged_check, EXPR_HOOK);
> > > > +}
> > > >  
> > > > This correctly identifies functions that have been annotated with the __untagged
> > > > annotation that use data from userspace. However, it's not actually that
> > > > useful...
> > > > 
> > > > For example, if I annotate the find_vma function, then it will flag this as a
> > > > potential tagged issue given that its called (from at least some contexts) with
> > > > userspace data. However find_vma is called all over the kernel, and so it's
> > > > difficult to figure out which caller called find_vma with user data.
> > > > 
> > > > It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
> > > > 'find_vma' target to look for all the callers where this is a leaf function, but
> > > > this doesn't track the call parameters to exclude paths where the leaf annotated
> > > > function contains user data.
> > > 
> > > Huh...  I haven't thought about that script in years...  :/
> > > 
> > > What I do is I download the latest linux-next ever day and rebuild my DB
> > > every day.  Each time you rebuild it, the call tree gets filled out
> > > a bit.  After about a week then the DB is as complete as it is going to
> > > get.
> > > 
> > > Then I use the smatch_data/db/smdb.py script to figure out the warnings.
> > > I should add an option so that it only shows callers which pass user
> > > data.  Each call site has a unique caller ID.
> > 
> > Thanks - I hadn't looked at that script, but looks very useful.
> > 
> > By the way with the following hunk you can, for a given function, which call sites
> > pass user data.
> > 
> > @@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info":
> >      print_caller_info(filename, func)
> >  elif sys.argv[1] == "user_data":
> >      func = sys.argv[2]
> > +    filename = sys.argv[3]
> >      print_caller_info(filename, func, "USER_DATA")
> >  elif sys.argv[1] == "param_value":
> >      func = sys.argv[2]
> > 
> > Though I've found it's quite useful to manually type commands in the sqlite3 CLI.
> > 
> > > 
> > > 
> > > > I haven't fully understood var_user_rl - but it looks like functions such as
> > > > match_user_assign_function propogate the user_data tag on each run to update
> > > > state, and thus it's probably not easy to get the original source of user
> > > > data after is_user_rl determines its user. Is that correct?
> > > 
> > > Use the get_user_rl() function instead.  The var_user_rl() is only
> > > supposed to be used internally, to do math.
> > 
> > OK.
> > 
> > > 
> > > > 
> > > > I would be great if there was a way to obtain the function that provided the
> > > > original source of user data - then it would be possible to write a script that
> > > > shows the path (call chain) between the two functions - I hoped smatch_scripts/
> > > > call_tree.pl could do this but it doesn't seem to do anything for me.
> > > 
> > > That's the smatch_data/db/smdb.py script.  I have it as an alias in vim
> > > so I can do CTRL-c to see how a function is called or CTRL-r to see
> > > what it returns
> > > 
> > > # map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> > > # map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> > > 
> > 
> > I've spent some time playing with this - thanks.
> > 
> > Through this discussion I'm able to detect when annotated function parameters contain
> > user provided values. The challenge for me is to detect where that data originated
> > from (i.e. following the parameter up the call tree) to ease debugging.
> > 
> > My first attempt didn't trace the parameters and just looked at the call tree for any
> > functions which provided user data, however this resulted in false positives (e.g.
> > just because a function higher up in the call stack passed user data, it doesn't mean
> > it was this data that made it to the target function).
> > 
> > I've made some progress by adapting the trace_param feature of smdb.py - I've modified
> > it such that for a given symbol (e.g. find_vma), it will trace the parameters up the
> > call tree, if it sees a function where the param is user data (8017) then it prints the
> > symbol. This seems to work, but it doesn't seem to catch everything it should.
> > 
> > For example, find_vma is called by apply_vma_lock_flags which is called by do_mlock,
> > probing the database:
> > 
> > sqlite> select * from caller_info where function='find_vma' and (parameter = -1 or parameter = 1) and caller='apply_vma_lock_flags';         
> > file|caller|function|call_id|static|type|parameter|key|value
> > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|0|-1||struct vm_area_struct*(*)(struct mm_struct*, ulong)
> > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1004|1|$|1
> > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1014|1|$|p 0
> > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1001|1|$|0,4096-18446744073709547520
> > 
> > The third entry indicates that find_vma was called by apply_vma_lock_flags and find_vma(arg 1) == apply_vma_lock_flags(arg 0),
> > this is correct, moving up the call tree:
> > 
> > sqlite> select * from caller_info where function='apply_vma_lock_flags' and (parameter = -1 or parameter = 0) and caller='do_mlock';
> > file|caller|function|call_id|static|type|parameter|key|value
> > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|0|-1||int(*)(ulong, ulong, ulong)
> > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1004|0|$|1
> > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1001|0|$|0,4096-18446744073709547520
> > 
> > The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of
> > apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and
> > our tracing of params stops early. Do you have any clue why this may be?
> 
> It seems that this one is due to the 'param_was_set' check in smatch_data_source.c,
> removing this check overcomes this issue. Is this check necessary? Or perhaps this
> should have returned true?
> 
> Also should "get_user" be added to smatch_kernel_user_data.c in addition to
> copy_from_user?

Actually it looks there is logic associated to get_user in handle_get_user within
smatch_kernel_user_data.c - however this doesn't seem to trigger in the following (it looks
like it does something with the return value of the macro, but this is just an error flag):

do_pages_move -calls-> add_page_for_migration - the second argument of
add_page_for_migration 'addr' comes from get_user - but it isn't marked as USER_DATA in
the database. I don't understand why this is.

Thanks,

Andrew Murray

> 
> Thanks,
> 
> Andrew Murray
> 
> > 
> > I've rebuilt the database in a loop with ~/smatch/smatch_scripts/build_kernel_data.sh.
> > 
> > Thanks,
> > 
> > Andrew Murray
> > 
> > [1] https://llvm.org/devmtg/2018-10/slides/Serebryany-Stepanov-Tsyrklevich-Memory-Tagging-Slides-LLVM-2018.pdf
> > 
> > > regards,
> > > dan carpenter



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Big List of Linux Books]

  Powered by Linux