On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote: > On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote: > > On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote: > > > > Another problem is that Smatch doesn't understand how the sign_extend64() > > > > function works so it doesn't understand the untagged_addr() macro. :/ > > > > I see one bug here and a missing feature... Right now Smatch thinks > > > > that the return value is totally unknown and not user controlled. This > > > > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop(). > > > > > > Actually the ability to treat the return type as not user controlled makes > > > this a little simpler... > > > > > > We want to flag a potential issue when a tagged address is used in a function > > > that wants untagged addresses (as marked by the address space annotation). > > > Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet > > > this condition. > > > > > > When a user correctly calls untagged_addr prior to calling these functions > > > the macro helpfully drops the user controlled flag which stops the condition > > > above triggering. > > > > > > Is there a way of getting smatch to drop the 'user controlled' flag without > > > relying on a bug? > > > > We know the high byte is either 0 or 0xff so we could do something like: > > > > struct symbol *type; > > sval_t invalid = { .type = &ullong_ctype, .vaule = 1ULL << 62 }; > > > > type = get_type(expr); > > if (!type || type_bits(type) != 64) > > return; > > > > if (!get_user_rl(expr, &rl)) > > return; > > if (rl_has_sval(rl, invalid)) > > sm_warning("potential tagged issue '%s'", name); > > > > I will fix the untagged_addr() handling very soon. > > Thanks for this. > > Though actually for MTE [1] the top byte may contain any value. But the above > could easily be updated. > > > > > > Thanks to your help I believe I have this working as follows: > > > > > > diff --git a/check_list.h b/check_list.h > > > index b1d24c504ba5..f7551e7c5215 100644 > > > --- a/check_list.h > > > +++ b/check_list.h > > > @@ -192,6 +192,7 @@ CK(check_nospec_barrier) > > > CK(check_spectre) > > > CK(check_spectre_second_half) > > > CK(check_implicit_dependencies) > > > +CK(check_tagged) > > > > > > /* wine specific stuff */ > > > CK(check_wine_filehandles) > > > diff --git a/check_tagged.c b/check_tagged.c > > > new file mode 100644 > > > index 000000000000..d14a81a6c33a > > > --- /dev/null > > > +++ b/check_tagged.c > > > @@ -0,0 +1,49 @@ > > > +#include "smatch.h" > > > +#include "smatch_extra.h" > > > + > > > +static void untagged_check(struct expression *expr) > > > +{ > > > + char *name; > > > + > > > + if (parse_error) > > > + return; > > > + > > > + if (is_impossible_path()) > > > + return; > > > + > > > + if (expr->type == EXPR_PREOP) > > > + return; > > > + > > > + if (is_user_rl(expr)) { > > > + if (expr->symbol && expr->symbol->ctype.as == 5) { > > > + name = expr_to_str(expr); > > > + sm_warning("potential tagged issue '%s'", name); > > > + free_string(name); > > > + } > > > + } > > > +} > > > + > > > +void check_tagged(int id) > > > +{ > > > + if (option_project != PROJ_KERNEL) > > > + return; > > > + > > > + add_hook(&untagged_check, EXPR_HOOK); > > > +} > > > > > > This correctly identifies functions that have been annotated with the __untagged > > > annotation that use data from userspace. However, it's not actually that > > > useful... > > > > > > For example, if I annotate the find_vma function, then it will flag this as a > > > potential tagged issue given that its called (from at least some contexts) with > > > userspace data. However find_vma is called all over the kernel, and so it's > > > difficult to figure out which caller called find_vma with user data. > > > > > > It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and > > > 'find_vma' target to look for all the callers where this is a leaf function, but > > > this doesn't track the call parameters to exclude paths where the leaf annotated > > > function contains user data. > > > > Huh... I haven't thought about that script in years... :/ > > > > What I do is I download the latest linux-next ever day and rebuild my DB > > every day. Each time you rebuild it, the call tree gets filled out > > a bit. After about a week then the DB is as complete as it is going to > > get. > > > > Then I use the smatch_data/db/smdb.py script to figure out the warnings. > > I should add an option so that it only shows callers which pass user > > data. Each call site has a unique caller ID. > > Thanks - I hadn't looked at that script, but looks very useful. > > By the way with the following hunk you can, for a given function, which call sites > pass user data. > > @@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info": > print_caller_info(filename, func) > elif sys.argv[1] == "user_data": > func = sys.argv[2] > + filename = sys.argv[3] > print_caller_info(filename, func, "USER_DATA") > elif sys.argv[1] == "param_value": > func = sys.argv[2] > > Though I've found it's quite useful to manually type commands in the sqlite3 CLI. > > > > > > > > I haven't fully understood var_user_rl - but it looks like functions such as > > > match_user_assign_function propogate the user_data tag on each run to update > > > state, and thus it's probably not easy to get the original source of user > > > data after is_user_rl determines its user. Is that correct? > > > > Use the get_user_rl() function instead. The var_user_rl() is only > > supposed to be used internally, to do math. > > OK. > > > > > > > > > I would be great if there was a way to obtain the function that provided the > > > original source of user data - then it would be possible to write a script that > > > shows the path (call chain) between the two functions - I hoped smatch_scripts/ > > > call_tree.pl could do this but it doesn't seem to do anything for me. > > > > That's the smatch_data/db/smdb.py script. I have it as an alias in vim > > so I can do CTRL-c to see how a function is called or CTRL-r to see > > what it returns > > > > # map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR> > > # map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR> > > > > I've spent some time playing with this - thanks. > > Through this discussion I'm able to detect when annotated function parameters contain > user provided values. The challenge for me is to detect where that data originated > from (i.e. following the parameter up the call tree) to ease debugging. > > My first attempt didn't trace the parameters and just looked at the call tree for any > functions which provided user data, however this resulted in false positives (e.g. > just because a function higher up in the call stack passed user data, it doesn't mean > it was this data that made it to the target function). > > I've made some progress by adapting the trace_param feature of smdb.py - I've modified > it such that for a given symbol (e.g. find_vma), it will trace the parameters up the > call tree, if it sees a function where the param is user data (8017) then it prints the > symbol. This seems to work, but it doesn't seem to catch everything it should. > > For example, find_vma is called by apply_vma_lock_flags which is called by do_mlock, > probing the database: > > sqlite> select * from caller_info where function='find_vma' and (parameter = -1 or parameter = 1) and caller='apply_vma_lock_flags'; > file|caller|function|call_id|static|type|parameter|key|value > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|0|-1||struct vm_area_struct*(*)(struct mm_struct*, ulong) > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1004|1|$|1 > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1014|1|$|p 0 > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1001|1|$|0,4096-18446744073709547520 > > The third entry indicates that find_vma was called by apply_vma_lock_flags and find_vma(arg 1) == apply_vma_lock_flags(arg 0), > this is correct, moving up the call tree: > > sqlite> select * from caller_info where function='apply_vma_lock_flags' and (parameter = -1 or parameter = 0) and caller='do_mlock'; > file|caller|function|call_id|static|type|parameter|key|value > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|0|-1||int(*)(ulong, ulong, ulong) > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1004|0|$|1 > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1001|0|$|0,4096-18446744073709547520 > > The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of > apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and > our tracing of params stops early. Do you have any clue why this may be? It seems that this one is due to the 'param_was_set' check in smatch_data_source.c, removing this check overcomes this issue. Is this check necessary? Or perhaps this should have returned true? Also should "get_user" be added to smatch_kernel_user_data.c in addition to copy_from_user? Thanks, Andrew Murray > > I've rebuilt the database in a loop with ~/smatch/smatch_scripts/build_kernel_data.sh. > > Thanks, > > Andrew Murray > > [1] https://llvm.org/devmtg/2018-10/slides/Serebryany-Stepanov-Tsyrklevich-Memory-Tagging-Slides-LLVM-2018.pdf > > > regards, > > dan carpenter