Re: Detecting user data on base types

Dan Carpenter <dan.carpenter@xxxxxxxxxx> · Thu, 30 May 2019 20:46:42 +0300

On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote:
> > Another problem is that Smatch doesn't understand how the sign_extend64()
> > function works so it doesn't understand the untagged_addr() macro.  :/
> > I see one bug here and a missing feature...  Right now Smatch thinks
> > that the return value is totally unknown and not user controlled.  This
> > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().
> 
> Actually the ability to treat the return type as not user controlled makes
> this a little simpler...
> 
> We want to flag a potential issue when a tagged address is used in a function
> that wants untagged addresses (as marked by the address space annotation).
> Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
> this condition.
> 
> When a user correctly calls untagged_addr prior to calling these functions
> the macro helpfully drops the user controlled flag which stops the condition
> above triggering.
> 
> Is there a way of getting smatch to drop the 'user controlled' flag without
> relying on a bug?

We know the high byte is either 0 or 0xff so we could do something like:

	struct symbol *type;
	sval_t invalid = { .type = &ullong_ctype, .vaule = 1ULL << 62 };

	type = get_type(expr);
	if (!type || type_bits(type) != 64)
		return;

	if (!get_user_rl(expr, &rl))
		return;
	if (rl_has_sval(rl, invalid))
		sm_warning("potential tagged issue '%s'", name);

I will fix the untagged_addr() handling very soon.

> Thanks to your help I believe I have this working as follows:
> 
> diff --git a/check_list.h b/check_list.h
> index b1d24c504ba5..f7551e7c5215 100644
> --- a/check_list.h
> +++ b/check_list.h
> @@ -192,6 +192,7 @@ CK(check_nospec_barrier)
>  CK(check_spectre)
>  CK(check_spectre_second_half)
>  CK(check_implicit_dependencies)
> +CK(check_tagged)
>  
>  /* wine specific stuff */
>  CK(check_wine_filehandles)
> diff --git a/check_tagged.c b/check_tagged.c
> new file mode 100644
> index 000000000000..d14a81a6c33a
> --- /dev/null
> +++ b/check_tagged.c
> @@ -0,0 +1,49 @@
> +#include "smatch.h"
> +#include "smatch_extra.h"
> +
> +static void untagged_check(struct expression *expr)
> +{
> +       char *name;
> +
> +       if (parse_error)
> +               return;
> +
> +       if (is_impossible_path())
> +               return;
> +
> +       if (expr->type == EXPR_PREOP)
> +               return;
> +
> +       if (is_user_rl(expr)) {
> +               if (expr->symbol && expr->symbol->ctype.as == 5) {
> +                       name = expr_to_str(expr);
> +                       sm_warning("potential tagged issue '%s'", name);
> +                       free_string(name);
> +               }
> +       }
> +}
> +
> +void check_tagged(int id)
> +{
> +       if (option_project != PROJ_KERNEL)
> +               return;
> +
> +       add_hook(&untagged_check, EXPR_HOOK);
> +}
>  
> This correctly identifies functions that have been annotated with the __untagged
> annotation that use data from userspace. However, it's not actually that
> useful...
> 
> For example, if I annotate the find_vma function, then it will flag this as a
> potential tagged issue given that its called (from at least some contexts) with
> userspace data. However find_vma is called all over the kernel, and so it's
> difficult to figure out which caller called find_vma with user data.
> 
> It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
> 'find_vma' target to look for all the callers where this is a leaf function, but
> this doesn't track the call parameters to exclude paths where the leaf annotated
> function contains user data.

Huh...  I haven't thought about that script in years...  :/

What I do is I download the latest linux-next ever day and rebuild my DB
every day.  Each time you rebuild it, the call tree gets filled out
a bit.  After about a week then the DB is as complete as it is going to
get.

Then I use the smatch_data/db/smdb.py script to figure out the warnings.
I should add an option so that it only shows callers which pass user
data.  Each call site has a unique caller ID.

> I haven't fully understood var_user_rl - but it looks like functions such as
> match_user_assign_function propogate the user_data tag on each run to update
> state, and thus it's probably not easy to get the original source of user
> data after is_user_rl determines its user. Is that correct?

Use the get_user_rl() function instead.  The var_user_rl() is only
supposed to be used internally, to do math.

> 
> I would be great if there was a way to obtain the function that provided the
> original source of user data - then it would be possible to write a script that
> shows the path (call chain) between the two functions - I hoped smatch_scripts/
> call_tree.pl could do this but it doesn't seem to do anything for me.

That's the smatch_data/db/smdb.py script.  I have it as an alias in vim
so I can do CTRL-c to see how a function is called or CTRL-r to see
what it returns

# map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
# map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>

regards,
dan carpenter