Re: Possible FS race condition between iterate_dir and d_alloc_parallel

"zhengbin (A)" <zhengbin13@xxxxxxxxxx> · Mon, 9 Sep 2019 22:10:00 +0800

On 2019/9/4 14:15, zhengbin (A) wrote:
> On 2019/9/3 23:41, Al Viro wrote:
>
>> On Tue, Sep 03, 2019 at 04:40:07PM +0100, Al Viro wrote:
>>> On Tue, Sep 03, 2019 at 10:44:32PM +0800, zhengbin (A) wrote:
>>>> We recently encountered an oops(the filesystem is tmpfs)
>>>> crash> bt
>>>>  #9 [ffff0000ae77bd60] dcache_readdir at ffff0000672954bc
>>>>
>>>> The reason is as follows:
>>>> Process 1 cat test which is not exist in directory A, process 2 cat test in directory A too.
>>>> process 3 create new file in directory B, process 4 ls directory A.
>>> good grief, what screen width do you have to make the table below readable?
>>>
>>> What I do not understand is how the hell does your dtry2 manage to get actually
>>> freed and reused without an RCU delay between its removal from parent's
>>> ->d_subdirs and freeing its memory.  What should've happened in that
>>> scenario is
>>> 	* process 4, in next_positive() grabs rcu_read_lock().
>>> 	* it walks into your dtry2, which might very well be
>>> just a chunk of memory waiting to be freed; it sure as hell is
>>> not positive.  skipped is set to true, 'i' is not decremented.
>>> Note that ->d_child.next points to the next non-cursor sibling
>>> (if any) or to the ->d_subdir of parent, so we can keep walking.
>>> 	* we keep walking for a while; eventually we run out of
>>> counter and leave the loop.
>>>
>>> Only after that we do rcu_read_unlock() and only then anything
>>> observed in that loop might be freed and reused.
> You are right, I miss this.
>>> Confused...  OTOH, I might be misreading that table of yours -
>>> it's about 30% wider than the widest xterm I can get while still
>>> being able to read the font...
> The table is my guess. This oops happens sometimes
>
> (We have one vmcore, others just have log, and the backtrace is same with vmcore, so the reason should be same).
>
> Unfortunately, we do not know how to reproduce it. The vmcore has such a law:
>
> 1、dirA has 177 files, and it is OK
>
> 2、dirB has 25 files, and it is OK
>
> 3、When we ls dirA, it begins with ".", "..", dirB's first file, second file... last file,  last file->next = &(dirB->d_subdirs)
>
> -------->
>
> crash> struct dir_context ffff0000ae77be30  --->dcache_readdir ctx
>
> struct dir_context {
>
> actor = 0xffff00006727d760 <filldir64>,
>
> pos = 27   --->27 = . + .. + 25 files
>
> }
>
>
> next_positive
>
>   for (p = from->next; p != &parent->d_subdirs; p = p->next)  --->parent is dirA, so will continue
>
>
> This should be a bug, I think it is related with locks,  especially with commit ebaaa80e8f20 ("lockless next_positive()").
>
> Howerver, until now, I do not find the reason, Any suggestions?

They will be a such timing as follows:

1. insert a negative dentryB1 to dirB,  dentryB1->next = dirB's first positive dentry(such as fileB)      d_alloc_parallel-->d_alloc

2.insert a negative dentryB2 to dirB, dentryB2->next = dentryB1                                                        d_alloc_parallel-->d_alloc

3. remove dentryB1 from dirB,  dentryB1->next will be fileB too                                                         d_alloc_parallel->dput(new)

4. alloc dentryB1 to dirA,  dirA's d_subdirs->next will be dentryB1

process 1(ls dirA)                       |  process 2(alloc dentryB1 to dirA: d_alloc_parallel-->d_alloc)

dcache_readdir                          |  d_alloc                             

    p = &dentry->d_subdirs;      |      

    next_positive                         |      

                                                  |       __d_alloc-->INIT_LIST_HEAD(&dentry->d_child)

                                                  |       list_add(&dentry->d_child, &parent->d_subdirs)  --->cpu may be executed out of order, first set parent->d_subdirs->next = dentryB1

        p = from->next                 |

        ---> p will be dentryB1, and dentryB1->next will be fileB

We can solute it in 2 ways:

1. add a smp_wmb between __d_alloc and list_add(&dentry->d_child, &parent->d_subdirs)

2. revert commit ebaaa80e8f20 ("lockless next_positive()")

>> Incidentally, which kernel was that on?
> 4.19-stable,  the code of iterate_dir and d_alloc_parallel is same with master
>> .
>>