On Wed 13-10-21 22:31:37, Theodore Ts'o wrote:
> On Thu, Sep 23, 2021 at 08:12:04PM +0800, xueqingwen wrote:
> > ....
> > Therefore, the handle was delayed to start until finding the pages that
> > need mapping in ext4_writepages(). With this patch, the above problem did
> > not recur. We had looked this patch over pretty carefully, but another pair
> > of eyes would be appreciated. Please help to review whether there are
> > defects and whether it can be merged to upstream.
>
> Hi,
>
> I've tried tests against this patch, and it's causing a large number
> of hangs. For most of the hangs, it's while running generic/269,
> although there were a few other tests which would cause the kernel to
> hang.
>
> I don't have time to try to figure out why your patch might be
> failing, at least not this week. So if you could take a look at
> the test artifacts in this xz compressed tarfile, I'd appreciate it.
> The "report" file contains a summary report, and the *.serial files
> contain the output from the serial console of the VM's which were
> hanging with your patch applied. Perhaps you can determine what needs
> to be fixed to prevent the kernel hangs?

Well, I guess the problem is that the proper lock ordering is transaction
start -> page lock, and this patch inverts it, so it creates all sorts of
deadlock possibilities. Lockdep will not catch this problem because the page
lock is not tracked by it. I do understand the problem description, but this
just isn't a viable solution to it.

There are some possible solutions, like locking the first page outside of
the transaction, then unlocking it, starting a transaction, and then only
trylocking pages in mpage_prepare_extent_to_map(), but that tends to result
in pretty ugly code. Also, we'd need to make sure we don't call submit_bio()
while we have a transaction started (as that is where throttling happens);
any such place may cause the described latency problems. It's going to be
rather difficult to find and address all such places.
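To make the "lock outside, then only trylock inside" idea a bit more concrete, here is a rough userspace analogy using pthread mutexes. It is not kernel code and not a proposed patch: toy_page, toy_transaction, toy_find_first_dirty() and toy_writeback() are all made-up names, a plain mutex stands in for both the page lock and the running transaction, and the submit_bio() throttling concern is not modelled at all.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are ext4 or jbd2 interfaces. */
struct toy_page {
	pthread_mutex_t lock;	/* plays the role of the page lock */
	bool dirty;
};

/* Plays the role of a started transaction handle. */
static pthread_mutex_t toy_transaction = PTHREAD_MUTEX_INITIALIZER;

/*
 * Phase 1: no transaction is held, so blocking on a page lock is fine.
 * Each page is unlocked again before we move on, so nothing is held
 * when the transaction is started later.
 */
static long toy_find_first_dirty(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		pthread_mutex_lock(&pages[i].lock);
		bool dirty = pages[i].dirty;
		pthread_mutex_unlock(&pages[i].lock);
		if (dirty)
			return (long)i;
	}
	return -1;
}

/*
 * Phase 2: the transaction is held, so page locks are only tried.
 * A contended page ends the batch instead of blocking, which keeps
 * the ordering transaction -> page lock free of waits in the
 * opposite direction.
 */
static size_t toy_writeback(struct toy_page *pages, size_t n)
{
	long first = toy_find_first_dirty(pages, n);
	size_t written = 0;

	if (first < 0)
		return 0;	/* nothing to write, no transaction started */

	pthread_mutex_lock(&toy_transaction);
	for (size_t i = (size_t)first; i < n; i++) {
		if (pthread_mutex_trylock(&pages[i].lock) != 0)
			break;	/* somebody else holds the page, give up */
		if (pages[i].dirty) {
			pages[i].dirty = false;
			written++;
		}
		pthread_mutex_unlock(&pages[i].lock);
	}
	pthread_mutex_unlock(&toy_transaction);
	return written;
}
```

Note that the page found in phase 1 may already be clean or locked by someone else by the time phase 2 gets to it, so the caller has to cope with short or empty batches. That is part of why the real thing tends to get ugly.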

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR