Re: [PATCH v2] ceph: Fix kernel crash in generic/397 test

David Howells <dhowells@xxxxxxxxxx> · Tue, 28 Jan 2025 20:01:09 +0000

I added some tracing to fs/ceph/addr.c and this highlights the bug causing the
hang that I'm seeing.

So what I see is ceph_writepages_start() being entered and getting a
collection of folios from filemap_get_folios_tag():

    netfs_ceph_writepages: i=10000004f52 ix=0
    netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6

Then we take the first dirty folio out of the batch and attempt to lock it:

    netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock

which succeeds.  We then pass through a number of lines:

    netfs_ceph_wp_track: i=10000004f52 line=1218

which is the "/* shift unused page to beginning of fbatch */" comment, then:

    netfs_ceph_wp_track: i=10000004f52 line=1238

which is immediately before "offset = ceph_fscrypt_page_offset(pages[0]);", then:

    netfs_ceph_wp_track: i=10000004f52 line=1264

which is the error handling path of:

		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
			rc = -EIO;
			goto release_folios;
		}

and then:

    netfs_ceph_wp_track: i=10000004f52 line=1389

which is "release_folios:".
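
For anyone wanting to reproduce this kind of line-by-line trace, the idea can
be sketched in userspace as below.  This is only an analogue, not the tracing
patch itself: track_log, track_record(), TRACK() and traced_path() are
made-up names, and the -5 merely stands in for -EIO.

```c
#include <stddef.h>

#define TRACK_MAX 16

/* Hypothetical in-memory log of the source lines that were reached. */
struct track_log {
	int lines[TRACK_MAX];
	size_t nr;
};

static void track_record(struct track_log *tl, int line)
{
	if (tl->nr < TRACK_MAX)
		tl->lines[tl->nr++] = line;
}

/* Record the source line at which the macro is expanded. */
#define TRACK(tl) track_record((tl), __LINE__)

/* Toy function with the same shape as the path traced above. */
static int traced_path(struct track_log *tl, int blocker_ok)
{
	int rc = 0;

	TRACK(tl);		/* reached the check */
	if (!blocker_ok) {
		TRACK(tl);	/* error handling path */
		rc = -5;	/* stands in for -EIO */
		goto release;
	}
	TRACK(tl);		/* normal path */
release:
	TRACK(tl);		/* release label reached */
	return rc;
}
```

Reading back the recorded line numbers shows which branch was taken, which is
all the line= events above are doing.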

We then reenter ceph_writepages_start(), get the same batch of dirty folios
and try to lock the same folio (index 3) again:

    netfs_ceph_writepages: i=10000004f52 ix=0
    netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6
    netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock

and that's where we hang.

I think the problem is that the error handling here:

		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
			rc = -EIO;
			goto release_folios;
		}

is insufficient.  The folios are still locked at that point and can't just be
released: the next pass then blocks on the lock taken in the first pass.
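
A rough userspace model of that failure, under the assumption that the error
path drops the folio references without unlocking.  Nothing here is kernel
code: model_folio, model_writepages() and the MODEL_* values are made up for
illustration, and unlocking on error is only one plausible shape for a fix
(the real code may need to do more).

```c
#include <stdbool.h>
#include <stddef.h>

enum model_result {
	MODEL_OK = 0,
	MODEL_EIO = -5,		/* mirrors rc = -EIO in the quoted code */
	MODEL_WOULD_HANG = -1,	/* the real code would sleep on the lock */
};

/* Hypothetical stand-in for a folio: a lock bit and a refcount. */
struct model_folio {
	bool locked;
	int refcount;
};

/* One writeback pass over a batch of nr folios. */
static enum model_result model_writepages(struct model_folio *batch,
					  size_t nr, bool blocker_ok,
					  bool unlock_on_error)
{
	enum model_result rc = MODEL_OK;
	size_t i, nr_locked = 0;

	/* Batch lookup: take a reference on every folio. */
	for (i = 0; i < nr; i++)
		batch[i].refcount++;

	/* Lock in order; an already-locked folio is where we would hang. */
	for (i = 0; i < nr; i++) {
		if (batch[i].locked) {
			rc = MODEL_WOULD_HANG;
			goto out;
		}
		batch[i].locked = true;
		nr_locked++;
	}

	if (!blocker_ok) {
		rc = MODEL_EIO;
		if (!unlock_on_error)
			nr_locked = 0;	/* suspected bug: locks are left held */
	}
out:
	/* Unlock what this pass still owns, then drop the references. */
	for (i = 0; i < nr_locked; i++)
		batch[i].locked = false;
	for (i = 0; i < nr; i++)
		batch[i].refcount--;
	return rc;
}
```

With unlock_on_error false, a failing pass returns MODEL_EIO and leaves the
folios locked, so the next pass returns MODEL_WOULD_HANG on the first folio.
With unlock_on_error true, the next pass completes with MODEL_OK.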

Why ceph_inc_osd_stopping_blocker() fails is also something that needs looking
at.

David