On Mon, Aug 19, 2024 at 9:35 PM Nilay Shroff <nilay@xxxxxxxxxxxxx> wrote:
>
>
>
> On 8/19/24 18:04, Shinichiro Kawasaki wrote:
> > On Aug 14, 2024 / 18:05, Nilay Shroff wrote:
> >>
> >>
> >> On 8/13/24 12:36, Yi Zhang wrote:
> >>> On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay@xxxxxxxxxxxxx> wrote:
> >>>
> >>> There are no simultaneous tests running during the CKI test runs.
> >>> I reproduced the failure on that server, and it can always be reproduced
> >>> within 5 runs:
> >>> # sh a.sh
> >>> ==============================0
> >>> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
> >>>     runtime  21.496s  ...  21.398s
> >>> ==============================1
> >>> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
> >>>     runtime  21.398s  ...  21.974s
> >>>     --- tests/nvme/052.out	2024-08-10 00:30:06.989814226 -0400
> >>>     +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad	2024-08-13 02:53:51.635047928 -0400
> >>>     @@ -1,2 +1,5 @@
> >>>      Running nvme/052
> >>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>>      Test complete
> >>> # uname -r
> >>> 6.11.0-rc3
> >>
> >> We may need to debug this further. Is it possible to patch blktests and
> >> collect some details when this issue manifests? If yes, then can you please
> >> apply the below diff and re-run your test? This patch would capture the output
> >> of "nvme list" and the "sysfs attribute tree created under the namespace head node"
> >> and store those details in the 052.full file.
> >>
> >> diff --git a/common/nvme b/common/nvme
> >> index 9e78f3e..780b5e3 100644
> >> --- a/common/nvme
> >> +++ b/common/nvme
> >> @@ -589,8 +589,23 @@ _find_nvme_ns() {
> >>                 if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then
> >>                         continue
> >>                 fi
> >> +               echo -e "\nBefore ${ns}/uuid check:\n" >> ${FULL}
> >> +               echo -e "\n`nvme list -v`\n" >> ${FULL}
> >> +               echo -e "\n`tree ${ns}`\n" >> ${FULL}
> >> +
> >>                 [ -e "${ns}/uuid" ] || continue
> >>                 uuid=$(cat "${ns}/uuid")
> >> +
> >> +               if [ "$?" = "1" ]; then
> >> +                       echo -e "\nFailed to read $ns/uuid\n" >> ${FULL}
> >> +                       echo "`nvme list -v`" >> ${FULL}
> >> +                       if [ -d "${ns}" ]; then
> >> +                               echo -e "\n`tree ${ns}`\n" >> ${FULL}
> >> +                       else
> >> +                               echo -e "\n${ns} doesn't exist!\n" >> ${FULL}
> >> +                       fi
> >> +               fi
> >> +
> >>                 if [[ "${subsys_uuid}" == "${uuid}" ]]; then
> >>                         basename "${ns}"
> >>                 fi
> >>
> >>
> >> After applying the above diff, when this issue occurs on your system, copy the
> >> file "</path/to/blktests>/results/nodev_tr_loop/nvme/052.full" and send it across.
> >> This may give us some clue about what might be going wrong.
> >
> > Nilay, thank you for this suggestion. To follow it, I tried to recreate the
> > failure again, and managed to do it :) When I repeat the test case 20 or 40
> > times on one of my test machines, the failure is observed in a stable manner.
>
> Shinichiro, I am glad that you were able to recreate this issue.
>
> > I applied your debug patch above to blktests, then I repeated the test case.
> > Unfortunately, the failure disappeared. When I repeat the test case 100 times,
> > the failure is not observed. I guess the echoes added for debugging changed the
> > timing of the accesses to the sysfs uuid file, and then the failure disappeared.
>
> Yes, this could be possible. BTW, Yi tried the same patch, and with the patch applied,
> the issue could still be reproduced on Yi's testbed!!
>
> > This helped me think about the cause. The test case repeats _create_nvmet_ns
> > and _remove_nvmet_ns. Then, it repeats creating and removing the sysfs uuid
> > file. I guess that when _remove_nvmet_ns echoes 0 to ${nvmet_ns_path}/enable to
> > remove the namespace, it does not wait for the completion of the removal work.
> > Then, when _find_nvme_ns() checks the existence of the sysfs uuid file, it sees
> > the sysfs uuid file that the previous _remove_nvmet_ns left behind. When it then
> > cats the sysfs uuid file, the read fails because the file gets removed before it
> > is recreated for the next _create_nvmet_ns.
>
> I agree with your assessment of the plausible cause of this issue. I just reviewed
> the nvme target kernel code, and it's now apparent to me that we need to wait for the
> removal of the namespace before we re-create the next namespace. I think this was
> an oversight.
>
> > Based on this guess, I created the patch below. It modifies the test case to wait
> > until the namespace device disappears after calling _remove_nvmet_ns. (I assume
> > that the sysfs uuid file disappears when the device file disappears.) With
> > this patch, the failure was not observed when repeating the test 100 times. I also
> > reverted the kernel commit ff0ffe5b7c3c ("nvme: fix namespace removal list")
> > from v6.11-rc4, then confirmed that the test case with this change can still
> > detect the regression.
>
> I am pretty sure that your patch would solve this issue.
>
> > I will do some more confirmation. If it goes well, I will post this change as
> > a formal patch.
> >
> > diff --git a/tests/nvme/052 b/tests/nvme/052
> > index cf6061a..469cefd 100755
> > --- a/tests/nvme/052
> > +++ b/tests/nvme/052
> > @@ -39,15 +39,32 @@ nvmf_wait_for_ns() {
> >                 ns=$(_find_nvme_ns "${uuid}")
> >         done
> >
> > +       echo "$ns"
> >         return 0
> >  }
> >
> > +nvmf_wait_for_ns_removal() {
> > +       local ns=$1 i
> > +
> > +       for ((i = 0; i < 10; i++)); do
> > +               if [[ ! -e /dev/$ns ]]; then
> > +                       return
> > +               fi
> > +               sleep .1
> > +               echo "wait removal of $ns" >> "$FULL"
> > +       done
> > +
> > +       if [[ -e /dev/$ns ]]; then
> > +               echo "Failed to remove the namespace $"
> > +       fi
> > +}
> > +
> >  test() {
> >         echo "Running ${TEST_NAME}"
> >
> >         _setup_nvmet
> >
> > -       local iterations=20
> > +       local iterations=20 ns
> >
> >         _nvmet_target_setup
> >
> > @@ -63,7 +80,7 @@ test() {
> >                 _create_nvmet_ns "${def_subsysnqn}" "${i}" "$(_nvme_def_file_path).$i" "${uuid}"
> >
> >                 # wait until async request is processed and ns is created
> > -               nvmf_wait_for_ns "${uuid}"
> > +               ns=$(nvmf_wait_for_ns "${uuid}")
> >                 if [ $? -eq 1 ]; then
> >                         echo "FAIL"
> >                         rm "$(_nvme_def_file_path).$i"
> > @@ -71,6 +88,7 @@
> >                 fi
> >
> >                 _remove_nvmet_ns "${def_subsysnqn}" "${i}"
> > +               nvmf_wait_for_ns_removal "$ns"
> >                 rm "$(_nvme_def_file_path).$i"
> >         }
> >         done
>
> I think there are some formatting issues in the above patch. I see some stray characters
> which you may clean up/fix later when you send the formal patch.
>
> Yi, I think you may also try the above patch on your testbed and confirm the result.

Nilay/Shinichiro,
Confirmed that the failure cannot be reproduced with this patch now.

>
> Thanks,
> --Nilay
>

--
Best Regards,
Yi Zhang
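
For anyone who wants to work around the failure before the formal blktests fix lands, the idea discussed above boils down to: after disabling the nvmet namespace, poll until the corresponding block device node disappears before reusing the namespace ID. Below is a rough, untested sketch of that idea as a standalone bash helper; the function name, the example device name and the configfs path in the usage comment are illustrative assumptions, not part of blktests.

#!/bin/bash
#
# Wait until the given namespace block device (e.g. "nvme1n2") disappears,
# giving up after roughly one second. Returns 0 once the device node is gone.
wait_ns_removed() {
        local dev="$1" i

        for ((i = 0; i < 10; i++)); do
                # Once the device node is gone, the asynchronous namespace
                # removal has (per the assumption in the thread) also removed
                # the sysfs uuid attribute, so the next iteration is safe.
                [[ -e "/dev/${dev}" ]] || return 0
                sleep 0.1
        done

        echo "timed out waiting for /dev/${dev} to be removed" >&2
        return 1
}

# Example usage (subsystem NQN and device name are placeholders):
#   echo 0 > /sys/kernel/config/nvmet/subsystems/<subsysnqn>/namespaces/1/enable
#   wait_ns_removed nvme1n2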