On Aug 14, 2024 / 18:05, Nilay Shroff wrote: > > > On 8/13/24 12:36, Yi Zhang wrote: > > On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay@xxxxxxxxxxxxx> wrote: > > > > There are no simultaneous tests during the CKI tests running. > > I reproduced the failure on that server and always can be reproduced > > within 5 times: > > # sh a.sh > > ==============================0 > > nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed] > > runtime 21.496s ... 21.398s > > ==============================1 > > nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed] > > runtime 21.398s ... 21.974s > > --- tests/nvme/052.out 2024-08-10 00:30:06.989814226 -0400 > > +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad > > 2024-08-13 02:53:51.635047928 -0400 > > @@ -1,2 +1,5 @@ > > Running nvme/052 > > +cat: /sys/block/nvme1n2/uuid: No such file or directory > > +cat: /sys/block/nvme1n2/uuid: No such file or directory > > +cat: /sys/block/nvme1n2/uuid: No such file or directory > > Test complete > > # uname -r > > 6.11.0-rc3 > > We may need to debug this further. Is it possible to patch blktest and > collect some details when this issue manifests? If yes then can you please > apply the below diff and re-run your test? This patch would capture output > of "nvme list" and "sysfs attribute tree created under namespace head node" > and store those details in 052.full file. > > diff --git a/common/nvme b/common/nvme > index 9e78f3e..780b5e3 100644 > --- a/common/nvme > +++ b/common/nvme > @@ -589,8 +589,23 @@ _find_nvme_ns() { > if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then > continue > fi > + echo -e "\nBefore ${ns}/uuid check:\n" >> ${FULL} > + echo -e "\n`nvme list -v`\n" >> ${FULL} > + echo -e "\n`tree ${ns}`\n" >> ${FULL} > + > [ -e "${ns}/uuid" ] || continue > uuid=$(cat "${ns}/uuid") > + > + if [ "$?" = "1" ]; then > + echo -e "\nFailed to read $ns/uuid\n" >> ${FULL} > + echo "`nvme list -v`" >> ${FULL} > + if [ -d "${ns}" ]; then > + echo -e "\n`tree ${ns}`\n" >> ${FULL} > + else > + echo -e "\n${ns} doesn't exist!\n" >> ${FULL} > + fi > + fi > + > if [[ "${subsys_uuid}" == "${uuid}" ]]; then > basename "${ns}" > fi > > > After applying the above diff, when this issue occurs on your system copy this > file "</path/to/blktests>/results/nodev_tr_loop/nvme/052.full" and send it across. > This may give us some clue about what might be going wrong. Nilay, thank you for this suggestion. To follow it, I tried to recreate the failure again, and managed to do it :) When I repeat the test case 20 or 40 times one of my test machines, the failure is observed in stable manner. I applied your debug patch above to blktests, then I repeated the test case. Unfortunately, the failure disappeared. When I repeat the test case 100 times, the failure was not observed. I guess the echos for debug changed the timing to access the sysfs uuid file, then the failure disappeared. This helped me think about the cause. The test case repeats _create_nvmet_ns and _remove_nvmet_ns. Then, it repeats creating and removing the sysfs uuid file. I guess when _remove_nvmet_ns echos 0 to ${nvemt_ns_path}/enable to remove the namespace, it does not wait for the completion of the removal work. Then, when _find_nvme_ns() checks existence of the sysfs uuid file, it refers to the sysfs uuid file that the previous _remove_nvmet_ns left. When it does cat to the sysfs uuid file, it fails because the sysfs uuid file has got removed, before recreating it for the next _create_nvmet_ns. Based on this guess, I created a patch below. It modifies the test case to wait for the namespace device disappears after calling _remove_nvmet_ns. (I assume that the sysfs uuid file disappears when the device file disappears). With this patch, the failure was not observed by repeating it 100 times. I also reverted the kernel commit ff0ffe5b7c3c ("nvme: fix namespace removal list") from v6.11-rc4, then confirmed that the test case with this change still can detect the regression. I will do some more confirmation. If it goes well, will post this change as a formal patch. diff --git a/tests/nvme/052 b/tests/nvme/052 index cf6061a..469cefd 100755 --- a/tests/nvme/052 +++ b/tests/nvme/052 @@ -39,15 +39,32 @@ nvmf_wait_for_ns() { ns=$(_find_nvme_ns "${uuid}") done + echo "$ns" return 0 } +nvmf_wait_for_ns_removal() { + local ns=$1 i + + for ((i = 0; i < 10; i++)); do + if [[ ! -e /dev/$ns ]]; then + return + fi + sleep .1 + echo "wait removal of $ns" >> "$FULL" + done + + if [[ -e /dev/$ns ]]; then + echo "Failed to remove the namespace $" + fi +} + test() { echo "Running ${TEST_NAME}" _setup_nvmet - local iterations=20 + local iterations=20 ns _nvmet_target_setup @@ -63,7 +80,7 @@ test() { _create_nvmet_ns "${def_subsysnqn}" "${i}" "$(_nvme_def_file_path).$i" "${uuid}" # wait until async request is processed and ns is created - nvmf_wait_for_ns "${uuid}" + ns=$(nvmf_wait_for_ns "${uuid}") if [ $? -eq 1 ]; then echo "FAIL" rm "$(_nvme_def_file_path).$i" @@ -71,6 +88,7 @@ test() { fi _remove_nvmet_ns "${def_subsysnqn}" "${i}" + nvmf_wait_for_ns_removal "$ns" rm "$(_nvme_def_file_path).$i" } done