Re: [PATCH 0/3] sparc: port to copy_thread_tls() and struct kernel_clone_args

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 21/05/2020 20:08, Mark Cave-Ayland wrote:

> On 19/05/2020 01:24, Al Viro wrote:
> 
>>>> If you able to reproduce the issue consistently and can help figure out what's going
>>>> on then that would be a great help. Perhaps it might make sense to split this into a
>>>> separate thread and drop the non-sparc lists?
>>>
>>> Sure, no problem.As for "able to reproduce"
>>   -generally takes under half an hour.
>>> Less in this case, as you can see from printk timestamps...
>>
>> FWIW, right after boot
>> root@sparc64:/tmp# wget http://ftp.us.debian.org/debian/pool/main/l/linux/linux_5.7~rc5.orig.tar.xz
>> --2020-05-18 19:23:31--  http://ftp.us.debian.org/debian/pool/main/l/linux/linux_5.7~rc5.orig.tar.xz
>> Resolving ftp.us.debian.org (ftp.us.debian.org)... 208.80.154.15, 64.50.233.100, 64.50.236.52, ...
>> Connecting to ftp.us.debian.org (ftp.us.debian.org)|208.80.154.15|:80... connected.
>> HTTP request sent, awaiting response... 200 OK
>> Length: 117279780 (112M) [application/octet-stream]
>> Saving to: ‘linux_5.7~rc5.orig.tar.xz’
>>
>>           linux_5.7   0%[                    ]       0  --.-KB/s               [  216.454929] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> .tar.xz              63%[===========>        ]  71.36M  5.32MB/s    eta 9s     [  261.490162] ata1: lost interrupt (Status 0x50)
>> [  261.491467] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> [  261.492164] ata1.00: failed command: FLUSH CACHE
>> [  261.492773] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> [  261.492773]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
>> [  261.493920] ata1.00: status: { DRDY }
>> [  261.494587] ata1: soft resetting link
>> [  261.495030] ata2: lost interrupt (Status 0x58)
>> [  261.658539] ata1.00: configured for UDMA/33
>> [  261.658987] ata1.00: retrying FLUSH 0xe7 Emask 0x4
>> [  266.854943] ata2.00: qc timeout (cmd 0xa0)
>> [  266.855567] ata2.00: TEST_UNIT_READY failed (err_mask=0x5)
>> [  272.229617] ata2.00: qc timeout (cmd 0xa0)
>> [  272.230028] ata2.00: TEST_UNIT_READY failed (err_mask=0x5)
>> [  272.230851] ata2.00: limiting speed to UDMA/33:PIO3
>> <similar to earlier, this time with fs errors - different ATA command failing>
>>
>> When writing *not* to disk:
>> root@sparc64:~# mount -t ramfs none /tmp
>> root@sparc64:~# cd /tmp/
>> root@sparc64:/tmp# wget http://ftp.us.debian.org/debian/pool/main/l/linux/linux_5.7~rc5.orig.tar.xz
>> --2020-05-18 19:39:58--  http://ftp.us.debian.org/debian/pool/main/l/linux/linux_5.7~rc5.orig.tar.xz
>> Resolving ftp.us.debian.org (ftp.us.debian.org)... 208.80.154.15, 64.50.236.52, 64.50.233.100, ...
>> Connecting to ftp.us.debian.org (ftp.us.debian.org)|208.80.154.15|:80... connected.
>> HTTP request sent, awaiting response... 200 OK
>> Length: 117279780 (112M) [application/octet-stream]
>> Saving to: ‘linux_5.7~rc5.orig.tar.xz’
>>
>>          linux_5.7~   0%[                    ] 918.40K  4.38MB/s               [   82.810156] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> [   82.830163] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> [   82.832862] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> [   82.853928] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> inux_5.7~rc5.orig.t   3%[                    ]   3.72M  1.63MB/s               [   84.860985] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> [   84.878113] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> [   84.886409] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> 7~rc5.orig.tar.xz     6%[>                   ]   7.09M  1.54MB/s    eta 46s    [  118.099900] ata2: lost interrupt (Status 0x58)
>> [  122.195865] ata1: lost interrupt (Status 0x50)
>> [  122.197426] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> [  122.199156] ata1.00: failed command: WRITE DMA
>> [  122.200488] ata1.00: cmd ca/00:08:ac:fb:46/00:00:00:00:00/e0 tag 0 dma 4096 out
>> [  122.200488]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
>> [  122.203720] ata1.00: status: { DRDY }
>> [  122.204870] ata1: soft resetting link
>> [  122.365836] ata1.00: configured for UDMA/33
>> [  122.367282] ata1: EH complete
>> [  123.463696] ata2.00: qc timeout (cmd 0xa0)
>> [  123.464129] ata2.00: TEST_UNIT_READY failed (err_mask=0x5)
>> [  128.839650] ata2.00: qc timeout (cmd 0xa0)
>> [  128.840747] ata2.00: TEST_UNIT_READY failed (err_mask=0x5)
>> [  128.842261] ata2.00: limiting speed to UDMA/33:PIO3
>> [  134.215584] ata2.00: qc timeout (cmd 0xa0)
>> [  134.215995] ata2.00: TEST_UNIT_READY failed (err_mask=0x5)
>> [  134.216499] ata2.00: disabled
>> <usual series of bus resets, with complaints about jbd2 locked for too long, etc.; IO on
>> /var/log/exim4/mainlog, of all things>
>>
>> Very interesting...  The same with exim4 and sshd stopped passes with
>> lots of "out of receive descriptors", but without a hang.  The same with
>> ssh started: ditto.  Start exim4, repeat - still no hang.  Try to do
>> the same wget with md5sum /usr/bin/* at the same time from ssh session -
>> lost interrupt and a hang.  Actually, it wasn't even md5sum - tab
>> completion in bash has done it.
>>
>> Next experiment: boot, then
>> root@sparc64:~# service exim4 stop
>> Stopping MTA: exim4_listener.
>> root@sparc64:~# service ssh stop
>> Stopping OpenBSD Secure Shell server: sshd.
>> root@sparc64:~# mount -t ramfs none /tmp
>> root@sparc64:~# cd /tmp/
>> root@sparc64:/tmp# (sleep 2; md5sum /usr/bin/* >/dev/null) &
>> [1] 1126
>> root@sparc64:/tmp# wget http://ftp.us.debian.org/debian/pool/main/l/linux/linux_5.7~rc5.orig.tar.xz
>> --2020-05-18 20:17:18--  http://ftp.us.debian.org/debian/pool/main/l/linux/linux_5.7~rc5.orig.tar.xz
>> Resolving ftp.us.debian.org (ftp.us.debian.org)... 64.50.236.52, 64.50.233.100, 208.80.154.15, ...
>> Connecting to ftp.us.debian.org (ftp.us.debian.org)|64.50.236.52|:80... connected.
>> HTTP request sent, awaiting response... 200 OK
>> Length: 117279780 (112M) [application/x-xz]
>> Saving to: ‘linux_5.7~rc5.orig.tar.xz’
>>
>>          linux_5.7~   0%[                    ] 535.49K  2.58MB/s               [  142.491757] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>>        linux_5.7~rc   2%[                    ]   2.92M  4.85MB/s               [  142.815354] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> [  142.843435] enp2s1: Happy Meal out of receive descriptors, packet dropped.
>> ux_5.7~rc5.orig.tar   6%[>                   ]   7.18M  2.72MB/s               [  175.465117] ata1: lost interrupt (Status 0x50)
>> <hang>
>>
>> So it does look like hme alone is not enough, but it makes cmd64x lost interrupt happen
>> much faster.  Note that this time no tab completion, etc. had been involved - straight
>> reads (well, and atime touches) done by md5sum in background.
>>
>> No repeats of that iommu.c BUG_ON() so far...  Ideas?
> 
> Thanks for the reproducer, however I'm still scratching my head trying to reproduce
> the error here:
> 
> $ ./qemu-system-sparc64 -hda /home/build/src/qemu/image/sparc64/deb90.qcow2 -snapshot
> -m 1024 -nographic
> 
> root@debian90:~# uname -a
> Linux debian90 4.9.0-3-sparc64 #1 Debian 4.9.30-2 (2017-06-12) sparc64 GNU/Linux
> root@debian90:~# mount -t ramfs none /tmp
> root@debian90:~# cd /tmp/
> root@debian90:/tmp# (sleep 2; md5sum /usr/bin/* >/dev/null) &
> [1] 358
> _5.7~rc5.orig.tar.xzwget http://ftp.us.debian.org/debian/pool/main/l/linux/linux_
> --2020-05-21 20:02:49--
> http://ftp.us.debian.org/debian/pool/main/l/linux/linux_5.7~rc5.orig.tar.xz
> Resolving ftp.us.debian.org (ftp.us.debian.org)... 64.50.236.52, 64.50.233.100,
> 208.80.154.15, ...
> Connecting to ftp.us.debian.org (ftp.us.debian.org)|64.50.236.52|:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 117279780 (112M) [application/x-xz]
> Saving to: ‘linux_5.7~rc5.orig.tar.xz’
> 
> linux_5.7~rc5.orig. 100%[===================>] 111.85M  2.32MB/s    in 49s
> 
> 2020-05-21 20:03:38 (2.26 MB/s) - ‘linux_5.7~rc5.orig.tar.xz’ saved [117279780/117279780]
> 
> [1]+  Done                    ( sleep 2; md5sum /usr/bin/* > /dev/null )
> root@debian90:/tmp#
> 
> 
> Can you tell me a bit more about the host in terms of CPU and disk to help figure out
> what's going on?

One other thought I had is that somehow the IVEC IRQs are managing to be overwritten
on a faster host before being read by the guest. Does the following patch display the
FATAL message at the point where things hang?


diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c
index fae20ee97c..618ebd1300 100644
--- a/hw/pci-host/sabre.c
+++ b/hw/pci-host/sabre.c
@@ -63,6 +63,9 @@
 static inline void sabre_set_request(SabreState *s, unsigned int irq_num)
 {
     trace_sabre_set_request(irq_num);
+    if (s->irq_request != 0 && s->irq_request != NO_IRQ_REQUEST) {
+        fprintf(stderr, "FATAL: still waiting for IRQ %x, now %x\n", s->irq_request,
irq_num);
+    }
     s->irq_request = irq_num;
     qemu_set_irq(s->ivec_irqs[irq_num], 1);
 }


ATB,

Mark.



[Index of Archives]     [Kernel Development]     [DCCP]     [Linux ARM Development]     [Linux]     [Photo]     [Yosemite Help]     [Linux ARM Kernel]     [Linux SCSI]     [Linux x86_64]     [Linux Hams]

  Powered by Linux