On Tue, 19 Feb 2019 21:49:07 -0500 Eric Farman <farman@xxxxxxxxxxxxx> wrote: > Hi Connie, Farhan, > > On 02/04/2019 12:06 PM, Cornelia Huck wrote: > > From: Farhan Ali <alifm@xxxxxxxxxxxxx> > > > > When trying to calculate the length of a ccw chain, we assume > > there are ccws after a TIC. This can lead to overcounting and > > copying garbage data from guest memory. > > > > Signed-off-by: Farhan Ali <alifm@xxxxxxxxxxxxx> > > Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@xxxxxxxxxxxxx> > > Reviewed-by: Halil Pasic <pasic@xxxxxxxxxxxxx> > > Signed-off-by: Cornelia Huck <cohuck@xxxxxxxxxx> > > --- > > drivers/s390/cio/vfio_ccw_cp.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c > > index 70a006ba4d05..ba08fe137c2e 100644 > > --- a/drivers/s390/cio/vfio_ccw_cp.c > > +++ b/drivers/s390/cio/vfio_ccw_cp.c > > @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp) > > return -EOPNOTSUPP; > > } > > > > - if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw))) > > + if (!ccw_is_chain(ccw)) > > break; > > > > ccw++; > > > > Sigh. Let me state up front that I'm running vanilla 5.0-rc7 host > kernel with this patch applied (or not), and QEMU from a week or two ago. > > I stated in another thread that this patch makes things better when > running fio and other random exercisers within my guest. But I was > always booting off a virtio-blk disk when I did that, and using some > additional disks connected by vfio-ccw. Today I discovered that this > patch prevents guest boot from vfio-ccw (which was working previously): > > /usr/local/bin/qemu-system-s390x -machine s390-ccw-virtio,accel=kvm -smp > 4 -m 1024 -nographic -s -net none -device > vfio-ccw,sysfsdev=/sys/class/mdev_bus/0.0.0cbb/9503d38c-b83a-45b1-a2f0-393739817786,devno=fe.0.0e1e,force-orb-pfch=true,bootindex=0 > -bios /usr/src/qemu/build/pc-bios/s390-ccw/s390-ccw.img > LOADPARM=[ ] > vfio-ccw device I/O error - Interrupt Response Block Data: > Function Ctrl : [Start] > Activity Ctrl : > Status Ctrl : [Alert] [Primary] [Secondary] [Status-Pending] > Device Status : [Channel-End] [Device-End] > Channel Status : [Incorrect-Length] > cpa=: 0x0000000000000018 > prev_ccw=: 0x3100003840000008 > this_ccw=: 0x0800001000000000 > Eckd Dasd Sense Data (fmt 32-bytes): > Sense Condition Flags : > Residual Count =: 0x0000000000000000 > Phys Drive ID =: 0x0000000000000000 > low cyl address =: 0x0000000000000000 > head addr & hi cyl =: 0x0000000000000000 > format/message =: 0x0000000000000000 > fmt-dependent[0-7] =: 0x0000000000000000 > fmt-dependent[8-15]=: 0x0000000005160f00 > prog action code =: 0x0000000000000000 > Configuration info =: 0x00000000000040e0 > mcode / hi-cyl =: 0x0000000000000000 > cyl & head addr [0]=: 0x0000000000000000 > cyl & head addr [1]=: 0x0000000000000000 > cyl & head addr [2]=: 0x0000000000000000 > ...snip... > > Reverting this patch allows the boot to proceed normally. So, ugh. Agreed on the 'ugh' :( > > I'm not going to claim to know exactly how this is failing, because it's > too late for coffee and reading IPL records is difficult enough. But in > the working case (without this patch), cp_init() calls > ccwchain_calc_length() and gets a count of four CCWs (SEEK + SIDE + TIC > + READ), which is broken up into two chains. One for the SEEK/SIDE/TIC, > and another for the READ. In the failing case (with this patch), > cp_init() only counts three CCWs, we never read the IPL2 address (the > second chain in the working case), and our boot fails. > > Unsure what the next best move is here. It is possible that the > (ill-used) check that Farhan noticed and removed was actually added to > "make boot work" when handling recursive TIC CCWs such as this. But I > still agree that this patch is correct, even though it exposes more > problems with our TIC handling throughout this code at large. Any thoughts? Yes, there's probably something rotten in our TIC handling... Just to make things clear: This patch makes doing I/O on a booted system more stable, but makes the not-yet-merged QEMU bios boot code fail, right? I'll guess I'll stare at the ccw translation code for a bit and see if something jumps out at me...