> From: hch@xxxxxx [mailto:hch@xxxxxx]
> Sent: Tuesday, February 14, 2017 22:51
> To: Dexuan Cui <decui@xxxxxxxxxxxxx>
> Cc: hch@xxxxxx; Jens Axboe <axboe@xxxxxxxxx>; Bart Van Assche
> <Bart.VanAssche@xxxxxxxxxxx>; hare@xxxxxxxx; hare@xxxxxxx; Martin K.
> Petersen <martin.petersen@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-block@xxxxxxxxxxxxxxx; jth@xxxxxxxxxx; Nick Meier
> <Nick.Meier@xxxxxxxxxxxxx>; Alex Ng (LIS) <alexng@xxxxxxxxxxxxx>; Long Li
> <longli@xxxxxxxxxxxxx>; Adrian Suhov (Cloudbase Solutions SRL)
> <v-adsuho@xxxxxxxxxxxxx>; Chris Valean (Cloudbase Solutions SRL)
> <v-chvale@xxxxxxxxxxxxx>
> Subject: Re: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock
> when scheduling workqueue elements")
>
> On Tue, Feb 14, 2017 at 02:46:41PM +0000, Dexuan Cui wrote:
> > > From: hch@xxxxxx [mailto:hch@xxxxxx]
> > > Sent: Tuesday, February 14, 2017 22:29
> > > To: Dexuan Cui <decui@xxxxxxxxxxxxx>
> > > Subject: Re: Boot regression (was "Re: [PATCH] genhd: Do not hold event
> > > lock when scheduling workqueue elements")
> > >
> > > Ok, thanks for testing. Can you try the patch below? It fixes a
> > > clear problem which was partially papered over before the commit
> > > you bisected to, although it can't explain why blk-mq still works.
> >
> > Still bad luck. :-(
> >
> > BTW, I'm using the first "bad" commit (scsi: allocate scsi_cmnd structures
> > as part of struct request) + the 2 patches you provided today.
> >
> > I suppose I don't need to test the 2 patches on the latest linux-next repo.
>
> I'd love a test on that repo actually. We had a few other for sense
> handling since then I think.

I tested today's linux-next (next-20170214) + the 2 patches just now and got
a weird result: sometimes the VM still hung with a new calltrace (BUG:
spinlock bad magic), but sometimes the VM did boot up despite the new
calltrace!

Attached is the log of a "good" boot.

It looks like we have a memory corruption issue somewhere...
Actually, I saw the "BUG: spinlock bad magic" message once before, but I
couldn't repro it later, so I didn't mention it to you. The good news is
that now I can repro the "spinlock bad magic" message every time.

I tried to dig into this by enabling Kernel hacking -> Memory debugging, but
didn't find anything abnormal.

Is it possible that the SCSI layer passes a wrong memory address?

Thanks,
-- Dexuan

Attachment:
dmesg.log