On Thu, Jul 21, 2016 at 11:43:08PM +0100, Mark Brown wrote: > On Thu, Jul 21, 2016 at 05:40:15PM -0400, Rich Felker wrote: > > > The spi master device has a "bus lock flag" that's used in lieu of a > > recursive mutex, but it seems to be used entirely incorrectly. The > > logic I'm seeing in spi.c is essentially: > > I am entirely unclear why you believe there is a recursive mutex here, > that's not the purpose of the spi_bus_lock() API at all. I consider the "if not yet locked, take lock" pattern logically akin to a recursive mutex, but apologies if the language was imprecise. > > if (some_task_has_spi_bus_locked) > > use_spi_master_without_locking(); > > > There seems to be no check that the task that's using the master > > without locking is the same one that locked it! > > That's because there is no requirement that something that takes a bus > lock be executing from a single context. It is just granting exclusive > access to the bus, the caller is free to do those accesses in any manner > it sees fit. Well then what exactly is the contract for who can access it and who can't? What I'm experiencing is that, when the driver for one device on the spi bus has the bus locked, other devices access the bus with no synchronization at all, resulting in serious data corruption. The media-change-detect probe also does this even with a single mmc device on the bus. Reverting commits 24c8cd1b0812, 49023d2e4ead, and 556351f14e74 (the first two revert cleanly; the latter I did by hand) makes the problem go away. > > Also, master->bus_lock_flag is modified completely unsynchonized from > > read access to master->bus_lock_flag. There is no locking of > > master->bus_lock_mutex or master->bus_lock_spinlock around the read > > accesses to master->bus_lock_flag, and in one place the write (of 0) > > to master->bus_lock_flag takes place without holding the spinlock. > > The critical read access is in spi_async where we exclude users that > don't hold the lock which *is* spinlocked; we should probably hold the > spinlock on release but it's not so important since we know that we hold > the lock when we call it and the mutex ends up providing protection. > spi_sync() shouldn't be reading the flag in the first place as far as I > can tell, it should unconditionally have the flag clear as locked > callers have to use spi_sync_locked() but that will give us a fast path > that doesn't check for the bus lock which starts to reveal the > underlying issue. Ah, in that case maybe the actual bug is just one line of commit 24c8cd1b0812? Reverting just that also seems to make the problem go away. > This underlying issue is that we are trying to use one mutex for two > purposes, the existing mutex is mainly being used to serialize access to > the physical bus but the externally visible bus lock is there to ensure > that only one caller is able to queue new transfers which is not the > same thing at all. It is completely irrelevant when we are pushing > messages to the bus if we are doing so with the bus lock held. OK. I didn't understand what synchronization prevents messages from multiple drivers/sources from getting pushed at the same time, but apparently the problem was just that spi_sync was behaving like spi_sync_locked when the bus was already locked. > > The obvious fix would be using an actual recursive mutex to obtain > > recursive mutex semantics. > > This would be broken since we need to be able to start transfers from > atomic context. *Nod*, OK. Rich -- To unsubscribe from this list: send the line "unsubscribe linux-mmc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html