On Mon, Nov 27, 2017 at 10:11 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> On Mon, Nov 27, 2017 at 08:31:07AM -0600, Ashlie Martinez wrote:
>> Ted,
>>
>> Thank you very much for taking the time to lay all of this out for me
>> (and throwing some humor and youtube links to boot), despite how busy
>> you were (I hope everything is alright!). I see now why the fix works
>> and what was going wrong. It appears I was confused about the order of
>> operations being performed in the test based on what I read in another
>> email. I believe in another email somewhere I read that the fallocate
>> was before a delayed write so I was thinking something like fallocate
>> then write. I see now that it is write with delayed allocation
>> (resolved after fallocate) and then fallocate. With that piece of
>> information everything else about the test, delayed allocation, and
>> the fix make sense.
>
> Sorry, "before" was misleading. When I used the word "before", I was
> speaking of the order that the operations hit the disk. The confusion
> comes from the fact that the delayed allocation write was *issued*
> before the fallocate, but in terms of when they are committed to disk,
> the fallocate commits *first*, and then 25-30 seconds later, the
> delayed allocation write is resolved and then committed to disk.

No biggie, part of the reason this was so hard for me to wrap my head
around is I don't have a physical machine that I can reproduce this on
(and I never got around to getting a GCE instance to test on). Not
being able to poke around a reproducing system makes it a little bit
harder for me to reason about :)

>
> It's the difference between the order that the operations are issued
> and when they are committed to disk which is what caused the bug; and
> the problem reproduction relies on crashing/aborting the file system
> between the time that the two operations would have been committed.
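
To make the issue-order versus commit-order distinction concrete, here
is a minimal sketch in Python. It is only an illustrative model, not
the actual test and not anything from CrashMonkey: the Op record, the
reordered_crash_windows helper, and the timestamps are all
hypothetical, apart from the roughly 30 second gap described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    issued_at: float     # seconds: when the application issued the call
    committed_at: float  # seconds: when the result became durable on disk


def reordered_crash_windows(ops):
    """Find pairs of ops whose commit order differs from their issue order.

    Returns (earlier_issued, later_issued, (start, end)) tuples. A crash
    inside (start, end) leaves the later-issued op on disk while the
    earlier-issued op is still missing.
    """
    windows = []
    by_issue = sorted(ops, key=lambda op: op.issued_at)
    for i, first in enumerate(by_issue):
        for second in by_issue[i + 1:]:
            if second.committed_at < first.committed_at:
                window = (second.committed_at, first.committed_at)
                windows.append((first.name, second.name, window))
    return windows


# Hypothetical timeline: the write is issued first but, because of
# delayed allocation, reaches the disk about 30 seconds after the
# fallocate does.
ops = [
    Op("delalloc_write", issued_at=0.0, committed_at=35.0),
    Op("fallocate", issued_at=1.0, committed_at=5.0),
]

windows = reordered_crash_windows(ops)
for first, second, (start, end) in windows:
    print(f"crash between t={start}s and t={end}s: "
          f"{second} is on disk, {first} is not")
```

A tool that records both timestamps for each operation could use a
check like this to choose crash points that fall inside such a window.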
>
> Hopefully this will be helpful in terms of finding a way to create
> automated file system testing systems that can detect bugs similar to
> this one. I can imagine that if you ever want to extend this to
> database testing, a similar technique might be used to detect
> transactions which close in a different order than how they were
> issued, or dealing transactions which end up getting rolled back.
>

Vijay and I are hopeful that we can find some reliable way to
reproduce this in CrashMonkey. It has also shown us a class of timing
bugs that we can't find with the current iteration of CrashMonkey, but
we hope we can expand what we have to find them in the future.

> - Ted
>
> P.S. I see you have some Google internships under your belt, so I'm
> sure you know the drill, but I hope you'll consider us for another
> future internship experience. :-)

Haha it's always been nice to be a little bit spoiled while interning
there for a summer. I hope I can make my way back there for another
internship etc. eventually :)