On Wed, 19 Jun 2013, Dave Chinner wrote:
> On Tue, Jun 18, 2013 at 08:12:01PM -0700, Sage Weil wrote:
> > On Wed, 19 Jun 2013, Dave Chinner wrote:
> > > On Wed, Jun 12, 2013 at 10:02:52AM -0700, Sage Weil wrote:
> > > > Hi guys,
> > > >
> > > > I reproduced this on two more boxes and have more data.  The full
> > > > set of notes/logs is at
> > > >
> > > >   http://newdream.net/~sage/bug-4976/notes.txt
> > >
> > > case c:
> > >
> > > /var/lib/ceph/osd/ceph-0/current/3.0_head/DIR_9/plana8021941-457__head_834C52B9__3:
> > >  0: [0..7]:       244206232..244206239
> > >  1: [8..1351]:    hole
> > >  2: [1352..2431]: 244252824..244253903
> > >  3: [2432..3255]: hole
> > >  4: [3256..4855]: 244254728..244256327
> > >
> > > bad data starts at offset 1179648:
> > >
> > > Which lies within the middle of an allocated extent (offset 2304bb).
> > >
> > > IIUC, then there was a write at offset 700466 for 445465 bytes
> > > (i.e. start @ 1368bb, end @ 2239bb), but given the block count of
> > > the file didn't change, this must have been an overwrite of
> > > existing data. It's well within EOF, too, so it's not clear what
> > > has happened here - the bad data was not written by the ceph
> > > journal replay, and the extent was already allocated on disk...
> > >
> > > case d:
> > >
> > >  0: [0..7]:       732632192..732632199
> > >  1: [8..783]:     hole
> > >  2: [784..2943]:  733513808..733515967
> > >  3: [2944..3511]: hole
> > >  4: [3512..4703]: 733516536..733517727
> > >
> > > INFO:teuthology.task.rados.rados.0.out:incorrect buffer at pos 1179648
> > > ((same offset as the previous case!!))
> > >
> > > So, similar layout: we have a write at offset 404703 for 773518
> > > bytes (start @ 790bb, end @ 2302bb). Ok, that makes a little more
> > > sense for corruption starting @ 2304bb. The pages in the page
> > > cache that cover this write would span offsets 784bb to 2303bb,
> > > having zeroed the head and tail sections.
> > >
> > > The file size and number of blocks didn't change again, and it's
> > > well within the file size, so this is also an overwrite without
> > > allocation. IOWs, the overwrite ended at 2303bb, with corruption
> > > starting at 2304bb.
> > >
> > > Let's play a "what-if" game. There are 3 writes to the file:
> > >
> > > - @ 0 for 57 bytes (1 FSB)
> > > - @ 404703 for 773518 bytes
> > > - @ 1801584 for 603119 bytes
> > >
> > > These translate to the following regions in BB:
> > >
> > > - [0..1]
> > > - [790..2302]
> > > - [3518..4697]
> > >
> > > If we round these to filesystem blocks, we have:
> > >
> > > - [0..7]
> > > - [784..2303]
> > > - [3512..4703]
> > >
> > > And in filesystem blocks:
> > >
> > > Should have     Actually have
> > > -----------     -------------
> > > - [0..1]        - [0..1]
> > > - [98..287]     - [98..367]
> > > - [439..587]    - [439..587]
> > >
> > > So there's an extra 80 filesystem blocks in the middle extent.
> > > Does that tally with the block count of 3360bb = 420fsb?
> > > 1 + 190 + 149 = 340fsb. It doesn't, so there's another 60fsb
> > > beyond EOF.
> > >
> > > And that's pretty telling. There's non-zeroed speculative
> > > preallocation there. The *big* question is this: why was the file
> > > size extended (i.e. user data IO completed) and updated, and the
> > > file size update logged to the journal, *before* the zeroed
> > > regions of the file had been written to zero?
> > >
> > > And why only on 3.8/3.9?
> >
> > I haven't gone back and tested on older kernels. It's possible this
> > isn't a new problem and previous failures were lost in the noise. :/
> >
> > > > Any ideas?
> > >
> > > I'd love to see the code that is writing the files in the first
> > > place - it's not playing silly buggers with sync_file_range(), is
> > > it?
> >
> > Actually, it is!
> >
> >   ...
> >   ::sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
> >   ...
> >   int fa_r = posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> >
> > ...and a call to close(2).
> <sigh>
>
> > The intent was just to push things out to the disk so that a
> > syncfs() that comes later will take less time. I wouldn't have
> > expected this to change the commit ordering semantics, though; only
> > to initiate writeback sooner.
>
> It initiates writeback on certain ranges of the file before others,
> and therefore can expose issues related to the ordering of writeback.
> It's made worse by the fact that, by definition, both
> sync_file_range() and posix_fadvise(POSIX_FADV_DONTNEED) provide no
> data integrity guarantees. And, further, neither is vectored through
> the filesystem like fsync is, so the filesystem can't provide any
> integrity guarantees, either.

Sigh... well, that explains that. I realize that these calls don't
provide any guarantees themselves (hence the syncfs(2) call later), but
I did not expect them to break zeroing. (That sounds problematic from a
security perspective?) Is sync_file_range(2) similarly problematic with
ext4?

> Hence if you use range-based posix_fadvise(DONTNEED) and/or
> sync_file_range(), you can get real strange things happening, as data
> writeback ordering is fully under the control of the user, not the
> filesystem, and the user, in general, has no clue about what the
> filesystem is doing under the covers.
>
> > (Coincidentally, all of this code was just recently rewritten to do
> > a simple fsync. I haven't retested the powercycling since then,
> > though.)
>
> fsync should work just fine, as it does increasing-offset writeback,
> and that means the zeroed blocks will get written before the data
> blocks and so uninitialised data won't get exposed....

For the stable bugfix, fdatasync(2) is preferred; I assume that will do
the same increasing-offset writeback?

Thanks, Dave!
sage

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs