On Wed, 19 Jun 2013, Dave Chinner wrote: > On Wed, Jun 12, 2013 at 10:02:52AM -0700, Sage Weil wrote: > > Hi guys, > > > > I reproduced this on two more boxes and have more data. The full set of > > notes/logs is at > > > > http://newdream.net/~sage/bug-4976/notes.txt > > case c: > > /var/lib/ceph/osd/ceph-0/current/3.0_head/DIR_9/plana8021941-457__head_834C52B9__3: > 0: [0..7]: 244206232..244206239 > 1: [8..1351]: hole > 2: [1352..2431]: 244252824..244253903 > 3: [2432..3255]: hole > 4: [3256..4855]: 244254728..244256327 > > bad data starts at offset 1179648: > > Which lies within the middle of an allocated extent (offset 2304bb). > > IIUC, then there was a write at offset 700466 for 445465 bytes, > (i.e start @ 1368bb, end @ 2239bb), but given the block count of > the file didn't change, this must have been an overwrite of existing > data. It's well within EOF, too, so it's not clear what has happened > here - the bad data was not written by the ceph journal replay, and > the extent was already allocated on disk... > > case d: > > 0: [0..7]: 732632192..732632199 > 1: [8..783]: hole > 2: [784..2943]: 733513808..733515967 > 3: [2944..3511]: hole > 4: [3512..4703]: 733516536..733517727 > > > INFO:teuthology.task.rados.rados.0.out:incorrect buffer at pos 1179648 (((same offset as previous case!!!)) > > So, similar layout, we have a write at offset 404703 for 773518 > bytes (start @790bb, end @2302). Ok, that makes a little more sense > to have corruption starting @ 2304. The pages in the page cache that > cover this write would cover offsets 784bb to 2303bb, having zeroed > the head and tail sections. > > The file size and number of blocks didn't change again and it's > well within the file size, so this is also an overwrite without > allocat. IOWs the overwrite ended at 2303bb, with corruption > starting at 2304bb. > > Let's play a "what-if" game. there are 3 writes to the file: > > -@ 0 for 57 bytes (1FSB) > -@ 404703 for 773518 bytes ( > -@ 1801584 for 603119 bytes > > These translate to the following regions in BB: > > - [0..1] > - [790..2302] > - [3518..4697] > > If we round these to filesystem blocks, we have: > > - [0..7] > - [784..2303] > - [3512..4703] > > And in filesystem blocks: > > Should have Actually have > ----------- ------------ > - [0..1] - [0..1] > - [98..287] - [98..367] > - [439..587] - [439..587] > > So there's an extra 80 filesystem blocks in the middle extent. Does > that tally with the block count of 3360bb = 420fsb? 1 + 190 + 149 = > 340fsb. it doesn't, so there's another 60fsb beyond EOF. > > And that's pretty telling. There's non-zero'd speculative > preallocation there. The *big* question is this: why was the file > size extented (i.e. user data IO completed) and updated and the > file size update logged to the journal *before* the zeroed regions > of the file had been written to zero? > > And why only on 3.8/3.9? I haven't gone back and tested on older kernels. It's possible this isn't a new problem and previous failures were lost in the noise. :/ > > Any ideas? > > I'd love to see the code that is writing the files in the first > place - it's not playing silly buggers with sync_file_range(), is > it? Actually, it is! ... ::sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE); ... int fa_r = posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED); ...and a call to close(2). The intent was just to push things out of the disk so that a syncfs() that comes later will take less time. I wouldn't have expected this to change the commit ordering semantics, though; only to initiate writeback sooner. (Coincidentally, all of this code was just recently rewritten to do a simple fsync. I haven't retested the powercycling since then, though.) sage _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs