Not thoroughly tested, but I've got a quick and dirty script to fix
these up. Worst case scenario, it does nothing. In my limited testing,
the contents of the files come back without a remount of cephfs.

https://github.com/BeocatKSU/admin/blob/master/ec_cephfs_fixer.py

--
Adam

On Thu, Oct 8, 2015 at 11:11 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
> Hi Sage,
>
> Will this patch be in 0.94.4? We've got the same problem here.
>
> -Lincoln
>
>> On Oct 8, 2015, at 12:11 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>
>> On Wed, 7 Oct 2015, Adam Tygart wrote:
>>> Does this patch fix files that have been corrupted in this manner?
>>
>> Nope, it'll only prevent it from happening to new files (that haven't yet
>> been migrated between the cache and base tier).
>>
>>> If not, or I guess even if it does, is there a way to walk the
>>> metadata and data pools and find objects that are affected?
>>
>> Hmm, this may actually do the trick: find a file that appears to be
>> zeroed, and truncate it up and then down again. For example, if foo is
>> 100 bytes, do
>>
>> truncate --size 101 foo
>> truncate --size 100 foo
>>
>> then unmount and remount the client and see if the content reappears.
>>
>> Assuming that works (it did in my simple test), it'd be pretty easy to
>> write something that walks the tree and does the truncate trick for any
>> file whose first however many bytes are 0 (though it will mess up
>> mtime...).
>>
>>> Is that '_' xattr in hammer? If so, how can I access it? Doing a
>>> listxattr on the inode just lists 'parent', and doing the same on the
>>> parent directory's inode simply lists 'parent'.
>>
>> This is the file in /var/lib/ceph/osd/ceph-NNN/current. For example,
>>
>> $ attr -l ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "cephos.spill_out" has a 2 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "cephos.seq" has a 23 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "ceph._" has a 250 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "ceph._@1" has a 5 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "ceph.snapset" has a 31 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>>
>> ...but hopefully you won't need to touch any of that ;)
>>
>> sage
>>
>>> Thanks for your time.
>>>
>>> --
>>> Adam
>>>
>>> On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>> On Mon, 5 Oct 2015, Adam Tygart wrote:
>>>>> Okay, this has happened several more times. It always seems to be a small
>>>>> file that should be read-only (perhaps read simultaneously) on many
>>>>> different clients. It is just through the cephfs interface that the
>>>>> files are corrupted; the objects in the cachepool and erasure-coded
>>>>> pool are still correct. I am beginning to doubt these files are
>>>>> getting a truncation request.
>>>>
>>>> This is still consistent with the #12551 bug. The object data is correct,
>>>> but the cephfs truncation metadata on the object is wrong, causing it to
>>>> be implicitly zeroed out on read. It's easily triggered by writers who
>>>> use O_TRUNC on open...
>>>>
>>>>> Twice now it has been different perl files, once someone's .bashrc, and
>>>>> once an input file for another application; timestamps on the
>>>>> files indicate that they haven't been modified in weeks.
>>>>>
>>>>> Any other possibilities? Or any way to figure out what happened?
>>>>
>>>> You can confirm by extracting the '_' xattr on the object (append any @1
>>>> etc. fragments) and feeding it to ceph-dencoder with
>>>>
>>>> ceph-dencoder type object_info_t import <path_to_extracted_xattr> decode dump_json
>>>>
>>>> and confirming that truncate_seq is 0, and verifying that the truncate_seq
>>>> on the read request is non-zero. You'd need to turn up the osd logs with
>>>> debug ms = 1 and look for the osd_op that looks like "read 0~$length
>>>> [$truncate_seq@$truncate_size]" (with real values in there).
>>>>
>>>> ...but it really sounds like you're hitting the bug. Unfortunately
>>>> the fix is not backported to hammer just yet. You can follow
>>>> http://tracker.ceph.com/issues/13034
>>>>
>>>> sage
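For anyone who would rather script Sage's check than run it by hand, here is a
minimal, untested sketch (an editorial addition, not from the thread) that
reassembles the 'ceph._' xattr plus any '@1', '@2', ... fragments from an
object file under /var/lib/ceph/osd/ceph-NNN/current and feeds it to
ceph-dencoder exactly as described above. Only the attr and ceph-dencoder
invocations come from the thread; the helper names are made up.

#!/usr/bin/env python
# Sketch: dump object_info_t for an on-disk object file and report its
# truncate_seq, per Sage's ceph-dencoder instructions above.
import json
import subprocess
import sys
import tempfile

def get_xattr(path, name):
    # 'attr -q -g NAME FILE' prints the raw value of one attribute.
    return subprocess.check_output(['attr', '-q', '-g', name, path])

def object_info(path):
    # Reassemble the '_' xattr, appending any @1, @2, ... fragments;
    # attr fails on the first missing fragment, which ends the loop.
    blob = get_xattr(path, 'ceph._')
    i = 1
    while True:
        try:
            blob += get_xattr(path, 'ceph._@%d' % i)
        except subprocess.CalledProcessError:
            break
        i += 1
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(blob)
        tmp.flush()
        out = subprocess.check_output(
            ['ceph-dencoder', 'type', 'object_info_t',
             'import', tmp.name, 'decode', 'dump_json'])
    return json.loads(out)

if __name__ == '__main__':
    info = object_info(sys.argv[1])
    print('truncate_seq = %s' % info.get('truncate_seq'))

Run it against an object path like the one in the attr listing above, e.g.
./3.0_head/10000000000.00000000__head_F0B56F30__3. Per Sage, the case to look
for is an object whose truncate_seq decodes as 0 while the client's read
request carries a non-zero truncate_seq.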
>>>>>
>>>>> --
>>>>> Adam
>>>>>
>>>>> On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>>>> I've done some digging into cp and mv's semantics (from coreutils). If
>>>>>> the inode already exists, the file will get truncated, then data will get
>>>>>> copied in. This is definitely within the scope of the bug above.
>>>>>>
>>>>>> --
>>>>>> Adam
>>>>>>
>>>>>> On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>>>>> It may have been, although the timestamp on the file was almost a
>>>>>>> month ago. The typical workflow for this particular file is to copy an
>>>>>>> updated version over the top of it, i.e. 'cp qss kstat'.
>>>>>>>
>>>>>>> I'm not sure if cp semantics would keep the same inode and simply
>>>>>>> truncate/overwrite the contents, or if it would do an unlink and then
>>>>>>> create a new file.
>>>>>>> --
>>>>>>> Adam
>>>>>>>
>>>>>>> On Fri, Sep 25, 2015 at 8:00 PM, Ivo Jimenez <ivo@xxxxxxxxxxx> wrote:
>>>>>>>> Looks like you might be experiencing this bug:
>>>>>>>>
>>>>>>>> http://tracker.ceph.com/issues/12551
>>>>>>>>
>>>>>>>> The fix has been merged to master and I believe it'll be part of infernalis.
>>>>>>>> The original reproducer involved truncating/overwriting files. In your example,
>>>>>>>> do you know if 'kstat' has been truncated/overwritten prior to generating
>>>>>>>> the md5sums?
>>>>>>>>
>>>>>>>> On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart <mozes@xxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I've run into some sort of bug with CephFS. Client reads of a
>>>>>>>>> particular file return nothing but 40KB of null bytes. Doing a
>>>>>>>>> rados-level get of the inode returns the whole file, correctly.
>>>>>>>>>
>>>>>>>>> Tested via the Linux 4.1 and 4.2 kernel clients, and the 0.94.3 fuse client.
>>>>>>>>>
>>>>>>>>> Attached is a dynamic printk debug of the ceph module from the Linux
>>>>>>>>> 4.2 client while cat'ing the file.
>>>>>>>>>
>>>>>>>>> My current thought is that there has to be a cache of the object
>>>>>>>>> *somewhere* that a 'rados get' bypasses.
>>>>>>>>>
>>>>>>>>> Even on hosts that have *never* read the file before, it is returning
>>>>>>>>> null bytes from the kernel and fuse mounts.
>>>>>>>>>
>>>>>>>>> Background:
>>>>>>>>>
>>>>>>>>> 24x CentOS 7.1 hosts serving up RBD and CephFS with Ceph 0.94.3.
>>>>>>>>> CephFS is an EC k=8, m=4 pool with a size 3 writeback cache in front of it.
>>>>>>>>>
>>>>>>>>> # rados -p cachepool get 10004096b95.00000000 /tmp/kstat-cache
>>>>>>>>> # rados -p ec84pool get 10004096b95.00000000 /tmp/kstat-ec
>>>>>>>>> # md5sum /tmp/kstat*
>>>>>>>>> ddfbe886420a2cb860b46dc70f4f9a0d /tmp/kstat-cache
>>>>>>>>> ddfbe886420a2cb860b46dc70f4f9a0d /tmp/kstat-ec
>>>>>>>>> # file /tmp/kstat*
>>>>>>>>> /tmp/kstat-cache: Perl script, ASCII text executable
>>>>>>>>> /tmp/kstat-ec: Perl script, ASCII text executable
>>>>>>>>>
>>>>>>>>> # md5sum ~daveturner/bin/kstat
>>>>>>>>> 1914e941c2ad5245a23e3e1d27cf8fde /homes/daveturner/bin/kstat
>>>>>>>>> # file ~daveturner/bin/kstat
>>>>>>>>> /homes/daveturner/bin/kstat: data
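An editorial aside, not part of the original report: the manual comparison
above is easy to automate per file. Below is a rough, untested sketch that
md5sums a file as read through the CephFS mount and compares it against the
first backing object fetched with 'rados get'. It assumes the usual CephFS
object naming of <inode in hex>.00000000, as with 10004096b95.00000000 above,
and that the file fits in a single object, which holds for the 40KB example
here; the function names are illustrative.

#!/usr/bin/env python
# Sketch: flag files whose CephFS read does not match the rados-level object.
import hashlib
import os
import subprocess
import sys
import tempfile

def cephfs_md5(path):
    # What clients actually see through the mount.
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def rados_md5(pool, path):
    # First data object for a CephFS file: <inode in hex>.00000000.
    obj = '%x.00000000' % os.stat(path).st_ino
    with tempfile.NamedTemporaryFile() as tmp:
        subprocess.check_call(['rados', '-p', pool, 'get', obj, tmp.name])
        with open(tmp.name, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()

if __name__ == '__main__':
    pool, path = sys.argv[1], sys.argv[2]
    via_fs = cephfs_md5(path)
    via_rados = rados_md5(pool, path)
    print('cephfs: %s' % via_fs)
    print('rados:  %s' % via_rados)
    if via_fs != via_rados:
        print('MISMATCH: %s looks affected' % path)

A mismatch (typically an all-zero read through the mount) would mark the file
as a candidate for the truncate trick discussed earlier in the thread.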
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> Any more information you need?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Adam
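Finally, for anyone landing on this thread with the same symptoms, here is a
rough illustration of the walk-and-truncate repair Sage describes above. It is
not the ec_cephfs_fixer.py script linked at the top of the thread, just an
untested sketch of the same general idea: walk a tree, find files whose
leading bytes read back as zeros, and grow then shrink each one by a byte. As
noted above it will disturb mtimes, and depending on your client you may or
may not need to remount before the contents reappear; try it on expendable
data first.

#!/usr/bin/env python
# Sketch: apply the truncate trick to every file under a directory whose
# first bytes read back as NULs through CephFS.
import os
import sys

CHECK_BYTES = 4096  # how much of the head of each file to inspect

def looks_zeroed(path):
    size = os.path.getsize(path)
    if size == 0:
        return False
    with open(path, 'rb') as f:
        head = f.read(min(size, CHECK_BYTES))
    return head.count(b'\0') == len(head)

def truncate_trick(path):
    # Equivalent of 'truncate --size N+1' followed by 'truncate --size N'.
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        f.truncate(size + 1)
        f.truncate(size)

if __name__ == '__main__':
    for root, dirs, files in os.walk(sys.argv[1]):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue
            if looks_zeroed(path):
                print('fixing %s' % path)
                truncate_trick(path)

Invocation would be something like 'python fix_zeroed_files.py /homes' run on
a CephFS mount with enough privileges to open the affected files read-write;
the script name and the 4096-byte threshold are arbitrary choices, not
anything from the thread.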