Re: CephFS "corruption" -- Nulled bytes

On Wed, 7 Oct 2015, Adam Tygart wrote:
> Does this patch fix files that have been corrupted in this manner?

Nope, it'll only prevent it from happening to new files (that haven't yet 
been migrated between the cache and base tier).

> If not, or I guess even if it does, is there a way to walk the
> metadata and data pools and find objects that are affected?

Hmm, this may actually do the trick: find a file that appears to be 
zeroed, and truncate it up and then down again.  For example, if foo is 
100 bytes, do

 truncate --size 101 foo
 truncate --size 100 foo

then unmount and remount the client and see if the content reappears.

Assuming that works (it did in my simple test), it'd be pretty easy to 
write something that walks the tree and does the truncate trick for any 
file whose first however many bytes are 0 (though it will mess up 
mtime...).
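
Something along these lines might serve as a starting point -- an untested 
sketch; adjust the mount point and the number of leading bytes to check 
(and note the mtime caveat above):

 #!/bin/bash
 # Walk a CephFS tree and re-truncate any file whose leading bytes are all NUL.
 ROOT=/mnt/cephfs    # adjust to your CephFS mount point
 CHECK=16            # how many leading bytes to test for NUL
 find "$ROOT" -type f -size +0c -print0 | while IFS= read -r -d '' f; do
     # skip files that have at least one non-NUL byte up front
     head -c "$CHECK" "$f" | tr -d '\0' | grep -q . && continue
     sz=$(stat -c %s "$f")
     truncate --size $((sz + 1)) "$f"   # grow by one byte...
     truncate --size "$sz" "$f"         # ...then shrink back
 done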

> Is that '_' xattr in hammer? If so, how can I access it? Doing a
> listxattr on the inode just lists 'parent', and doing the same on the
> parent directory's inode simply lists 'parent'.

This is an xattr on the object file in /var/lib/ceph/osd/ceph-NNN/current.  For example,

$ attr -l ./3.0_head/10000000000.00000000__head_F0B56F30__3
Attribute "cephos.spill_out" has a 2 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
Attribute "cephos.seq" has a 23 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
Attribute "ceph._" has a 250 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
Attribute "ceph._@1" has a 5 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
Attribute "ceph.snapset" has a 31 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3

...but hopefully you won't need to touch any of that ;)
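
If you do end up needing it, something like this rough sketch (using the 
example object above; directory and object names will differ on your OSDs) 
would pull the object_info_t out and decode it:

 cd /var/lib/ceph/osd/ceph-NNN/current/3.0_head
 # concatenate the '_' xattr with any '_@1', '_@2', ... continuation fragments
 attr -q -g ceph._ 10000000000.00000000__head_F0B56F30__3 > /tmp/oi
 attr -q -g ceph._@1 10000000000.00000000__head_F0B56F30__3 >> /tmp/oi
 ceph-dencoder type object_info_t import /tmp/oi decode dump_json | grep truncate_seq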

sage


> 
> Thanks for your time.
> 
> --
> Adam
> 
> 
> On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Mon, 5 Oct 2015, Adam Tygart wrote:
> >> Okay, this has happened several more times. Always seems to be a small
> >> file that should be read-only (perhaps simultaneously) on many
> >> different clients. It is just through the cephfs interface that the
> >> files are corrupted, the objects in the cachepool and erasure coded
> >> pool are still correct. I am beginning to doubt these files are
> >> getting a truncation request.
> >
> > This is still consistent with the #12551 bug.  The object data is correct,
> > but the cephfs truncation metadata on the object is wrong, causing it to
> > be implicitly zeroed out on read.  It's easily triggered by writers who
> > use O_TRUNC on open...
> >
> >> Twice now have been different perl files, once was someones .bashrc,
> >> once was an input file for another application, timestamps on the
> >> files indicate that the files haven't been modified in weeks.
> >>
> >> Any other possibilities? Or any way to figure out what happened?
> >
> > You can confirm by extracting the '_' xattr on the object (append any @1
> > etc fragments) and feeding it to ceph-dencoder with
> >
> >  ceph-dencoder type object_info_t import <path_to_extracted_xattr> decode dump_json
> >
> > and confirming that truncate_seq is 0, and verifying that the truncate_seq
> > on the read request is non-zero... you'd need to turn up the osd logs with
> > debug ms = 1 and look for the osd_op that looks like "read 0~$length
> > [$truncate_seq@$truncate_size]" (with real values in there).
> >
> > ...but it really sounds like you're hitting the bug.  Unfortunately
> > the fix is not backported to hammer just yet.  You can follow
> >         http://tracker.ceph.com/issues/13034
> >
> > sage
> >
> >
> >
> >>
> >> --
> >> Adam
> >>
> >> On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart <mozes@xxxxxxx> wrote:
> >> > I've done some digging into cp and mv's semantics (from coreutils). If
> >> > the inode is existing, the file will get truncated, then data will get
> >> > copied in. This is definitely within the scope of the bug above.
> >> >
> >> > --
> >> > Adam
> >> >
> >> > On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart <mozes@xxxxxxx> wrote:
> >> >> It may have been. Although the timestamp on the file was almost a
> >> >> month ago. The typical workflow for this particular file is to copy an
> >> >> updated version overtop of it.
> >> >>
> >> >> i.e. 'cp qss kstat'
> >> >>
> >> >> I'm not sure if cp semantics would keep the same inode and simply
> >> >> truncate/overwrite the contents, or if it would do an unlink and then
> >> >> create a new file.
> >> >> --
> >> >> Adam
> >> >>
> >> >> On Fri, Sep 25, 2015 at 8:00 PM, Ivo Jimenez <ivo@xxxxxxxxxxx> wrote:
> >> >>> Looks like you might be experiencing this bug:
> >> >>>
> >> >>>   http://tracker.ceph.com/issues/12551
> >> >>>
> >> >>> Fix has been merged to master and I believe it'll be part of infernalis. The
> >> >>> original reproducer involved truncating/overwriting files. In your example,
> >> >>> do you know if 'kstat' has been truncated/overwritten prior to generating
> >> >>> the md5sums?
> >> >>>
> >> >>> On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart <mozes@xxxxxxx> wrote:
> >> >>>>
> >> >>>> Hello all,
> >> >>>>
> >> >>>> I've run into some sort of bug with CephFS. Client reads of a
> >> >>>> particular file return nothing but 40KB of Null bytes. Doing a rados
> >> >>>> level get of the inode returns the whole file, correctly.
> >> >>>>
> >> >>>> Tested via Linux 4.1, 4.2 kernel clients, and the 0.94.3 fuse client.
> >> >>>>
> >> >>>> Attached is a dynamic printk debug of the ceph module from the linux
> >> >>>> 4.2 client while cat'ing the file.
> >> >>>>
> >> >>>> My current thought is that there has to be a cache of the object
> >> >>>> *somewhere* that a 'rados get' bypasses.
> >> >>>>
> >> >>>> Even on hosts that have *never* read the file before, it is returning
> >> >>>> Null bytes from the kernel and fuse mounts.
> >> >>>>
> >> >>>> Background:
> >> >>>>
> >> >>>> 24x CentOS 7.1 hosts serving up RBD and CephFS with Ceph 0.94.3.
> >> >>>> CephFS is an EC k=8, m=4 pool with a size 3 writeback cache in front of it.
> >> >>>>
> >> >>>> # rados -p cachepool get 10004096b95.00000000 /tmp/kstat-cache
> >> >>>> # rados -p ec84pool get 10004096b95.00000000 /tmp/kstat-ec
> >> >>>> # md5sum /tmp/kstat*
> >> >>>> ddfbe886420a2cb860b46dc70f4f9a0d  /tmp/kstat-cache
> >> >>>> ddfbe886420a2cb860b46dc70f4f9a0d  /tmp/kstat-ec
> >> >>>> # file /tmp/kstat*
> >> >>>> /tmp/kstat-cache: Perl script, ASCII text executable
> >> >>>> /tmp/kstat-ec:    Perl script, ASCII text executable
> >> >>>>
> >> >>>> # md5sum ~daveturner/bin/kstat
> >> >>>> 1914e941c2ad5245a23e3e1d27cf8fde  /homes/daveturner/bin/kstat
> >> >>>> # file ~daveturner/bin/kstat
> >> >>>> /homes/daveturner/bin/kstat: data
> >> >>>>
> >> >>>> Thoughts?
> >> >>>>
> >> >>>> Any more information you need?
> >> >>>>
> >> >>>> --
> >> >>>> Adam
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


