Does this patch fix files that have been corrupted in this manner? If not,
or I guess even if it does, is there a way to walk the metadata and data
pools and find objects that are affected?

Is that '_' xattr present in hammer? If so, how can I access it? Doing a
listxattr on the inode object just lists 'parent', and doing the same on
the parent directory's inode object likewise lists only 'parent'.

Thanks for your time.
--
Adam

On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Mon, 5 Oct 2015, Adam Tygart wrote:
>> Okay, this has happened several more times. It always seems to be a
>> small file that should be read-only (perhaps simultaneously) on many
>> different clients. It is only through the CephFS interface that the
>> files are corrupted; the objects in the cache pool and erasure-coded
>> pool are still correct. I am beginning to doubt these files are
>> getting a truncation request.
>
> This is still consistent with the #12551 bug. The object data is
> correct, but the cephfs truncation metadata on the object is wrong,
> causing it to be implicitly zeroed out on read. It's easily triggered
> by writers who use O_TRUNC on open...
>
>> Twice now it has been different perl files, once someone's .bashrc,
>> and once an input file for another application; timestamps on the
>> files indicate they haven't been modified in weeks.
>>
>> Any other possibilities? Or any way to figure out what happened?
>
> You can confirm by extracting the '_' xattr on the object (append any
> @1 etc. fragments) and feeding it to ceph-dencoder with
>
>   ceph-dencoder type object_info_t import <path_to_extracted_xattr> decode dump_json
>
> and confirming that truncate_seq is 0, and verifying that the
> truncate_seq on the read request is non-zero. You'd need to turn up
> the osd logs with 'debug ms = 1' and look for the osd_op that looks
> like "read 0~$length [$truncate_seq@$truncate_size]" (with real
> values in there).
>
> ...but it really sounds like you're hitting the bug.
> Unfortunately the fix is not backported to hammer just yet. You can
> follow http://tracker.ceph.com/issues/13034
>
> sage
>
>> --
>> Adam
>>
>> On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>> > I've done some digging into cp and mv's semantics (from coreutils).
>> > If the inode already exists, the file gets truncated, then the data
>> > gets copied in. This is definitely within the scope of the bug above.
>> >
>> > --
>> > Adam
>> >
>> > On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>> >> It may have been, although the timestamp on the file was almost a
>> >> month ago. The typical workflow for this particular file is to copy
>> >> an updated version over the top of it, i.e. 'cp qss kstat'.
>> >>
>> >> I'm not sure whether cp's semantics keep the same inode and simply
>> >> truncate/overwrite the contents, or whether it does an unlink and
>> >> then creates a new file.
>> >> --
>> >> Adam
>> >>
>> >> On Fri, Sep 25, 2015 at 8:00 PM, Ivo Jimenez <ivo@xxxxxxxxxxx> wrote:
>> >>> Looks like you might be experiencing this bug:
>> >>>
>> >>> http://tracker.ceph.com/issues/12551
>> >>>
>> >>> The fix has been merged to master and I believe it'll be part of
>> >>> infernalis. The original reproducer involved truncating/overwriting
>> >>> files. In your example, do you know if 'kstat' was truncated or
>> >>> overwritten prior to generating the md5sums?
>> >>>
>> >>> On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart <mozes@xxxxxxx> wrote:
>> >>>> Hello all,
>> >>>>
>> >>>> I've run into some sort of bug with CephFS. Client reads of a
>> >>>> particular file return nothing but 40KB of null bytes; doing a
>> >>>> rados-level get of the inode's object returns the whole file,
>> >>>> correctly.
>> >>>>
>> >>>> Tested via Linux 4.1 and 4.2 kernel clients, and the 0.94.3 fuse
>> >>>> client.
>> >>>>
>> >>>> Attached is a dynamic printk debug of the ceph module from the
>> >>>> Linux 4.2 client while cat'ing the file.
>> >>>>
>> >>>> My current thought is that there has to be a cache of the object
>> >>>> *somewhere* that a 'rados get' bypasses.
>> >>>>
>> >>>> Even on hosts that have *never* read the file before, it is
>> >>>> returning null bytes from the kernel and fuse mounts.
>> >>>>
>> >>>> Background:
>> >>>>
>> >>>> 24x CentOS 7.1 hosts serving up RBD and CephFS with Ceph 0.94.3.
>> >>>> CephFS is an EC k=8, m=4 pool with a size-3 writeback cache in
>> >>>> front of it.
>> >>>>
>> >>>> # rados -p cachepool get 10004096b95.00000000 /tmp/kstat-cache
>> >>>> # rados -p ec84pool get 10004096b95.00000000 /tmp/kstat-ec
>> >>>> # md5sum /tmp/kstat*
>> >>>> ddfbe886420a2cb860b46dc70f4f9a0d /tmp/kstat-cache
>> >>>> ddfbe886420a2cb860b46dc70f4f9a0d /tmp/kstat-ec
>> >>>> # file /tmp/kstat*
>> >>>> /tmp/kstat-cache: Perl script, ASCII text executable
>> >>>> /tmp/kstat-ec: Perl script, ASCII text executable
>> >>>>
>> >>>> # md5sum ~daveturner/bin/kstat
>> >>>> 1914e941c2ad5245a23e3e1d27cf8fde /homes/daveturner/bin/kstat
>> >>>> # file ~daveturner/bin/kstat
>> >>>> /homes/daveturner/bin/kstat: data
>> >>>>
>> >>>> Thoughts?
>> >>>>
>> >>>> Any more information you need?
>> >>>>
>> >>>> --
>> >>>> Adam

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
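The truncate_seq check Sage describes upthread can be scripted when many objects need inspecting. This is only a sketch, not a definitive tool: it assumes the hammer-era 'rados' and 'ceph-dencoder' binaries are on the PATH and the cluster is reachable; the pool and object names you pass in are your own, and the 'truncate_seq' field name is the one from the JSON dump mentioned above.

```python
import json
import os
import subprocess
import tempfile


def object_info(pool, obj):
    """Fetch an object's '_' xattr and decode it as object_info_t.

    Assumption: 'rados' and 'ceph-dencoder' are installed and the
    cluster is reachable. Any @1 etc. fragments would need the same
    check applied to them as well.
    """
    raw = subprocess.run(["rados", "-p", pool, "getxattr", obj, "_"],
                         check=True, capture_output=True).stdout
    # ceph-dencoder imports from a file path, so spill the blob to disk.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, raw)
        os.close(fd)
        out = subprocess.run(
            ["ceph-dencoder", "type", "object_info_t", "import", path,
             "decode", "dump_json"],
            check=True, capture_output=True).stdout
    finally:
        os.unlink(path)
    return json.loads(out)


def is_suspect(info):
    # Per the diagnosis upthread: truncate_seq == 0 on the object, while
    # clients send reads with a non-zero truncate_seq, is the #12551
    # signature.
    return info.get("truncate_seq", 0) == 0
```

To walk a whole pool you could iterate over the names from 'rados -p &lt;pool&gt; ls' and call object_info() on each, though on a large pool that is one round trip per object and will be slow.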
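The cp question raised mid-thread (does the destination keep its inode?) is easy to check outside of Ceph: coreutils cp normally opens an existing destination for writing with O_TRUNC rather than unlinking it, which is exactly the open-with-truncate pattern the #12551 bug is sensitive to. A small self-contained demonstration of that behavior, with no Ceph involved (the 'kstat' filename just mirrors the example in the thread):

```python
import os
import tempfile

# Stand-in for the existing destination of 'cp qss kstat'.
tmpdir = tempfile.mkdtemp()
dest = os.path.join(tmpdir, "kstat")
with open(dest, "w") as f:
    f.write("old contents\n")
inode_before = os.stat(dest).st_ino

# Opening an existing path with mode 'w' is open(2) with
# O_WRONLY|O_CREAT|O_TRUNC: the file is truncated in place and the
# inode is reused -- no unlink, no new file.
with open(dest, "w") as f:
    f.write("new contents\n")
inode_after = os.stat(dest).st_ino

print(inode_before == inode_after)  # prints True: the inode survives
```

So a repeated 'cp qss kstat' over an existing file keeps the inode and truncates in place, consistent with the coreutils digging reported upthread.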