Re: CephFS "corruption" -- Nulled bytes

Not thoroughly tested, but I've got a quick and dirty script to fix
these up. Worst case scenario, it does nothing. In my limited testing,
the contents of the files come back without a remount of CephFS.

https://github.com/BeocatKSU/admin/blob/master/ec_cephfs_fixer.py

--
Adam

On Thu, Oct 8, 2015 at 11:11 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
> Hi Sage,
>
> Will this patch be in 0.94.4? We've got the same problem here.
>
> -Lincoln
>
>> On Oct 8, 2015, at 12:11 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>
>> On Wed, 7 Oct 2015, Adam Tygart wrote:
>>> Does this patch fix files that have been corrupted in this manner?
>>
>> Nope, it'll only prevent it from happening to new files (that haven't yet
>> been migrated between the cache and base tier).
>>
>>> If not, or I guess even if it does, is there a way to walk the
>>> metadata and data pools and find objects that are affected?
>>
>> Hmm, this may actually do the trick... find a file that appears to be
>> zeroed, and truncate it up and then down again.  For example, if foo is
>> 100 bytes, do
>>
>> truncate --size 101 foo
>> truncate --size 100 foo
>>
>> then unmount and remount the client and see if the content reappears.
>>
>> Assuming that works (it did in my simple test) it'd be pretty easy to
>> write something that walks the tree and does the truncate trick for any
>> file whose first however many bytes are 0 (though it will mess up
>> mtime...).
>>
>>> Is that '_' xattr in hammer? If so, how can I access it? Doing a
>>> listxattr on the inode just lists 'parent', and doing the same on the
>>> parent directory's inode simply lists 'parent'.
>>
>> This is the file in /var/lib/ceph/osd/ceph-NNN/current.  For example,
>>
>> $ attr -l ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "cephos.spill_out" has a 2 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "cephos.seq" has a 23 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "ceph._" has a 250 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "ceph._@1" has a 5 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>> Attribute "ceph.snapset" has a 31 byte value for ./3.0_head/10000000000.00000000__head_F0B56F30__3
>>
>> ...but hopefully you won't need to touch any of that ;)
>>
>> sage
>>
>>
>>>
>>> Thanks for your time.
>>>
>>> --
>>> Adam
>>>
>>>
>>> On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>> On Mon, 5 Oct 2015, Adam Tygart wrote:
>>>>> Okay, this has happened several more times. Always seems to be a small
>>>>> file that should be read-only (perhaps simultaneously) on many
>>>>> different clients. It is just through the CephFS interface that the
>>>>> files are corrupted; the objects in the cache pool and erasure-coded
>>>>> pool are still correct. I am beginning to doubt these files are
>>>>> getting a truncation request.
>>>>
>>>> This is still consistent with the #12551 bug.  The object data is correct,
>>>> but the cephfs truncation metadata on the object is wrong, causing it to
>>>> be implicitly zeroed out on read.  It's easily triggered by writers who
>>>> use O_TRUNC on open...
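For illustration, the write pattern being described is roughly the following
(a hedged sketch of the access pattern only, not a guaranteed reproducer):

    import os

    # Overwrite an existing file via O_TRUNC, as cp does when the
    # destination inode already exists.
    fd = os.open('kstat', os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o755)
    os.write(fd, b"#!/usr/bin/perl\n# updated contents\n")
    os.close(fd)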
>>>>
>>>>> Twice now it has been different Perl files, once someone's .bashrc, and
>>>>> once an input file for another application; timestamps on the
>>>>> files indicate that they haven't been modified in weeks.
>>>>>
>>>>> Any other possibilities? Or any way to figure out what happened?
>>>>
>>>> You can confirm by extracting the '_' xattr on the object (append any @1
>>>> etc fragments) and feeding it to ceph-dencoder with
>>>>
>>>> ceph-dencoder type object_info_t import <path_to_extracted_xattr> decode dump_json
>>>>
>>>> and confirming that truncate_seq is 0, and verifying that the truncate_seq
>>>> on the read request is non-zero... you'd need to turn up the OSD logs with
>>>> debug ms = 1 and look for the osd_op that looks like "read 0~$length
>>>> [$truncate_seq@$truncate_size]" (with real values in there).
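A hedged sketch of that extraction and decode step (assuming the on-disk
xattrs live in the user namespace, as the attr output above implies, and that
ceph-dencoder is in $PATH):

    import os, subprocess, tempfile

    def dump_object_info(obj_path):
        # Concatenate the '_' xattr with any @1, @2, ... fragments.
        data = os.getxattr(obj_path, 'user.ceph._')
        i = 1
        while True:
            try:
                data += os.getxattr(obj_path, 'user.ceph._@%d' % i)
            except OSError:
                break
            i += 1
        with tempfile.NamedTemporaryFile() as tmp:
            tmp.write(data)
            tmp.flush()
            return subprocess.check_output(
                ['ceph-dencoder', 'type', 'object_info_t',
                 'import', tmp.name, 'decode', 'dump_json'])

Then look for "truncate_seq": 0 in the resulting JSON.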
>>>>
>>>> ...but it really sounds like you're hitting the bug.  Unfortunately
>>>> the fix is not backported to hammer just yet.  You can follow
>>>>        http://tracker.ceph.com/issues/13034
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>>>
>>>>> --
>>>>> Adam
>>>>>
>>>>> On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>>>> I've done some digging into cp and mv's semantics (from coreutils). If
>>>>>> the destination inode already exists, the file will get truncated, then data will get
>>>>>> copied in. This is definitely within the scope of the bug above.
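A quick way to check that behaviour on a given system (a hedged sketch using
the file names from this thread):

    import os, subprocess

    # If cp reuses the destination inode, st_ino is unchanged after the
    # copy, i.e. the destination was truncated and rewritten in place.
    before = os.stat('kstat').st_ino
    subprocess.check_call(['cp', 'qss', 'kstat'])
    after = os.stat('kstat').st_ino
    print('same inode (truncate + overwrite):', before == after)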
>>>>>>
>>>>>> --
>>>>>> Adam
>>>>>>
>>>>>> On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>>>>> It may have been, although the timestamp on the file is almost a
>>>>>>> month old. The typical workflow for this particular file is to copy an
>>>>>>> updated version over the top of it.
>>>>>>>
>>>>>>> i.e. 'cp qss kstat'
>>>>>>>
>>>>>>> I'm not sure if cp semantics would keep the same inode and simply
>>>>>>> truncate/overwrite the contents, or if it would do an unlink and then
>>>>>>> create a new file.
>>>>>>> --
>>>>>>> Adam
>>>>>>>
>>>>>>> On Fri, Sep 25, 2015 at 8:00 PM, Ivo Jimenez <ivo@xxxxxxxxxxx> wrote:
>>>>>>>> Looks like you might be experiencing this bug:
>>>>>>>>
>>>>>>>>  http://tracker.ceph.com/issues/12551
>>>>>>>>
>>>>>>>> The fix has been merged to master and I believe it'll be part of Infernalis. The
>>>>>>>> original reproducer involved truncating/overwriting files. In your example,
>>>>>>>> do you know if 'kstat' has been truncated/overwritten prior to generating
>>>>>>>> the md5sums?
>>>>>>>>
>>>>>>>> On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart <mozes@xxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I've run into some sort of bug with CephFS. Client reads of a
>>>>>>>>> particular file return nothing but 40 KB of null bytes. Doing a
>>>>>>>>> rados-level get of the inode returns the whole file correctly.
>>>>>>>>>
>>>>>>>>> Tested via Linux 4.1, 4.2 kernel clients, and the 0.94.3 fuse client.
>>>>>>>>>
>>>>>>>>> Attached is a dynamic printk debug log of the ceph module from the Linux
>>>>>>>>> 4.2 client while cat'ing the file.
>>>>>>>>>
>>>>>>>>> My current thought is that there has to be a cache of the object
>>>>>>>>> *somewhere* that a 'rados get' bypasses.
>>>>>>>>>
>>>>>>>>> Even on hosts that have *never* read the file before, it returns
>>>>>>>>> null bytes from both the kernel and FUSE mounts.
>>>>>>>>>
>>>>>>>>> Background:
>>>>>>>>>
>>>>>>>>> 24x CentOS 7.1 hosts serving up RBD and CephFS with Ceph 0.94.3.
>>>>>>>>> CephFS is an EC k=8, m=4 pool with a size-3 writeback cache in front of it.
>>>>>>>>>
>>>>>>>>> # rados -p cachepool get 10004096b95.00000000 /tmp/kstat-cache
>>>>>>>>> # rados -p ec84pool get 10004096b95.00000000 /tmp/kstat-ec
>>>>>>>>> # md5sum /tmp/kstat*
>>>>>>>>> ddfbe886420a2cb860b46dc70f4f9a0d  /tmp/kstat-cache
>>>>>>>>> ddfbe886420a2cb860b46dc70f4f9a0d  /tmp/kstat-ec
>>>>>>>>> # file /tmp/kstat*
>>>>>>>>> /tmp/kstat-cache: Perl script, ASCII text executable
>>>>>>>>> /tmp/kstat-ec:    Perl script, ASCII text executable
>>>>>>>>>
>>>>>>>>> # md5sum ~daveturner/bin/kstat
>>>>>>>>> 1914e941c2ad5245a23e3e1d27cf8fde  /homes/daveturner/bin/kstat
>>>>>>>>> # file ~daveturner/bin/kstat
>>>>>>>>> /homes/daveturner/bin/kstat: data
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> Any more information you need?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Adam
>>>>>
>>>>>
>>>
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


