Re: MDS: journaler.pq decode error

Thanks very much John!  Skipping over the corrupt entry by setting a
new expire_pos seems to have worked.  The journal expire_pos is now
advancing and the pool is being purged.  It still has a little while to
go to catch up to the current write_pos, but the journal inspect
command now reports 'OK' for overall integrity.
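
For anyone following along, I verified with the same pair of commands
shown further down the thread, re-run after the header edit (add rank
arguments as appropriate for your setup):

# cephfs-journal-tool --journal=purge_queue journal inspect
# cephfs-journal-tool --journal=purge_queue header get

watching expire_pos advance toward write_pos in the header output.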

As recommended I did take an export of the journal first, and I'll take
a stab at going over it with a hex editor in the near future.  Worst
case, we go through the tag-and-scan procedure.
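
Sketch of that hex-editor pass, using the corrupt region reported by
journal inspect below (the export file name here is just an example):

# xxd -s 0x6819f8 -l 256 purge_queue.bin | less

i.e. start dumping at the reported corrupt region 0x6819f8-681a55 and
look for a recognizable inode number in the surrounding entries.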

thanks,
Ben


On Thu, Jun 21, 2018 at 9:04 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Wed, Jun 20, 2018 at 2:17 PM Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>
>> Thanks for the response.  I was also hoping to be able to debug better
>> once we got onto Mimic.  We just finished that upgrade yesterday, and
>> cephfs-journal-tool does find corruption in the purge queue, though
>> our MDS continues to start up and the filesystem appears to be
>> functioning as usual.
>>
>> How can I modify the purge queue to remove damaged sections?
>
> Before any modifications, use the "cephfs-journal-tool
> --journal=purge_queue journal export <file>" command to take a backup.
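>
> As a concrete sketch (the output file name here is just an example,
> and on mimic you may also need a --rank argument naming your
> filesystem and rank):
>
> # cephfs-journal-tool --journal=purge_queue journal export /root/purge_queue.bin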
>
> The 'splice' mode of cephfs-journal-tool exists for this purpose on
> the ordinary log, but unfortunately we don't have an equivalent for
> the purge queue at the moment (http://tracker.ceph.com/issues/24604).
>
> Since the header commands are implemented, you could do a "header set
> expire_pos 6822485" to skip over the corrupt entry.  You would run
> that while the MDS was not running, and then start the MDS again
> afterwards.  Hopefully all the subsequent entries are valid, and you
> will only be orphaning the objects for one file.
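>
> (That value isn't arbitrary: 6822485 is the decimal form of 0x681a55,
> the end of the corrupt region your "journal inspect" output reports,
> and the current expire_pos 6822392 is 0x6819f8, its start.)  As a
> rough sketch of the whole sequence -- adjust the stop/start commands
> to however your daemons are managed:
>
> # systemctl stop ceph-mds@<id>
> # cephfs-journal-tool --journal=purge_queue header set expire_pos 6822485
> # systemctl start ceph-mds@<id>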
>
>> Is there
>> some way to scan known FS objects and remove any that might now be
>> orphaned once the damage is removed/repaired?
>
> Yes, we do that using the online "ceph daemon mds.<id> tag path"
> operation to tag all the non-orphan files' data objects, followed by a
> cephfs-data-scan operation to scan for anything untagged.  However, in
> this instance I'd be more inclined to try going over your journal
> export with a hex editor to see in what way it was corrupted, and
> maybe the inode of the entry we skipped will still be visible there,
> saving you a big O(N) scan over data objects.
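>
> In rough outline -- treat the exact cephfs-data-scan flags here as
> from-memory assumptions and check them against the --help output
> before running anything:
>
> # ceph daemon mds.<id> tag path / mytag
> # cephfs-data-scan scan_extents --filter-tag mytag <data pool>
>
> The first command tags every data object reachable from the root, and
> the second pass then only considers objects lacking the tag, i.e. the
> orphan candidates.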
>
> John
>
>>
>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>
>> Overall journal integrity: DAMAGED
>> Corrupt regions:
>>   0x6819f8-681a55
>>
>> # cephfs-journal-tool --journal=purge_queue header get
>>
>> {
>>     "magic": "ceph fs volume v011",
>>     "write_pos": 203357732,
>>     "expire_pos": 6822392,
>>     "trimmed_pos": 4194304,
>>     "stream_format": 1,
>>     "layout": {
>>         "stripe_unit": 4194304,
>>         "stripe_count": 1,
>>         "object_size": 4194304,
>>         "pool_id": 64,
>>         "pool_ns": ""
>>     }
>> }
>>
>> thanks,
>> Ben
>>
>> On Fri, Jun 15, 2018 at 11:54 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>> > On Fri, Jun 15, 2018 at 2:55 PM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>> >> I have seen some posts and issue trackers related to this topic in
>> >> the past, but haven't been able to put the pieces together to resolve
>> >> the issue I'm having.  All on Luminous 12.2.5 (upgraded over time from
>> >> past releases).  We are going to upgrade to Mimic in the near future
>> >> if that would somehow resolve the issue.
>> >>
>> >> Summary:
>> >>
>> >> 1.  We have a CephFS data pool which has steadily and slowly grown in
>> >> size without corresponding writes to the directory placed on it - a
>> >> plot of usage over a few hours shows a very regular upward rate of
>> >> increase.  The pool is now at 300TB versus 16TB of actual space used
>> >> in the directory.
>> >>
>> >> 2.  Reading through some email posts and issue trackers led me to
>> >> disable 'standby replay', though we have never used snapshots.
>> >> Disabling that feature on our 3 MDS daemons (config sketch below)
>> >> stopped the steady climb.  However, the pool remains with 300TB of
>> >> unaccounted-for space usage.  http://tracker.ceph.com/issues/19593 and
>> >> http://tracker.ceph.com/issues/21551
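>> >>
>> >> (By "disabling" I mean turning off the Luminous-era per-daemon
>> >> standby-replay setting, roughly like this in ceph.conf on each MDS
>> >> host, followed by a daemon restart -- section name abbreviated:
>> >>
>> >> [mds.<name>]
>> >>     mds_standby_replay = false
>> >> )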
>> >
>> > This is pretty strange -- if you were already on 12.2.5 then the
>> > http://tracker.ceph.com/issues/19593 should have been fixed and
>> > switching standby replays on/off shouldn't make a difference (unless
>> > there's some similar bug that crept back into luminous).
>> >
>> >> 3.  I've never had any issue starting the MDS or with filesystem
>> >> functionality, but looking through the MDS logs I see a single
>> >> 'journaler.pq(rw) _decode error from assimilate_prefetch' at every
>> >> startup.  A log snippet with context is below, with debug_mds and
>> >> debug_journaler at 20.
>> >
>> > This message suggests that the purge queue has been corrupted, but the
>> > MDS is ignoring this -- something is wrong with the error handling.
>> > The MDS should be marked damaged when something like this happens, but
>> > in this case PurgeQueue is apparently dropping the error on the floor
>> > after it gets logged by Journaler.  I've opened a ticket+PR for the
>> > error handling here: http://tracker.ceph.com/issues/24533 (however,
>> > the loading path in PurgeQueue::_recover *does* have error handling so
>> > I'm not clear why that isn't happening in your case).
>> >
>> > I believe cephfs-journal-tool in mimic was enhanced to be able to
>> > optionally operate on the purge queue as well as the metadata journal
>> > (they use the same underlying format), so upgrading to mimic would
>> > give you better tooling for debugging this.
>> >
>> > John
>> >
>> >
>> >> As noted, there is at least one past email thread on the topic, but
>> >> I'm not quite having the same issue as that poster, and I couldn't
>> >> glean from it what I should do to repair this error and get stale
>> >> objects purged from this pool (if that is in fact the issue):
>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021379.html
>> >>
>> >> Any thoughts on troubleshooting steps I could try next?
>> >>
>> >> Here is the log snippet:
>> >>
>> >> 2018-06-15 09:14:50.746831 7fb47251b700 20 mds.0.journaler.pq(rw)
>> >> write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.746835 7fb47251b700 10 mds.0.journaler.pq(rw)
>> >> append_entry len 81 to 88121773~101
>> >> 2018-06-15 09:14:50.746838 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>> >> 2018-06-15 09:14:50.746863 7fb47251b700 20 mds.0.journaler.pq(rw)
>> >> write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.746864 7fb47251b700 10 mds.0.journaler.pq(rw)
>> >> append_entry len 81 to 88121874~101
>> >> 2018-06-15 09:14:50.746867 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>> >> 2018-06-15 09:14:50.746901 7fb46fd16700 10 mds.0.journaler.pq(rw)
>> >> _finish_read got 6822392~1566216
>> >> 2018-06-15 09:14:50.746909 7fb46fd16700 10 mds.0.journaler.pq(rw)
>> >> _assimilate_prefetch 6822392~1566216
>> >> 2018-06-15 09:14:50.746911 7fb46fd16700 10 mds.0.journaler.pq(rw)
>> >> _assimilate_prefetch gap of 4194304 from received_pos 8388608 to first
>> >> prefetched buffer 12582912
>> >> 2018-06-15 09:14:50.746913 7fb46fd16700 10 mds.0.journaler.pq(rw)
>> >> _assimilate_prefetch read_buf now 6822392~1566216, read pointers
>> >> 6822392/8388608/50331648
>> >>
>> >> === error here ===> 2018-06-15 09:14:50.746965 7fb46fd16700 -1
>> >> mds.0.journaler.pq(rw) _decode error from assimilate_prefetch
>> >>
>> >> 2018-06-15 09:14:50.746994 7fb47251b700 20 mds.0.journaler.pq(rw)
>> >> write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.746998 7fb47251b700 10 mds.0.journaler.pq(rw)
>> >> append_entry len 81 to 88121975~101
>> >> 2018-06-15 09:14:50.747007 7fb47251b700 10 mds.0.journaler.pq(rw)
>> >> wait_for_readable at 6822392 onreadable 0x557ee0f58300
>> >> 2018-06-15 09:14:50.747042 7fb47251b700 20 mds.0.journaler.pq(rw)
>> >> write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.747043 7fb47251b700 10 mds.0.journaler.pq(rw)
>> >> append_entry len 81 to 88122076~101
>> >> 2018-06-15 09:14:50.747063 7fb47251b700 20 mds.0.journaler.pq(rw)
>> >> write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.747064 7fb47251b700 10 mds.0.journaler.pq(rw)
>> >> append_entry len 81 to 88122177~101
>> >> 2018-06-15 09:14:50.747113 7fb47251b700 20 mds.0.journaler.pq(rw)
>> >> write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.747114 7fb47251b700 10 mds.0.journaler.pq(rw)
>> >> append_entry len 81 to 88122278~101
>> >> 2018-06-15 09:14:50.747136 7fb47251b700 20 mds.0.journaler.pq(rw)
>> >> write_buf_throttle get, delta 101


