Re: MDS: journaler.pq decode error

I do have one follow-up question: while doing this I took all the
standby MDS daemons offline, and max_mds on our cluster is 1.  If I
were to enable multiple active MDS daemons, would they all split up
processing of the purge queue?  We have never run multiple active MDS
daemons, but we plan to enable them now that the feature has been
stable for a while (and is now the default in Mimic).  My MDS is
currently running at an elevated level of CPU usage while it processes
the queue backlog, though not enough to cause problems.  I'm not
really inclined to throw another variable into the mix right now, but
the answer to the question might be interesting for future reference.

thanks,
Ben

On Thu, Jun 21, 2018 at 11:32 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
> Thanks very much John!  Skipping over the corrupt entry by setting a
> new expire_pos seems to have worked.  The journal expire_pos is now
> advancing and pools are being purged.  It has a little while to go to
> catch up to the current write_pos, but the journal inspect command
> reports 'OK' for overall integrity.
>
> As recommended, I took an export of the journal first, and I'll take
> a stab at examining it with a hex editor in the near future.  Worst
> case, we go through the tag/scan procedure if necessary.
>
> thanks,
> Ben
>
>
> On Thu, Jun 21, 2018 at 9:04 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>> On Wed, Jun 20, 2018 at 2:17 PM Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>>
>>> Thanks for the response.  I was also hoping to be able to debug better
>>> once we got onto Mimic.  We just finished that upgrade yesterday, and
>>> cephfs-journal-tool does find corruption in the purge queue, though
>>> our MDS continues to start up and the filesystem appears to be
>>> functional as usual.
>>>
>>> How can I modify the purge queue to remove damaged sections?
>>
>> Before any modifications, use the "cephfs-journal-tool
>> --journal=purge_queue export <file>" command to take a backup.
>>
>> The 'splice' mode of cephfs-journal-tool exists for this purpose with
>> the ordinary log, but unfortunately we don't have an equivalent for
>> the purge queue at the moment (http://tracker.ceph.com/issues/24604).
>>
>> Since the header commands are implemented, you could do a "header set
>> expire_pos 6822485" to skip over the corrupt entry.  You would run
>> that while the MDS was not running, and then start the MDS again
>> afterwards.  Hopefully all the subsequent entries are valid, and you
>> will only be orphaning the objects for one file.
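
For reference, the 6822485 value is just the end of the corrupt region
reported by "journal inspect" below: the current expire_pos (6822392) is
that region's start, 0x6819f8, and the proposed new expire_pos is its
end, 0x681a55.  The arithmetic can be sanity-checked directly:

```python
# Corrupt region reported by
# "cephfs-journal-tool --journal=purge_queue journal inspect"
corrupt_start = 0x6819f8   # == 6822392, the current expire_pos
corrupt_end = 0x681a55     # == 6822485, the proposed new expire_pos

assert corrupt_start == 6822392
assert corrupt_end == 6822485

# Setting expire_pos to the region's end skips exactly the damaged bytes.
print(corrupt_end - corrupt_start)  # -> 93 bytes skipped
```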
>>
>>> Is there
>>> some way to scan known FS objects and remove any that might now be
>>> orphaned once the damage is removed/repaired?
>>
>> Yes, we do that using the online "ceph daemon mds.<id> tag path"
>> operation to tag all the non-orphan files' data objects, followed by a
>> cephfs-data-scan operation to scan for anything untagged.  However, in
>> this instance I'd be more inclined to try going over your journal
>> export with a hex editor to see in what way it was corrupted, and
>> maybe the inode of the entry we skipped will still be visible there,
>> saving you a big O(N) scan over data objects.
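
For the hex-editor pass over the export, a few lines of Python can serve
as well.  A minimal sketch (the filename "purge_queue.export" is
hypothetical, and offsets within the export may not map one-to-one onto
journal positions depending on where the export starts, so treat the
region bounds as a starting point):

```python
import binascii

def dump_region(path, start, end, context=32):
    """Hex-dump the byte range [start - context, end + context) of a file."""
    base = max(start - context, 0)
    with open(path, "rb") as f:
        f.seek(base)
        data = f.read((end + context) - base)
    lines = []
    for i in range(0, len(data), 16):
        chunk = data[i:i + 16]
        lines.append("%08x  %s" % (base + i, binascii.hexlify(chunk).decode()))
    for line in lines:
        print(line)
    return lines

# e.g. dump_region("purge_queue.export", 0x6819f8, 0x681a55)
```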
>>
>> John
>>
>>>
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>>
>>> Overall journal integrity: DAMAGED
>>> Corrupt regions:
>>>   0x6819f8-681a55
>>>
>>> # cephfs-journal-tool --journal=purge_queue header get
>>>
>>> {
>>>     "magic": "ceph fs volume v011",
>>>     "write_pos": 203357732,
>>>     "expire_pos": 6822392,
>>>     "trimmed_pos": 4194304,
>>>     "stream_format": 1,
>>>     "layout": {
>>>         "stripe_unit": 4194304,
>>>         "stripe_count": 1,
>>>         "object_size": 4194304,
>>>         "pool_id": 64,
>>>         "pool_ns": ""
>>>     }
>>> }
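
Given the layout in that header (4 MiB objects, stripe_count 1), the
RADOS object holding the damaged bytes falls out of simple division.
This is a sketch for locating it; the assumption that purge queue
journal objects are named 500.<index> should be verified against
"rados ls" on the metadata pool:

```python
object_size = 4194304      # from the header above
expire_pos = 6822392       # start of the corrupt region
corrupt_end = 0x681a55     # 6822485, from "journal inspect"

obj_index = expire_pos // object_size   # which journal object
obj_offset = expire_pos % object_size   # offset within that object

print(obj_index, hex(obj_offset))  # -> 1 0x2819f8
# i.e. the corruption starts in the second 4 MiB object at offset 0x2819f8,
# and the whole corrupt region fits within that same object.
```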
>>>
>>> thanks,
>>> Ben
>>>
>>> On Fri, Jun 15, 2018 at 11:54 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>>> > On Fri, Jun 15, 2018 at 2:55 PM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>> >> I have seen some posts and issue trackers related to this topic in
>>> >> the past, but I haven't been able to put it together to resolve the
>>> >> issue I'm having.  We're all on Luminous 12.2.5 (upgraded over time
>>> >> from past releases).  We are going to upgrade to Mimic in the near
>>> >> future if that would somehow resolve the issue.
>>> >>
>>> >> Summary:
>>> >>
>>> >> 1.  We have a CephFS data pool which has steadily and slowly grown
>>> >> in size without corresponding writes to the directory placed on it -
>>> >> a plot of usage over a few hours shows a very regular upward rate of
>>> >> increase.  The pool is now at 300TB vs 16TB of actual space used in
>>> >> the directory.
>>> >>
>>> >> 2.  Reading through some email posts and issue trackers led me to
>>> >> disable 'standby replay', though we are not and have not ever used
>>> >> snapshots.  Disabling that feature on our 3 MDS daemons stopped the
>>> >> steady climb.  However, the pool remains with 300TB of
>>> >> unaccounted-for space usage.  http://tracker.ceph.com/issues/19593
>>> >> and http://tracker.ceph.com/issues/21551
>>> >
>>> > This is pretty strange -- if you were already on 12.2.5 then the
>>> > http://tracker.ceph.com/issues/19593 should have been fixed and
>>> > switching standby replays on/off shouldn't make a difference (unless
>>> > there's some similar bug that crept back into luminous).
>>> >
>>> >> 3.  I've never had any issue starting the MDS or with filesystem
>>> >> functionality, but looking through the MDS logs I see a single
>>> >> 'journaler.pq(rw) _decode error from assimilate_prefetch' at every
>>> >> startup.  A log snippet with context is below, with debug_mds and
>>> >> debug_journaler at 20.
>>> >
>>> > This message suggests that the purge queue has been corrupted, but the
>>> > MDS is ignoring this -- something is wrong with the error handling.
>>> > The MDS should be marked damaged when something like this happens, but
>>> > in this case PurgeQueue is apparently dropping the error on the floor
>>> > after it gets logged by Journaler.  I've opened a ticket+PR for the
>>> > error handling here: http://tracker.ceph.com/issues/24533 (however,
>>> > the loading path in PurgeQueue::_recover *does* have error handling so
>>> > I'm not clear why that isn't happening in your case).
>>> >
>>> > I believe cephfs-journal-tool in mimic was enhanced to be able to
>>> > optionally operate on the purge queue as well as the metadata journal
>>> > (they use the same underlying format), so upgrading to mimic would
>>> > give you better tooling for debugging this.
>>> >
>>> > John
>>> >
>>> >
>>> >> As noted, there is at least one past email thread on the topic, but
>>> >> I'm not quite having the same issue as this person, and I couldn't
>>> >> glean any information as to what I should do to repair this error
>>> >> and get stale objects purged from this pool (if that is in fact the
>>> >> issue):
>>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021379.html
>>> >>
>>> >> Any thoughts on troubleshooting steps I could try next?
>>> >>
>>> >> Here is the log snippet:
>>> >>
>>> >> 2018-06-15 09:14:50.746831 7fb47251b700 20 mds.0.journaler.pq(rw)
>>> >> write_buf_throttle get, delta 101
>>> >> 2018-06-15 09:14:50.746835 7fb47251b700 10 mds.0.journaler.pq(rw)
>>> >> append_entry len 81 to 88121773~101
>>> >> 2018-06-15 09:14:50.746838 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>>> >> 2018-06-15 09:14:50.746863 7fb47251b700 20 mds.0.journaler.pq(rw)
>>> >> write_buf_throttle get, delta 101
>>> >> 2018-06-15 09:14:50.746864 7fb47251b700 10 mds.0.journaler.pq(rw)
>>> >> append_entry len 81 to 88121874~101
>>> >> 2018-06-15 09:14:50.746867 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>>> >> 2018-06-15 09:14:50.746901 7fb46fd16700 10 mds.0.journaler.pq(rw)
>>> >> _finish_read got 6822392~1566216
>>> >> 2018-06-15 09:14:50.746909 7fb46fd16700 10 mds.0.journaler.pq(rw)
>>> >> _assimilate_prefetch 6822392~1566216
>>> >> 2018-06-15 09:14:50.746911 7fb46fd16700 10 mds.0.journaler.pq(rw)
>>> >> _assimilate_prefetch gap of 4194304 from received_pos 8388608 to first
>>> >> prefetched buffer 12582912
>>> >> 2018-06-15 09:14:50.746913 7fb46fd16700 10 mds.0.journaler.pq(rw)
>>> >> _assimilate_prefetch read_buf now 6822392~1566216, read pointers
>>> >> 6822392/8388608/50331648
>>> >>
>>> >> === error here ===> 2018-06-15 09:14:50.746965 7fb46fd16700 -1
>>> >> mds.0.journaler.pq(rw) _decode error from assimilate_prefetch
>>> >>
>>> >> 2018-06-15 09:14:50.746994 7fb47251b700 20 mds.0.journaler.pq(rw)
>>> >> write_buf_throttle get, delta 101
>>> >> 2018-06-15 09:14:50.746998 7fb47251b700 10 mds.0.journaler.pq(rw)
>>> >> append_entry len 81 to 88121975~101
>>> >> 2018-06-15 09:14:50.747007 7fb47251b700 10 mds.0.journaler.pq(rw)
>>> >> wait_for_readable at 6822392 onreadable 0x557ee0f58300
>>> >> 2018-06-15 09:14:50.747042 7fb47251b700 20 mds.0.journaler.pq(rw)
>>> >> write_buf_throttle get, delta 101
>>> >> 2018-06-15 09:14:50.747043 7fb47251b700 10 mds.0.journaler.pq(rw)
>>> >> append_entry len 81 to 88122076~101
>>> >> 2018-06-15 09:14:50.747063 7fb47251b700 20 mds.0.journaler.pq(rw)
>>> >> write_buf_throttle get, delta 101
>>> >> 2018-06-15 09:14:50.747064 7fb47251b700 10 mds.0.journaler.pq(rw)
>>> >> append_entry len 81 to 88122177~101
>>> >> 2018-06-15 09:14:50.747113 7fb47251b700 20 mds.0.journaler.pq(rw)
>>> >> write_buf_throttle get, delta 101
>>> >> 2018-06-15 09:14:50.747114 7fb47251b700 10 mds.0.journaler.pq(rw)
>>> >> append_entry len 81 to 88122278~101
>>> >> 2018-06-15 09:14:50.747136 7fb47251b700 20 mds.0.journaler.pq(rw)
>>> >> write_buf_throttle get, delta 101
>>> >> _______________________________________________
>>> >> ceph-users mailing list
>>> >> ceph-users@xxxxxxxxxxxxxx
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


