Thanks very much, John! Skipping over the corrupt entry by setting a new expire_pos seems to have worked. The journal expire_pos is now advancing and pools are being purged. It has a little while to go to catch up to the current write_pos, but the journal inspect command now gives an 'OK' for overall integrity. As recommended, I did take an export of the journal first, and I'll take a stab at going over it with a hex editor in the near future. Worst case, we go through the tag/scan process if necessary.

thanks,
Ben

On Thu, Jun 21, 2018 at 9:04 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Wed, Jun 20, 2018 at 2:17 PM Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>
>> Thanks for the response. I was also hoping to be able to debug better once we got onto Mimic. We just finished that upgrade yesterday, and cephfs-journal-tool does find a corruption in the purge queue, though our MDS continues to start up and the filesystem appears to be functional as usual.
>>
>> How can I modify the purge queue to remove damaged sections?
>
> Before any modifications, use the "cephfs-journal-tool --journal=purge_queue export <file>" command to take a backup.
>
> The 'splice' mode of cephfs-journal-tool exists for this purpose with the ordinary log, but unfortunately we don't have an equivalent for the purge queue at the moment (http://tracker.ceph.com/issues/24604).
>
> Since the header commands are implemented, you could do a "header set expire_pos 6822485" to skip over the corrupt entry. You would run that while the MDS was not running, and then start the MDS again afterwards. Hopefully all the subsequent entries are valid, and you will only be orphaning the objects for one file.
>
>> Is there some way to scan known FS objects and remove any that might now be orphaned once the damage is removed/repaired?
>
> Yes, we do that using the online "ceph daemon mds.<id> tag path" operation to tag all the non-orphan files' data objects, followed by a cephfs-data-scan operation to scan for anything untagged. However, in this instance I'd be more inclined to try going over your journal export with a hex editor to see in what way it was corrupted, and maybe the inode of the entry we skipped will still be visible there, saving you a big O(N) scan over data objects.
>
> John
>
>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>
>> Overall journal integrity: DAMAGED
>> Corrupt regions:
>> 0x6819f8-681a55
>>
>> # cephfs-journal-tool --journal=purge_queue header get
>>
>> {
>>     "magic": "ceph fs volume v011",
>>     "write_pos": 203357732,
>>     "expire_pos": 6822392,
>>     "trimmed_pos": 4194304,
>>     "stream_format": 1,
>>     "layout": {
>>         "stripe_unit": 4194304,
>>         "stripe_count": 1,
>>         "object_size": 4194304,
>>         "pool_id": 64,
>>         "pool_ns": ""
>>     }
>> }
>>
>> thanks,
>> Ben
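(For the archives, the skip described above boils down to roughly the following, run while the MDS is stopped. "purge_queue.bin" is just a placeholder name for the backup file, and I believe newer releases also want a --rank argument to cephfs-journal-tool, so treat this as a sketch rather than a copy-paste recipe:)

    # back up the purge queue before touching anything
    cephfs-journal-tool --journal=purge_queue journal export purge_queue.bin

    # move expire_pos just past the corrupt region reported by 'journal inspect'
    cephfs-journal-tool --journal=purge_queue header set expire_pos 6822485

    # start the MDS again, then re-check
    cephfs-journal-tool --journal=purge_queue journal inspect
    cephfs-journal-tool --journal=purge_queue header get

The value 6822485 is simply the end of the corrupt region (0x681a55) from the inspect output above.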
>> On Fri, Jun 15, 2018 at 11:54 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>> > On Fri, Jun 15, 2018 at 2:55 PM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>> >> I've seen some posts and issue trackers related to this topic in the past, but I haven't been able to put it together to resolve the issue I'm having. All on Luminous 12.2.5 (upgraded over time from past releases). We are going to upgrade to Mimic in the near future if that would somehow resolve the issue.
>> >>
>> >> Summary:
>> >>
>> >> 1. We have a CephFS data pool which has steadily and slowly grown in size without corresponding writes to the directory placed on it - a plot of usage over a few hours shows a very regular upward rate of increase. The pool is now 300TB vs 16TB of actual space used in the directory.
>> >>
>> >> 2. Reading through some email posts and issue trackers led me to disable 'standby replay', though we have never used snapshots. Disabling that feature on our 3 MDSes stopped the steady climb. However, the pool remains with 300TB of unaccounted-for space usage. http://tracker.ceph.com/issues/19593 and http://tracker.ceph.com/issues/21551
>> >
>> > This is pretty strange -- if you were already on 12.2.5 then http://tracker.ceph.com/issues/19593 should have been fixed, and switching standby replays on/off shouldn't make a difference (unless there's some similar bug that crept back into luminous).
>> >
>> >> 3. I've never had any issue starting the MDS or with filesystem functionality, but looking through the MDS logs I see a single 'journaler.pq(rw) _decode error from assimilate_prefetch' at every startup. A log snippet with context is below, with debug_mds and debug_journaler at 20.
>> >
>> > This message suggests that the purge queue has been corrupted, but the MDS is ignoring this -- something is wrong with the error handling. The MDS should be marked damaged when something like this happens, but in this case PurgeQueue is apparently dropping the error on the floor after it gets logged by Journaler. I've opened a ticket+PR for the error handling here: http://tracker.ceph.com/issues/24533 (however, the loading path in PurgeQueue::_recover *does* have error handling, so I'm not clear why that isn't happening in your case).
>> >
>> > I believe cephfs-journal-tool in mimic was enhanced to be able to optionally operate on the purge queue as well as the metadata journal (they use the same underlying format), so upgrading to mimic would give you better tooling for debugging this.
>> >
>> > John
>> >
>> >> As noted, there is at least one past email thread on the topic, but I'm not quite having the same issue as this person, and I couldn't glean any information about what I should do to repair this error and get stale objects purged from this pool (if that is in fact the issue): http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021379.html
>> >>
>> >> Any thoughts on troubleshooting steps I could try next?
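(Side note on the 300TB vs 16TB comparison above, in case it helps anyone doing the same check: the directory's recursive usage can be read from CephFS's ceph.dir.rbytes virtual xattr and lined up against the pool numbers in ceph df. A minimal sketch, assuming the filesystem is mounted at /mnt/cephfs and the directory whose layout points at the pool is /mnt/cephfs/somedir, both placeholders:)

    # bytes referenced under the directory, as tracked recursively by the MDS
    getfattr -n ceph.dir.rbytes /mnt/cephfs/somedir

    # compare against what the data pool reports as stored
    ceph df detail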
>> >> Here is the log snippet:
>> >>
>> >> 2018-06-15 09:14:50.746831 7fb47251b700 20 mds.0.journaler.pq(rw) write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.746835 7fb47251b700 10 mds.0.journaler.pq(rw) append_entry len 81 to 88121773~101
>> >> 2018-06-15 09:14:50.746838 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>> >> 2018-06-15 09:14:50.746863 7fb47251b700 20 mds.0.journaler.pq(rw) write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.746864 7fb47251b700 10 mds.0.journaler.pq(rw) append_entry len 81 to 88121874~101
>> >> 2018-06-15 09:14:50.746867 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>> >> 2018-06-15 09:14:50.746901 7fb46fd16700 10 mds.0.journaler.pq(rw) _finish_read got 6822392~1566216
>> >> 2018-06-15 09:14:50.746909 7fb46fd16700 10 mds.0.journaler.pq(rw) _assimilate_prefetch 6822392~1566216
>> >> 2018-06-15 09:14:50.746911 7fb46fd16700 10 mds.0.journaler.pq(rw) _assimilate_prefetch gap of 4194304 from received_pos 8388608 to first prefetched buffer 12582912
>> >> 2018-06-15 09:14:50.746913 7fb46fd16700 10 mds.0.journaler.pq(rw) _assimilate_prefetch read_buf now 6822392~1566216, read pointers 6822392/8388608/50331648
>> >>
>> >> === error here ===> 2018-06-15 09:14:50.746965 7fb46fd16700 -1 mds.0.journaler.pq(rw) _decode error from assimilate_prefetch
>> >>
>> >> 2018-06-15 09:14:50.746994 7fb47251b700 20 mds.0.journaler.pq(rw) write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.746998 7fb47251b700 10 mds.0.journaler.pq(rw) append_entry len 81 to 88121975~101
>> >> 2018-06-15 09:14:50.747007 7fb47251b700 10 mds.0.journaler.pq(rw) wait_for_readable at 6822392 onreadable 0x557ee0f58300
>> >> 2018-06-15 09:14:50.747042 7fb47251b700 20 mds.0.journaler.pq(rw) write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.747043 7fb47251b700 10 mds.0.journaler.pq(rw) append_entry len 81 to 88122076~101
>> >> 2018-06-15 09:14:50.747063 7fb47251b700 20 mds.0.journaler.pq(rw) write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.747064 7fb47251b700 10 mds.0.journaler.pq(rw) append_entry len 81 to 88122177~101
>> >> 2018-06-15 09:14:50.747113 7fb47251b700 20 mds.0.journaler.pq(rw) write_buf_throttle get, delta 101
>> >> 2018-06-15 09:14:50.747114 7fb47251b700 10 mds.0.journaler.pq(rw) append_entry len 81 to 88122278~101
>> >> 2018-06-15 09:14:50.747136 7fb47251b700 20 mds.0.journaler.pq(rw) write_buf_throttle get, delta 101
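(A closing note on the hex editor plan mentioned at the top: 'journal inspect' reported the corrupt region as 0x6819f8-0x681a55, so something like the command below should show the bytes in and around it in the export taken earlier. This assumes the exported dump keeps data at its original journal offsets, which is worth sanity-checking against write_pos before trusting the numbers; purge_queue.bin is the same placeholder backup name as above.)

    # dump 512 bytes starting a little before the reported corrupt region
    hexdump -C -s 0x681980 -n 512 purge_queue.bin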