Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Ugh, something went horribly wrong. I've downloaded the MDS log that
contains the assertion failure, and it looks related to this tracker [0]. Do
you have client logs for this?

The other log that you shared is downloading right now; once that's done
and I've gone through it, I'll update you.

[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

> Hi Dhairya,
>
> Sorry to resurrect this thread again, but we still unfortunately have an
> issue with our filesystem after we attempted to write new backups to it.
>
> We finished the scrub of the filesystem on Friday and ran a repair scrub
> on the 1 directory which had metadata damage. After doing so and rebooting,
> the cluster reported no issues and data was accessible again.
>
> We re-started the backups to run over the weekend and unfortunately the
> filesystem crashed again; the log of the failure is here:
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
> We ran the backups on kernel mounts of the filesystem without the nowsync
> option this time to avoid the out-of-sync write problems.
>
> I've tried resetting the journal again after recovering the dentries but
> unfortunately the filesystem is still in a failed state despite setting
> joinable to true. The log of this crash is here:
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708
> .
>
> I'm not sure how to proceed as I can't seem to get any MDS to take over
> the first rank. I would like to do a scrub of the filesystem and preferably
> overwrite the troublesome files with the originals on the live filesystem.
> Do you have any advice on how to make the filesystem leave its failed
> state? I have a backup of the journal before I reset it so I can roll back
> if necessary.
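>
> For reference, the journal backup and any rollback would be along these
> lines with cephfs-journal-tool (the file name here is just illustrative):
>
> # export taken before the reset; import only if we need to roll back
> $ cephfs-journal-tool --rank=ceph_backup:0 journal export backup.ceph_backup.0.bin
> $ cephfs-journal-tool --rank=ceph_backup:0 journal import backup.ceph_backup.0.bin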
>
> Here are some details about the filesystem at present:
>
> root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
>   cluster:
>     id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 large omap objects
>             1 filesystem is offline
>             1 mds daemon damaged
>
> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
>             1750 pgs not deep-scrubbed in time
>             1612 pgs not scrubbed in time
>
>   services:
>     mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4
> (age 50m)
>     mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3,
> pebbles-s4
>     mds: 1/2 daemons up, 3 standby
>     osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped
> pgs
>          flags
> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
>
>   data:
>     volumes: 1/2 healthy, 1 recovering; 1 damaged
>     pools:   7 pools, 2177 pgs
>     objects: 3.24G objects, 6.7 PiB
>     usage:   8.6 PiB used, 14 PiB / 23 PiB avail
>     pgs:     11785954/27384310061 objects misplaced (0.043%)
>              2167 active+clean
>              6    active+remapped+backfilling
>              4    active+remapped+backfill_wait
>
> ceph_backup - 0 clients
> ===========
> RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
>  0    failed
>         POOL            TYPE     USED  AVAIL
>    mds_backup_fs      metadata  1174G  3071G
> ec82_primary_fs_data    data       0   3071G
>       ec82pool          data    8085T  4738T
> ceph_archive - 2 clients
> ============
> RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
>  0    active  pebbles-s4  Reqs:    0 /s  13.4k  7105    118      2
>         POOL            TYPE     USED  AVAIL
>    mds_archive_fs     metadata  5184M  3071G
> ec83_primary_fs_data    data       0   3071G
>       ec83pool          data     138T  4307T
> STANDBY MDS
>  pebbles-s2
>  pebbles-s3
>  pebbles-s1
> MDS version: ceph version 17.2.7
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> root@pebbles-s2 11:55 [~]: ceph fs dump
> e2643889
> enable_multiple, ever_enabled_multiple: 1,1
> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 1
>
> Filesystem 'ceph_backup' (1)
> fs_name    ceph_backup
> epoch    2643888
> flags    12 joinable allow_snaps allow_multimds_snaps
> created    2023-05-19T12:52:36.302135+0100
> modified    2024-07-08T11:17:55.437861+0100
> tableserver    0
> root    0
> session_timeout    60
> session_autoclose    300
> max_file_size    109934182400000
> required_client_features    {}
> last_failure    0
> last_failure_osd_epoch    494515
> compat    compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds    1
> in    0
> up    {}
> failed
> damaged    0
> stopped
> data_pools    [6,3]
> metadata_pool    2
> inline_data    disabled
> balancer
> standby_count_wanted    1
>
>
> Kindest regards,
>
> Ivan
> On 28/06/2024 15:17, Dhairya Parmar wrote:
>
> On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
> wrote:
>
>> Hi Dhairya,
>>
>> I would be more than happy to share our corrupted journal. Has the host
>> key changed for drop.ceph.com? The fingerprint I'm being sent is
>> 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is different to the one
>> in our /usr/share/ceph/known_hosts_drop.ceph.com.
>>
> Ah, strange. Let me get in touch with folks who might know about this and
> get back to you ASAP.
>
>> Thank you for your advice as well. We've reset our MDS' journal and are
>> currently in the process of a full filesystem scrub which understandably is
>> taking quite a bit of time but seems to be progressing through the objects
>> fine.
>>
> YAY!
>
>> Thank you ever so much for all your help and please do feel free to
>> follow up with us if you would like any further details about our crash!
>>
> Glad to hear it went well. This bug is being worked on with high priority,
> and once the patch is ready, the fix will be backported.
>
> The root cause of this issue is `nowsync` (async dirops) being enabled by
> default in the kernel client [0]. This feature allows asynchronous creation
> and deletion of files, optimizing performance by avoiding round-trip latency
> for these system calls. However, in very rare cases (like yours :D), it can
> affect the system's consistency and stability. If this kind of optimization
> is not a priority for your workload, I recommend turning it off by switching
> the mount points to `wsync` and also setting the MDS config
> `mds_client_delegate_inos_pct` to `0`, so that you don't end up in this
> situation again (until the bug fix arrives :)).
>
> [0]
> https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
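>
> As a rough sketch (adjust the monitor address, mount path, and auth options
> to your environment), switching a kernel mount to synchronous dirops and
> disabling inode delegation would look something like:
>
> # remount the kernel client with wsync instead of the default nowsync
> # (plus your usual name/secretfile options)
> $ mount -t ceph <mon-host>:/ /mnt/cephfs -o name=backup,wsync
> # stop the MDS from delegating preallocated inode ranges to clients
> $ ceph config set mds mds_client_delegate_inos_pct 0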
>
>> Kindest regards,
>>
>> Ivan
>> On 27/06/2024 12:39, Dhairya Parmar wrote:
>>
>> Hi Ivan,
>>
>> The solution (which has been successful for us in the past) is to reset
>> the journal. This would bring the fs back online and return the MDSes to a
>> stable state, but some data would be lost—the data in the journal that
>> hasn't been flushed to the backing store would be gone. Therefore, you
>> should try to flush out as much journal data as possible before resetting
>> the journal.
>>
>> Here are the steps for this entire process:
>>
>> 1) Bring the FS offline
>> $ ceph fs fail <fs_name>
>>
>> 2) Recover dentries from journal (run it with every MDS Rank)
>> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries
>> summary
>>
>> 3) Reset the journal (again with every MDS Rank)
>> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
>>
>> 4) Bring the FS online
>> $ ceph fs set <fs_name> joinable true
>>
>> 5) Restart the MDSes
>>
>> 6) Perform scrub to ensure consistency of fs
>> $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
>> # you could try a recursive scrub maybe `ceph tell mds.<fs_name>:0 scrub
>> start / recursive`
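>>
>> For example, on a filesystem named ceph_backup with a single active rank
>> (rank 0), the whole sequence would look roughly like this (adjust the fs
>> name, and repeat the journal-tool steps for every rank if there is more
>> than one):
>>
>> $ ceph fs fail ceph_backup
>> $ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
>> $ cephfs-journal-tool --rank=ceph_backup:0 journal reset
>> $ ceph fs set ceph_backup joinable true
>> # restart the MDS daemons, then scrub
>> $ ceph tell mds.ceph_backup:0 scrub start / recursive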
>>
>> Some important notes to keep in mind:
>> * Recovering dentries will take time (generally, rank 0 is the most
>> time-consuming, but the rest should be quick).
>> * cephfs-journal-tool and the metadata OSDs are bound to use a significant
>> amount of CPU. This is because cephfs-journal-tool has to read through the
>> journal data and flush it out to the backing store, which also drives a lot
>> of metadata operations, resulting in the OSDs taking a significant
>> percentage of CPU.
>>
>> Do let me know how this goes.
>>
>> On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
>> wrote:
>>
>>> Hi Dhairya,
>>>
>>> We can induce the crash by simply restarting the MDS, and the crash seems
>>> to happen when an MDS goes from up:standby to up:replay. The MDS works
>>> through a few files in the log before eventually crashing. I've included
>>> the logs for this here (this is after I imported the backed-up journal,
>>> which I hope was successful, but please let me know if you suspect it
>>> wasn't!):
>>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
>>>
>>> With respect to the client logs, are you referring to the clients that
>>> are writing to the filesystem? We don't typically run them in any sort of
>>> debug mode, and we have quite a few machines running our backup system,
>>> but we can look an hour or so before the first MDS crash (though I don't
>>> know if this is when the de-sync occurred). Here are some MDS logs from
>>> around the initial crash on Saturday morning which may be helpful:
>>>
>>>    -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
>>>    -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient:
>>> _check_auth_rotating have uptodate secrets (they expire after
>>> 2024-06-22T05:41:13.091556+0100)
>>>    -57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2
>>> Updating MDS map to version 2529650 from mon.3
>>>    -56> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue
>>> operator():  data pool 6 not found in OSDMap
>>>    -55> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue
>>> operator():  data pool 3 not found in OSDMap
>>>    -54> 2024-06-22T05:41:43.209+0100 7f184de84700  5
>>> asok(0x5592e7968000) register_command objecter_requests hook 0x5592e78f8800
>>>    -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
>>> _renew_subs
>>>    -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
>>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
>>>    -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10
>>> log_channel(cluster) update_config to_monitors: true to_syslog: false
>>> syslog_facility:  prio: info to_graylog: false graylog_host: 127.0.0.1
>>> graylog_port: 12201)
>>>    -50> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue
>>> operator():  data pool 6 not found in OSDMap
>>>    -49> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue
>>> operator():  data pool 3 not found in OSDMap
>>>    -48> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.0
>>> apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
>>>    -47> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
>>> handle_mds_map i am now mds.0.2529650
>>>    -46> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
>>> handle_mds_map state change up:standby --> up:replay
>>>    -45> 2024-06-22T05:41:43.209+0100 7f184de84700  5
>>> mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay
>>>    -44> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
>>> replay_start
>>>    -43> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
>>> waiting for osdmap 473739 (which blocklists prior instance)
>>>    -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
>>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
>>>    -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700  2 mds.0.cache Memory
>>> usage:  total 299012, rss 37624, heap 182556, baseline 182556, 0 / 0 inodes
>>> have caps, 0 caps, 0 caps per inode
>>>    -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
>>> _renew_subs
>>>    -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
>>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
>>>    -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
>>> handle_get_version_reply finishing 1 version 473739
>>>    -37> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 0: opening inotable
>>>    -36> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 0: opening sessionmap
>>>    -35> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 0: opening mds log
>>>    -34> 2024-06-22T05:41:43.224+0100 7f1847e78700  5 mds.0.log open
>>> discovering log bounds
>>>    -33> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 0: opening purge queue (async)
>>>    -32> 2024-06-22T05:41:43.224+0100 7f1847e78700  4 mds.0.purge_queue
>>> open: opening
>>>    -31> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
>>> mds.0.journaler.pq(ro) recover start
>>>    -30> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
>>> mds.0.journaler.pq(ro) read_head
>>>    -29> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 0: loading open file table (async)
>>>    -28> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 0: opening snap table
>>>    -27> 2024-06-22T05:41:43.224+0100 7f1847677700  4
>>> mds.0.journalpointer Reading journal pointer '400.00000000'
>>>    -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient:
>>> get_auth_request con 0x5592e8987000 auth_method 0
>>>    -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10 monclient:
>>> get_auth_request con 0x5592e8987c00 auth_method 0
>>>    -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1
>>> mds.0.journaler.pq(ro) _finish_read_head loghead(trim 231160676352, expire
>>> 231163662875, write 231163662875, stream_format 1).  probing for end of log
>>> (from 231163662875)...
>>>    -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1
>>> mds.0.journaler.pq(ro) probing for end of the log
>>>    -22> 2024-06-22T05:41:43.252+0100 7f1847677700  1
>>> mds.0.journaler.mdlog(ro) recover start
>>>    -21> 2024-06-22T05:41:43.252+0100 7f1847677700  1
>>> mds.0.journaler.mdlog(ro) read_head
>>>    -20> 2024-06-22T05:41:43.252+0100 7f1847677700  4 mds.0.log Waiting
>>> for journal 0x200 to recover...
>>>    -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10 monclient:
>>> get_auth_request con 0x5592e8bc6000 auth_method 0
>>>    -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10 monclient:
>>> get_auth_request con 0x5592e8bc6800 auth_method 0
>>>    -17> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
>>> mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 90131453181952,
>>> expire 90131465778558, write 90132009715463, stream_format 1).  probing for
>>> end of log (from 90132009715463)...
>>>    -16> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
>>> mds.0.journaler.mdlog(ro) probing for end of the log
>>>    -15> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
>>> mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 90132019384791
>>> (header had 90132009715463). recovered.
>>>    -14> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log Journal
>>> 0x200 recovered.
>>>    -13> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log Recovered
>>> journal 0x200 in format 1
>>>    -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1
>>> mds.0.journaler.pq(ro) _finish_probe_end write_pos = 231163662875 (header
>>> had 231163662875). recovered.
>>>    -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700  4 mds.0.purge_queue
>>> operator(): open complete
>>>    -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1
>>> mds.0.journaler.pq(ro) set_writeable
>>>     -9> 2024-06-22T05:41:43.441+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 1: loading/discovering base inodes
>>>     -8> 2024-06-22T05:41:43.441+0100 7f1847e78700  0 mds.0.cache
>>> creating system inode with ino:0x100
>>>     -7> 2024-06-22T05:41:43.442+0100 7f1847e78700  0 mds.0.cache
>>> creating system inode with ino:0x1
>>>     -6> 2024-06-22T05:41:43.442+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 2: replaying mds log
>>>     -5> 2024-06-22T05:41:43.442+0100 7f1847e78700  2 mds.0.2529650
>>> Booting: 2: waiting for purge queue recovered
>>>     -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: tick
>>>     -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient:
>>> _check_auth_rotating have uptodate secrets (they expire after
>>> 2024-06-22T05:41:14.091638+0100)
>>>     -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700  2 mds.0.cache Memory
>>> usage:  total 588368, rss 308304, heap 207132, baseline 182556, 0 / 15149
>>> inodes have caps, 0 caps, 0 caps per inode
>>>     -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>>> In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T,
>>> T)>) [with T = inodeno_t; C = std::map]' thread 7f1846675700 time
>>> 2024-06-22T05:41:44.643146+0100
>>>
>>>  ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy
>>> (stable)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x135) [0x7f18568b64a3]
>>>  2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669]
>>>  3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t,
>>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885]
>>>  4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
>>> MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
>>>  5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
>>>  6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
>>>  7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
>>>  8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
>>>  9: clone()
>>>
>>>      0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 *** Caught signal
>>> (Aborted) **
>>>  in thread 7f1846675700 thread_name:md_log_replay
>>>
>>>  ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy
>>> (stable)
>>>  1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0]
>>>  2: gsignal()
>>>  3: abort()
>>>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x18f) [0x7f18568b64fd]
>>>  5: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669]
>>>  6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t,
>>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885]
>>>  7: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
>>> MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
>>>  8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
>>>  9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
>>>  10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
>>>  11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
>>>  12: clone()
>>>
>>> We normally run with a relatively low debug setting, so unfortunately I
>>> don't think many details of the initial crash were captured; the MDS log
>>> entries before the above (i.e. "-60" and older) are just beacon messages
>>> and _check_auth_rotating checks.
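>>>
>>> If it would help, we can raise the MDS debug level before reproducing the
>>> crash so the next attempt captures more detail, along these lines (the
>>> levels shown are just what we would try):
>>>
>>> $ ceph config set mds debug_mds 20
>>> $ ceph config set mds debug_ms 1
>>> # and drop them back down afterwards
>>> $ ceph config set mds debug_mds 1/5
>>> $ ceph config set mds debug_ms 0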
>>>
>>> I was wondering whether you have any recommendations on what actions we
>>> could take to bring our filesystem back into a working state, short of
>>> rebuilding the entire metadata pool? We are quite keen to bring our backup
>>> back into service urgently, as we currently do not have any accessible
>>> backups for our Ceph clusters.
>>>
>>> Kindest regards,
>>>
>>> Ivan
>>> On 25/06/2024 19:18, Dhairya Parmar wrote:
>>>
>>> On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
>>> wrote:
>>>
>>>> Hi Dhairya,
>>>>
>>>> Thank you for your rapid reply. I tried recovering the dentries for the
>>>> file just before the crash I mentioned before and then splicing the
>>>> transactions from the journal which seemed to remove that issue for that
>>>> inode but resulted in the MDS crashing on the next inode in the
>>>> journal when performing replay.
>>>>
>>> The MDS delegates a range of preallocated inodes (in the form of a set,
>>> interval_set<inodeno_t> preallocated_inos) to the clients, so the untracked
>>> portion can be a single inode, several inodes from the range, or in the
>>> worst case all of them. This is something that even `cephfs-journal-tool`
>>> would not be able to tell you, since we're talking about MDS internals
>>> which aren't exposed to such tools. That is why you see the MDS crashing
>>> on the next inode in the journal when performing replay.
>>>
>>> An option could be to expose the inode set through some tool or asok
>>> command to identify such inode ranges, but that needs to be discussed. For
>>> now, we're trying to address this in [0]; you can follow the discussion
>>> there.
>>>
>>> [0] https://tracker.ceph.com/issues/66251
>>>
>>>> Removing from the journal all the transactions involving the directory
>>>> housing the files that seemed to cause these crashes only caused the MDS
>>>> to fail to even start replay.
>>>>
>>>> I've rolled back our journal to the original version from when the crash
>>>> first happened and the entire MDS log for the crash can be found here:
>>>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
>>>>
>>> Awesome, this would help us a ton. Apart from this, would it be possible
>>> to send us client logs?
>>>
>>>> Please let us know if you would like any other logs file as we can
>>>> easily induce this crash.
>>>>
>>> Since you can easily induce the crash, can you please share the
>>> reproducer, i.e. what actions you take in order to hit this?
>>>
>>>> Kindest regards,
>>>>
>>>> Ivan
>>>> On 25/06/2024 09:58, Dhairya Parmar wrote:
>>>>
>>>> Hi Ivan,
>>>>
>>>> This looks to be similar to the issue [0] that we're already addressing
>>>> at [1]. Basically, some out-of-sync event led the client to make use of
>>>> inodes that the MDS wasn't aware of/isn't tracking, hence the crash. It'd
>>>> be really helpful if you could provide us with more logs.
>>>>
>>>> CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar
>>>> <vshankar@xxxxxxxxxx> @Patrick Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li
>>>> <xiubli@xxxxxxxxxx>
>>>>
>>>> [0] https://tracker.ceph.com/issues/61009
>>>> [1] https://tracker.ceph.com/issues/66251
>>>> --
>>>> *Dhairya Parmar*
>>>>
>>>> Associate Software Engineer, CephFS
>>>>
>>>> <https://www.redhat.com/>IBM, Inc.
>>>>
>>>> On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We have been experiencing a serious issue with our CephFS backup cluster
>>>>> running quincy (version 17.2.7) on a RHEL8-derivative distribution
>>>>> (Alma8.9, 4.18.0-513.9.1 kernel): the MDSes for our filesystem are
>>>>> constantly in a "replay" or "replay(laggy)" state and keep crashing.
>>>>>
>>>>> We have a single-MDS filesystem called "ceph_backup" with 2 standby
>>>>> MDSes, along with a 2nd, largely unused filesystem "ceph_archive" (this
>>>>> holds little to no data). We use the "ceph_backup" filesystem to back up
>>>>> our data, and it is the one which is currently broken. The Ceph health
>>>>> outputs currently are:
>>>>>
>>>>>     root@pebbles-s1 14:05 [~]: ceph -s
>>>>>        cluster:
>>>>>          id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>>>>>          health: HEALTH_WARN
>>>>>                  1 filesystem is degraded
>>>>>                  insufficient standby MDS daemons available
>>>>>                  1319 pgs not deep-scrubbed in time
>>>>>                  1054 pgs not scrubbed in time
>>>>>
>>>>>        services:
>>>>>          mon: 4 daemons, quorum
>>>>>     pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
>>>>>          mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4,
>>>>>     pebbles-s3, pebbles-s1
>>>>>          mds: 2/2 daemons up
>>>>>          osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37
>>>>>     remapped pgs
>>>>>
>>>>>        data:
>>>>>          volumes: 1/2 healthy, 1 recovering
>>>>>          pools:   7 pools, 2177 pgs
>>>>>          objects: 3.55G objects, 7.0 PiB
>>>>>          usage:   8.9 PiB used, 14 PiB / 23 PiB avail
>>>>>          pgs:     83133528/30006841533 objects misplaced (0.277%)
>>>>>                   2090 active+clean
>>>>>                   47   active+clean+scrubbing+deep
>>>>>                   29   active+remapped+backfilling
>>>>>                   8    active+remapped+backfill_wait
>>>>>                   2    active+clean+scrubbing
>>>>>                   1    active+clean+snaptrim
>>>>>
>>>>>        io:
>>>>>          recovery: 1.9 GiB/s, 719 objects/s
>>>>>
>>>>>     root@pebbles-s1 14:09 [~]: ceph fs status
>>>>>     ceph_backup - 0 clients
>>>>>     ===========
>>>>>     RANK      STATE         MDS      ACTIVITY   DNS   INOS   DIRS   CAPS
>>>>>       0    replay(laggy)  pebbles-s3               0      0      0      0
>>>>>              POOL            TYPE     USED  AVAIL
>>>>>         mds_backup_fs      metadata  1255G  2780G
>>>>>     ec82_primary_fs_data    data       0   2780G
>>>>>            ec82pool          data    8442T  3044T
>>>>>     ceph_archive - 2 clients
>>>>>     ============
>>>>>     RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
>>>>>       0    active  pebbles-s2  Reqs:    0 /s  13.4k  7105    118      2
>>>>>              POOL            TYPE     USED  AVAIL
>>>>>         mds_archive_fs     metadata  5184M  2780G
>>>>>     ec83_primary_fs_data    data       0   2780G
>>>>>            ec83pool          data     138T  2767T
>>>>>     MDS version: ceph version 17.2.7
>>>>>     (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>>>>>     root@pebbles-s1 14:09 [~]: ceph health detail | head
>>>>>     HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS
>>>>>     daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not
>>>>>     scrubbed in time
>>>>>     [WRN] FS_DEGRADED: 1 filesystem is degraded
>>>>>          fs ceph_backup is degraded
>>>>>     [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons
>>>>>     available
>>>>>          have 0; want 1 more
>>>>>
>>>>> When our cluster first ran after a reboot, Ceph ran through the 2
>>>>> standby MDSes, crashing them all, until it reached the final MDS and
>>>>> is
>>>>> now stuck in this "replay(laggy)" state. Putting our MDSes into
>>>>> debugging mode, we can see that this MDS crashed when replaying the
>>>>> journal for a particular inode (this is the same for all the MDSes and
>>>>> they all crash on the same object):
>>>>>
>>>>>     ...
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>>>>     EMetaBlob.replay for [521,head] had [inode 0x1005ba89481
>>>>>     [...539,head]
>>>>>
>>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
>>>>>     auth fragtree_t(*^2 00*^3 00000*^
>>>>>     4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4
>>>>>     01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4
>>>>> 01110*^4
>>>>>     01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4
>>>>> 10101*^3
>>>>>     10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22
>>>>>     T05:41:10.213700+0100 1281276=1281276+0) n(v12
>>>>>     rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1)
>>>>>     old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>>>>     EMetaBlob.replay dir 0x1005ba89481.011011000*
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>>>>     EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000*
>>>>>
>>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
>>>>>     [2,head] auth v=436385 cv=0/0 state=107374182
>>>>>     4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12
>>>>>     rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0)
>>>>>     hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>>>>     EMetaBlob.replay added (full) [dentry
>>>>>
>>>>> #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti
>>>>>     ons_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion
>>>>>     lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1
>>>>>     0x557710444500]
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>>>>     EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head]
>>>>>
>>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_
>>>>>     005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0
>>>>> n(v0
>>>>>     1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680]
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10
>>>>>     mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>>>>     EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35
>>>>> [539,head]
>>>>>
>>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762
>>>>>     6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth
>>>>>     v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
>>>>>     cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1
>>>>> 0x557710438680]
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>>>>     EMetaBlob.replay inotable tablev 3112837 <= table 3112837
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>>>>     EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc
>>>>>     [] used 0x1005cd4fe35
>>>>>     2024-06-24T13:44:55.563+0100 7f8811c40700 -1
>>>>>
>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>>>>>     I
>>>>>     n function 'void interval_set<T, C>::erase(T, T,
>>>>>     std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
>>>>>     thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
>>>>>
>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>>>>>     568: FAILED ceph_assert(p->first <= start)
>>>>>
>>>>>       ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2)
>>>>>     quincy (stable)
>>>>>       1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>>     const*)+0x135) [0x7f8821e814a3]
>>>>>       2: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
>>>>> [0x7f8821e81669]
>>>>>       3: (interval_set<inodeno_t, std::map>::erase(inodeno_t,
>>>>> inodeno_t,
>>>>>     std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
>>>>>       4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
>>>>>     MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
>>>>>       5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
>>>>>       6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
>>>>>       7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
>>>>>       8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
>>>>>       9: clone()
>>>>>
>>>>> I've only included a short section of the crash (this is the first
>>>>> trace of the crash in the log, captured with a 10/20 debug_mds option).
>>>>> We tried deleting the 0x1005cd4fe35 object from the object store using
>>>>> the "rados" command, but this did not allow our MDS to successfully
>>>>> replay.
>>>>>
>>>>> From my understanding, the journal seems okay as we didn't, for example,
>>>>> run out of space on our metadata pool, and "cephfs-journal-tool journal
>>>>> inspect" doesn't seem to think there is any damage:
>>>>>
>>>>>     root@pebbles-s1 13:58 [~]: cephfs-journal-tool
>>>>> --rank=ceph_backup:0
>>>>>     journal inspect
>>>>>     Overall journal integrity: OK
>>>>>     root@pebbles-s1 14:04 [~]: cephfs-journal-tool
>>>>> --rank=ceph_backup:0
>>>>>     event get --inode 1101069090357 summary
>>>>>     Events by type:
>>>>>        OPEN: 1
>>>>>        UPDATE: 3
>>>>>     Errors: 0
>>>>>     root@pebbles-s1 14:05 [~]: cephfs-journal-tool
>>>>> --rank=ceph_backup:0
>>>>>     event get --inode 1101069090357 list
>>>>>     2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE:  (openc)
>>>>>
>>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>     2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE:
>>>>>     (check_inode_max_size)
>>>>>
>>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>
>>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>     2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN:  ()
>>>>>
>>>>> FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>     2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE:  (rename)
>>>>>
>>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>
>>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
>>>>>
>>>>> I was wondering whether anyone had any advice for us on how we should
>>>>> proceed? We were thinking about manually applying these events (via
>>>>> "event apply") or, failing that, erasing this problematic event with
>>>>> "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode
>>>>> 1101069090357". Is this a good idea? We would rather not rebuild the
>>>>> entire metadata pool if we could avoid it (once was enough for us) as
>>>>> this cluster has ~9 PB of data on it.
>>>>>
>>>>> Kindest regards,
>>>>>
>>>>> Ivan Clayson
>>>>>
>>>>> --
>>>>> Ivan Clayson
>>>>> -----------------
>>>>> Scientific Computing Officer
>>>>> Room 2N249
>>>>> Structural Studies
>>>>> MRC Laboratory of Molecular Biology
>>>>> Francis Crick Ave, Cambridge
>>>>> CB2 0QH
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>
>>>> --
>>>> Ivan Clayson
>>>> -----------------
>>>> Scientific Computing Officer
>>>> Room 2N249
>>>> Structural Studies
>>>> MRC Laboratory of Molecular Biology
>>>> Francis Crick Ave, Cambridge
>>>> CB2 0QH
>>>>
>>>> --
>>> Ivan Clayson
>>> -----------------
>>> Scientific Computing Officer
>>> Room 2N249
>>> Structural Studies
>>> MRC Laboratory of Molecular Biology
>>> Francis Crick Ave, Cambridge
>>> CB2 0QH
>>>
>>> --
>> Ivan Clayson
>> -----------------
>> Scientific Computing Officer
>> Room 2N249
>> Structural Studies
>> MRC Laboratory of Molecular Biology
>> Francis Crick Ave, Cambridge
>> CB2 0QH
>>
>> --
> Ivan Clayson
> -----------------
> Scientific Computing Officer
> Room 2N249
> Structural Studies
> MRC Laboratory of Molecular Biology
> Francis Crick Ave, Cambridge
> CB2 0QH
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



