MDS stuck in stopping state

Hi all,

I needed to reduce the number of active MDS daemons from 4 to 1. Unfortunately, the last MDS to stop (rank 1) is stuck in the stopping state. The Ceph version is mimic 13.2.10. Both MDSs that still hold a rank have 3 blocked ops each, which seem to be related to deleted snapshots; more info below. I have already failed the MDS in stopping state several times in the hope that the operations would get flushed out. Before failing rank 0, I would appreciate it if someone could look at this issue and advise on how to proceed safely.

Some diagnostic info:

# ceph fs status
con-fs2 - 1659 clients
=======
+------+----------+---------+---------------+-------+-------+
| Rank |  State   |   MDS   |    Activity   |  dns  |  inos |
+------+----------+---------+---------------+-------+-------+
|  0   |  active  | ceph-08 | Reqs:  176 /s | 2844k | 2775k |
|  1   | stopping | ceph-17 |               | 27.7k |   59  |
+------+----------+---------+---------------+-------+-------+
+---------------------+----------+-------+-------+
|         Pool        |   type   |  used | avail |
+---------------------+----------+-------+-------+
|    con-fs2-meta1    | metadata |  555M | 1261G |
|    con-fs2-meta2    |   data   |    0  | 1261G |
|     con-fs2-data    |   data   | 1321T | 5756T |
| con-fs2-data-ec-ssd |   data   |  252G | 4035G |
|    con-fs2-data2    |   data   |  389T | 5233T |
+---------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ceph-09   |
|   ceph-24   |
|   ceph-14   |
|   ceph-16   |
|   ceph-12   |
|   ceph-23   |
|   ceph-10   |
|   ceph-15   |
|   ceph-13   |
|   ceph-11   |
+-------------+
MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)

# ceph status
  cluster:
    id:     
    health: HEALTH_WARN
            2 MDSs report slow requests
 
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26
    mgr: ceph-01(active), standbys: ceph-02, ceph-03, ceph-25, ceph-26
    mds: con-fs2-2/2/1 up  {0=ceph-08=up:active,1=ceph-17=up:stopping}, 10 up:standby
    osd: 1051 osds: 1050 up, 1050 in
 
  data:
    pools:   13 pools, 17374 pgs
    objects: 1.01 G objects, 1.9 PiB
    usage:   2.3 PiB used, 9.2 PiB / 11 PiB avail
    pgs:     17352 active+clean
             20    active+clean+scrubbing+deep
             2     active+clean+scrubbing
 
  io:
    client:   129 MiB/s rd, 175 MiB/s wr, 2.57 kop/s rd, 2.77 kop/s wr

# ceph health detail
HEALTH_WARN 2 MDSs report slow requests
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-08(mds.0): 3 slow requests are blocked > 30 secs
    mdsceph-17(mds.1): 3 slow requests are blocked > 30 secs

# ssh ceph-08 ceph daemon mds.ceph-08 dump_blocked_ops
{
    "ops": [
        {
            "description": "client_request(mds.1:126521 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2021-12-13 13:08:59.430597",
            "age": 5034.983083,
            "duration": 5034.983109,
            "type_data": {
                "flag_point": "acquired locks",
                "reqid": "mds.1:126521",
                "op_type": "client_request",
                "client_info": {
                    "client": "mds.1",
                    "tid": 126521
                },
                "events": [
                    {
                        "time": "2021-12-13 13:08:59.430597",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-12-13 13:08:59.430597",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-12-13 13:08:59.430597",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-12-13 13:08:59.430601",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-12-13 13:09:00.730197",
                        "event": "dispatched"
                    },
                    {
                        "time": "2021-12-13 13:09:01.517306",
                        "event": "requesting remote authpins"
                    },
                    {
                        "time": "2021-12-13 13:09:01.557219",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2021-12-13 13:09:01.647692",
                        "event": "failed to wrlock, waiting"
                    },
                    {
                        "time": "2021-12-13 13:09:01.663629",
                        "event": "waiting for remote wrlocks"
                    },
                    {
                        "time": "2021-12-13 13:09:01.673789",
                        "event": "waiting for remote wrlocks"
                    },
                    {
                        "time": "2021-12-13 13:09:01.676523",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2021-12-13 13:09:01.691962",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2021-12-13 13:09:01.704202",
                        "event": "acquired locks"
                    }
                ]
            }
        },
        {
            "description": "client_request(mds.1:1 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2021-12-13 13:31:56.260453",
            "age": 3658.153227,
            "duration": 3658.153337,
            "type_data": {
                "flag_point": "requesting remote authpins",
                "reqid": "mds.1:1",
                "op_type": "client_request",
                "client_info": {
                    "client": "mds.1",
                    "tid": 1
                },
                "events": [
                    {
                        "time": "2021-12-13 13:31:56.260453",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-12-13 13:31:56.260453",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-12-13 13:31:56.260454",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-12-13 13:31:56.260461",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-12-13 13:31:56.260511",
                        "event": "dispatched"
                    },
                    {
                        "time": "2021-12-13 13:31:56.260604",
                        "event": "requesting remote authpins"
                    }
                ]
            }
        },
        {
            "description": "client_request(mds.1:993 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2021-12-13 13:15:31.979997",
            "age": 4642.433683,
            "duration": 4642.433850,
            "type_data": {
                "flag_point": "requesting remote authpins",
                "reqid": "mds.1:993",
                "op_type": "client_request",
                "client_info": {
                    "client": "mds.1",
                    "tid": 993
                },
                "events": [
                    {
                        "time": "2021-12-13 13:15:31.979997",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-12-13 13:15:31.979997",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-12-13 13:15:31.979998",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-12-13 13:15:31.980003",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-12-13 13:15:31.980079",
                        "event": "dispatched"
                    },
                    {
                        "time": "2021-12-13 13:15:31.980174",
                        "event": "requesting remote authpins"
                    },
                    {
                        "time": "2021-12-13 13:31:50.634734",
                        "event": "requesting remote authpins"
                    }
                ]
            }
        }
    ],
    "complaint_time": 30.000000,
    "num_blocked_ops": 3
}
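
For readers who want a quick overview of a dump like the one above, here is a minimal Python sketch that reduces `dump_blocked_ops` JSON to one line per op (request id, age, flag point, time of the last event). The function name `summarize_blocked_ops` and the `SAMPLE` string are illustrative assumptions; `SAMPLE` is an abbreviated excerpt of the ceph-08 output above, keeping only the last event of each op.

```python
import json

# Abbreviated excerpt of the ceph-08 dump above (only the last event per op is kept).
SAMPLE = """
{
  "ops": [
    {"age": 5034.983083,
     "type_data": {"flag_point": "acquired locks", "reqid": "mds.1:126521",
                   "events": [{"time": "2021-12-13 13:09:01.704202", "event": "acquired locks"}]}},
    {"age": 3658.153227,
     "type_data": {"flag_point": "requesting remote authpins", "reqid": "mds.1:1",
                   "events": [{"time": "2021-12-13 13:31:56.260604", "event": "requesting remote authpins"}]}},
    {"age": 4642.433683,
     "type_data": {"flag_point": "requesting remote authpins", "reqid": "mds.1:993",
                   "events": [{"time": "2021-12-13 13:31:50.634734", "event": "requesting remote authpins"}]}}
  ],
  "complaint_time": 30.0,
  "num_blocked_ops": 3
}
"""


def summarize_blocked_ops(dump_text):
    """Return (reqid, flag_point, age_seconds, last_event_time) per op, oldest first."""
    dump = json.loads(dump_text)
    rows = []
    for op in dump.get("ops", []):
        type_data = op.get("type_data", {})
        events = type_data.get("events", [])
        last_time = events[-1]["time"] if events else ""
        rows.append((
            type_data.get("reqid", "?"),
            type_data.get("flag_point", "?"),
            float(op.get("age", 0.0)),
            last_time,
        ))
    # Sort by age, largest first, so the longest-blocked op is listed at the top.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


for reqid, flag, age, last_time in summarize_blocked_ops(SAMPLE):
    print(f"{reqid:<14} {age:>9.1f}s  {flag}  (last event at {last_time})")
```

With a real cluster, the full JSON text from `ceph daemon mds.<name> dump_blocked_ops` could be passed to the function in place of `SAMPLE`.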

# ssh ceph-17 ceph daemon mds.ceph-17 dump_blocked_ops
{
    "ops": [
        {
            "description": "rejoin:mds.1:126521",
            "initiated_at": "2021-12-13 13:30:34.602931",
            "age": 3791.164314,
            "duration": 3791.164335,
            "type_data": {
                "flag_point": "dispatched",
                "reqid": "mds.1:126521",
                "op_type": "no_available_op_found",
                "events": [
                    {
                        "time": "2021-12-13 13:30:34.602931",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-12-13 13:30:34.602931",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-12-13 13:30:34.602932",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-12-13 13:30:34.602978",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-12-13 13:30:34.605856",
                        "event": "dispatched"
                    }
                ]
            }
        },
        {
            "description": "slave_request(mds.1:993.0 authpin)",
            "initiated_at": "2021-12-13 13:31:50.634857",
            "age": 3715.132388,
            "duration": 3715.132451,
            "type_data": {
                "flag_point": "dispatched",
                "reqid": "mds.1:993",
                "op_type": "slave_request",
                "master_info": {
                    "master": "mds.0"
                },
                "request_info": {
                    "attempt": 0,
                    "op_type": "authpin",
                    "lock_type": 0,
                    "object_info": "0x1000eec35f7.head",
                    "srcdnpath": "",
                    "destdnpath": "",
                    "witnesses": "",
                    "has_inode_export": false,
                    "inode_export_v": 0,
                    "op_stamp": "0.000000"
                },
                "events": [
                    {
                        "time": "2021-12-13 13:31:50.634857",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-12-13 13:31:50.634857",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-12-13 13:31:50.634858",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-12-13 13:31:50.634867",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-12-13 13:31:50.634893",
                        "event": "dispatched"
                    }
                ]
            }
        },
        {
            "description": "slave_request(mds.1:1.0 authpin)",
            "initiated_at": "2021-12-13 13:31:56.260729",
            "age": 3709.506516,
            "duration": 3709.506631,
            "type_data": {
                "flag_point": "dispatched",
                "reqid": "mds.1:1",
                "op_type": "slave_request",
                "master_info": {
                    "master": "mds.0"
                },
                "request_info": {
                    "attempt": 0,
                    "op_type": "authpin",
                    "lock_type": 0,
                    "object_info": "0x1000eec35f7.head",
                    "srcdnpath": "",
                    "destdnpath": "",
                    "witnesses": "",
                    "has_inode_export": false,
                    "inode_export_v": 0,
                    "op_stamp": "0.000000"
                },
                "events": [
                    {
                        "time": "2021-12-13 13:31:56.260729",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-12-13 13:31:56.260729",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-12-13 13:31:56.260731",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-12-13 13:31:56.260743",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-12-13 13:31:56.264063",
                        "event": "dispatched"
                    }
                ]
            }
        }
    ],
    "complaint_time": 30.000000,
    "num_blocked_ops": 3
}
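
The two dumps above share the same three request ids (`mds.1:126521`, `mds.1:993`, `mds.1:1`). As a rough sketch of how that pairing could be checked mechanically, the Python snippet below matches blocked ops from two MDS dumps by `reqid`. The function names and the trimmed `CEPH_08` / `CEPH_17` dictionaries are illustrative assumptions; they carry only the fields needed here, copied from the output above.

```python
# Trimmed copies of the two dumps above: only reqid, flag_point and description are kept.
CEPH_08 = {"ops": [
    {"description": "client_request(mds.1:126521 rename ...)",
     "type_data": {"reqid": "mds.1:126521", "flag_point": "acquired locks"}},
    {"description": "client_request(mds.1:1 rename ...)",
     "type_data": {"reqid": "mds.1:1", "flag_point": "requesting remote authpins"}},
    {"description": "client_request(mds.1:993 rename ...)",
     "type_data": {"reqid": "mds.1:993", "flag_point": "requesting remote authpins"}},
]}
CEPH_17 = {"ops": [
    {"description": "rejoin:mds.1:126521",
     "type_data": {"reqid": "mds.1:126521", "flag_point": "dispatched"}},
    {"description": "slave_request(mds.1:993.0 authpin)",
     "type_data": {"reqid": "mds.1:993", "flag_point": "dispatched"}},
    {"description": "slave_request(mds.1:1.0 authpin)",
     "type_data": {"reqid": "mds.1:1", "flag_point": "dispatched"}},
]}


def index_by_reqid(dump):
    """Map each op's reqid to the op itself."""
    return {op["type_data"]["reqid"]: op for op in dump.get("ops", [])}


def match_blocked_ops(dump_a, dump_b):
    """For reqids present in both dumps, return {reqid: (flag_point in A, description in B)}."""
    ops_a, ops_b = index_by_reqid(dump_a), index_by_reqid(dump_b)
    shared = sorted(ops_a.keys() & ops_b.keys())
    return {
        reqid: (ops_a[reqid]["type_data"]["flag_point"], ops_b[reqid]["description"])
        for reqid in shared
    }


for reqid, (flag_a, desc_b) in match_blocked_ops(CEPH_08, CEPH_17).items():
    print(f"{reqid}: mds.0 at '{flag_a}' <-> mds.1 has '{desc_b}'")
```

This only pairs up entries by id; it does not say anything about why the requests are blocked.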

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14