CephFS health warnings after deleting millions of files

A rogue process wrote 38M files into a single CephFS directory, and it took
about a month to delete them all. We had to increase MDS cache sizes to
handle the increased file volume, but we have since been able to reduce all
of our settings back to their defaults.
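
For anyone who hits the same thing later, the bump and revert were roughly
along these lines (the value is illustrative, not the exact number we used):

$ sudo ceph config set mds mds_cache_memory_limit 17179869184  # temporary bump, example value (16 GiB)
$ sudo ceph config rm mds mds_cache_memory_limit               # back to the default once the cleanup finished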

The Ceph cluster is on 15.2.11. CephFS clients are ceph-fuse, either version
14.2.16 or 15.2.11 depending on whether they have been upgraded yet. Nothing
has changed in the last ~6 months with regard to client versions or the
cluster version.

Now that things seem to be cleaned up, we are dealing with 2 remaining
issues.

1. MDSs report slow requests. [1] Dumping the blocked requests gives the same
output for all of them: they seemingly get stuck AFTER the "acquired locks"
event. I can't find any information about what happens after that point or
why requests are getting stuck there. (A sketch of the extra data we could
collect is below, after item 2.)

2. Clients failing to advance their oldest client/flush tid. There are 2
clients that are the worst offenders here, but a few other clients are
hitting the same issue. All of the clients with this issue are on 14.2.16,
but we also have a hundred clients on the same version that don't hit it at
all. [2] The logs make it look like the clients just end up with a bad
integer/pointer somehow. We can clear the error by remounting the filesystem
or rebooting the server, but these 2 clients in particular keep ending up in
this state. No other repeat offenders yet, but we have had 4 other servers
in this state over the last couple of weeks. (Client-side checks are
included in the sketch below as well.)
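
This is the sort of additional data we could collect, in case it points
anywhere useful. For issue 1, on the active MDS (mds.mon1 as in [1]):

$ sudo ceph daemon mds.mon1 dump_ops_in_flight  # every in-flight request, not just the ones flagged as blocked
$ sudo ceph daemon mds.mon1 dump_historic_ops   # recently completed slow requests with their event timelines
$ sudo ceph daemon mds.mon1 objecter_requests   # whether the MDS itself is waiting on RADOS operations

For issue 2, on the MDS and on one of the affected ceph-fuse mounts (the
admin socket path below is a placeholder; the real name includes the
client's pid):

$ sudo ceph daemon mds.mon1 session ls  # per-client session state as the MDS sees it
$ sudo ceph daemon /var/run/ceph/ceph-client.admin.12345.asok mds_sessions  # the client's view of its MDS sessions
$ sudo ceph daemon /var/run/ceph/ceph-client.admin.12345.asok mds_requests  # requests the client still considers outstanding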

Are there any ideas about what the next steps might be for diagnosing either
of these issues? Thank you.

-David Turner



[1] $ sudo ceph daemon mds.mon1 dump_blocked_ops
{
    "ops": [
        {
            "description": "client_request(client.17709580:39254 open
#0x10001c99cd4 2022-02-22T16:25:40.231547+0000 caller_uid=0,
caller_gid=0{})",
            "initiated_at": "2022-04-19T19:07:10.663552+0000",
            "age": 90.920778446,
            "duration": 90.920806244000005,
            "type_data": {
                "flag_point": "acquired locks",
                "reqid": "client.17709580:39254",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.17709580",
                    "tid": 39254
                },
                "events": [
                    {
                        "time": "2022-04-19T19:07:10.663552+0000",
                        "event": "initiated"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663549+0000",
                        "event": "throttled"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663552+0000",
                        "event": "header_read"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663555+0000",
                        "event": "all_read"
                    },
                    {
                        "time": "2022-04-19T19:07:10.665744+0000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2022-04-19T19:07:10.773894+0000",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2022-04-19T19:07:10.807249+0000",
                        "event": "acquired locks"
                    }
                ]
            }
        },


[2] 2022-04-19 06:15:36.108 7fb28b7fe700  0 client.30095002
handle_cap_flush_ack mds.1 got unexpected flush ack tid 338611 expected is 0