Re: mds slow request, getattr currently failed to rdlock. Kraken with Bluestore


 



Networking is 10GbE. I notice the recovery IO rate is wildly variable; I assume that's normal.

There is very little load, as this has yet to go into production; I was "seeing what it would handle" at the time it broke.

I checked this morning and the slow request was gone, and I could access the blocked file again.

All hosts run Ubuntu 16.04.1 with the stock 4.4.0-72-generic kernel, and there were two CephFS clients accessing it, also on 16.04.1.

Ceph is 11.2.0 everywhere, installed from the debian-kraken repos at download.ceph.com. All OSDs are BlueStore.
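For reference, the MDS admin socket can list the connected client sessions and their reported metadata; something like this (the daemon name stor-vm4 comes from the fsmap in the status output below) should show them:

    # List connected CephFS client sessions and their reported metadata
    ceph daemon mds.stor-vm4 session ls

    # Cross-check any remaining slow-request warnings cluster-wide
    ceph health detail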


As of now everything is okay, so I don't want to waste anyone's time on a wild goose chase.




On Wed, May 24, 2017 at 6:15 AM, John Spray <jspray@xxxxxxxxxx> wrote:
On Tue, May 23, 2017 at 11:41 PM, Daniel K <sathackr@xxxxxxxxx> wrote:
> I have a 20 OSD cluster ("my first ceph cluster") that has another 400 OSDs
> en route.
>
> I was "beating up" on the cluster, and had been writing to a 6TB file in
> CephFS for several hours, during which I changed the crushmap to better
> match my environment, generating a bunch of recovery IO. After about 5.8TB
> written, one of the OSD hosts (which is also a MON, soon to be rectified)
> crashed with 5 OSDs on it, and after rebooting I have this in ceph -s
> (the degraded/misplaced warnings are likely because the cluster hasn't
> completed rebalancing after I changed the crushmap, I assume):
>

Losing a quarter of your OSDs while simultaneously rebalancing after
editing your CRUSH map is a brutal thing to do to a Ceph cluster, and
I would expect it to impact your client IO severely.

I see that you've got 112MB/s of recovery going on, which may or may
not be saturating some links depending on whether you're using 1gig or
10gig networking.
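If the recovery traffic does end up contending with client IO, it can be
throttled while the backfill finishes (assuming the usual OSD option names;
injected values do not persist across OSD restarts), e.g.:

    # Limit concurrent backfill and recovery ops per OSD at runtime
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'

    # Watch the effect on recovery and client IO
    ceph -w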

> 2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore
> 2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore
>     cluster e92e20ca-0fe6-4012-86cc-aa51e0466661
>      health HEALTH_WARN
>             440 pgs backfill_wait
>             7 pgs backfilling
>             85 pgs degraded
>             5 pgs recovery_wait
>             85 pgs stuck degraded
>             452 pgs stuck unclean
>             77 pgs stuck undersized
>             77 pgs undersized
>             recovery 196526/3554278 objects degraded (5.529%)
>             recovery 1690392/3554278 objects misplaced (47.559%)
>             mds0: 1 slow requests are blocked > 30 sec
>      monmap e4: 3 mons at
> {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0,stor-vm3=10.0.15.53:6789/0}
>             election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3
>       fsmap e21: 1/1/1 up {0=stor-vm4=up:active}
>         mgr active: stor-vm1 standbys: stor-vm2
>      osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs
>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>       pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects
>             11041 GB used, 16901 GB / 27943 GB avail
>             196526/3554278 objects degraded (5.529%)
>             1690392/3554278 objects misplaced (47.559%)
>                  975 active+clean
>                  364 active+remapped+backfill_wait
>                   76 active+undersized+degraded+remapped+backfill_wait
>                    3 active+recovery_wait+degraded+remapped
>                    3 active+remapped+backfilling
>                    3 active+degraded+remapped+backfilling
>                    2 active+recovery_wait+degraded
>                    1 active+clean+scrubbing+deep
>                    1 active+undersized+degraded+remapped+backfilling
> recovery io 112 MB/s, 28 objects/s
>
>
> Seems related to the "corrupted rbd filesystems since jewel" thread.
>
>
> log entries on the MDS server:
>
> 2017-05-23 18:27:12.966218 7f95ed6c0700  0 log_channel(cluster) log [WRN] :
> slow request 243.113407 seconds old, received at 2017-05-23 18:23:09.852729:
> client_request(client.204100:5 getattr pAsLsXsFs #100000003ec 2017-05-23
> 17:48:23.770852 RETRY=2 caller_uid=0, caller_gid=0{}) currently failed to
> rdlock, waiting
>
>
> output of ceph daemon mds.stor-vm4 objecter_requests (it changes each time
> I run it)

If that changes each time you run it, then the OSD requests from the
MDS are going through.

However, it's possible that you have multiple clients and one of them
is stuck trying to write something back (to a PG that is not accepting
the write yet?), thereby preventing the MDS from granting a lock to
another client.
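One way to narrow that down, using the object and OSD from the objecter
output below (the pool name here is a placeholder for whatever your CephFS
data pool is actually called), is to map the blocked object to its PG and
acting set and then look at what that PG and its primary OSD are doing:

    # Map the blocked object to its PG and acting OSD set
    ceph osd map cephfs_data 100000003ec.003efb9f

    # Query the state of that PG (use the pgid printed by the command above)
    ceph pg <pgid> query

    # Check whether the op is stuck in flight on the primary OSD
    ceph daemon osd.4 dump_ops_in_flight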

What clients (+versions) are involved, what's the workload, what
versions of Ceph?

John

> :
> root@stor-vm4:/var/log/ceph# ceph daemon mds.stor-vm4 objecter_requests
> {
>     "ops": [
>         {
>             "tid": 66700,
>             "pg": "1.60e95c32",
>             "osd": 4,
>             "object_id": "100000003ec.003efb9f",
>             "object_locator": "@1",
>             "target_object_id": "100000003ec.003efb9f",
>             "target_object_locator": "@1",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "last_sent": "1.47461e+06s",
>             "attempts": 1,
>             "snapid": "head",
>             "snap_context": "0=[]",
>             "mtime": "1969-12-31 19:00:00.000000s",
>             "osd_ops": [
>                 "stat"
>             ]
>         }
>     ],
>     "linger_ops": [],
>     "pool_ops": [],
>     "pool_stat_ops": [],
>     "statfs_ops": [],
>     "command_ops": []
> }
>
>
> I've tried restarting the MDS daemon (systemctl stop ceph-mds\*.service
> ceph-mds.target && systemctl start ceph-mds\*.service ceph-mds.target).
>
>
>
> IO to the file that was being accessed when the host crashed is blocked.
>
>
> Suggestions?
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
