Yes -- the crashed server also mounted cephfs as a client, and also likely had active writes to the file when it crashed.
I have the CephFS max file size set to 17,592,186,044,416 bytes (16 TiB) -- but this file was only about 5.8TB.
The likely reason for the crash? The file was being used as a fileio backstore in LIO, which was exported as an FC LUN that I had connected to an ESXi server and mapped via RDM to a guest, in which I had dd if=/dev/zero of=/dev/sdb bs=1M count=6000000000000 running (for several hours).
Which I think was breaking at least 3 "don't do this" rules with Ceph. Once it moves into production the pieces will be separated.
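For context, a stack like that is typically wired up with targetcli roughly as below; the file path, backstore name, and WWN here are placeholders rather than the exact commands used in this setup:

    # create a fileio backstore on top of a file living in CephFS
    targetcli /backstores/fileio create name=cephfs_file file_or_dev=/mnt/cephfs/testfile size=6T
    # expose it through the qla2xxx FC fabric and map it as a LUN
    targetcli /qla2xxx create naa.2100001b32000000
    targetcli /qla2xxx/naa.2100001b32000000/luns create /backstores/fileio/cephfs_file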
On Wed, May 24, 2017 at 4:55 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Wed, May 24, 2017 at 3:15 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Tue, May 23, 2017 at 11:41 PM, Daniel K <sathackr@xxxxxxxxx> wrote:
>> Have a 20-OSD cluster -- "my first Ceph cluster" -- that has another 400
>> OSDs en route.
>>
>> I was "beating up" on the cluster, and had been writing to a 6TB file in
>> CephFS for several hours, during which I changed the crushmap to better
>> match my environment, generating a bunch of recovery IO. After about 5.8TB
>> written, one of the OSD hosts (which is also a MON -- soon to be rectified)
>> crashed; it had 5 OSDs on it, and after rebooting, I have this in ceph -s:
>> (The degraded/misplaced warnings are, I assume, because the cluster hasn't
>> completed rebalancing after I changed the crushmap)
>>
>
> Losing a quarter of your OSDs while simultaneously rebalancing after
> editing your CRUSH map is brutal for a Ceph cluster, and I would
> expect it to impact your client IO severely.
>
> I see that you've got 112MB/s of recovery going on, which may or may
> not be saturating some links depending on whether you're using 1gig or
> 10gig networking.
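If that recovery traffic is saturating the network, it can be throttled while the blocked client IO is investigated -- a sketch, using option names as they exist in Jewel/Kraken (the values are illustrative, not tuned recommendations):

    # limit concurrent backfills and recovery ops per OSD
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    # deprioritize recovery relative to client IO
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1'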
>
>> 2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following dangerous
>> and experimental features are enabled: bluestore
>> 2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following dangerous
>> and experimental features are enabled: bluestore
>> cluster e92e20ca-0fe6-4012-86cc-aa51e0466661
>> health HEALTH_WARN
>> 440 pgs backfill_wait
>> 7 pgs backfilling
>> 85 pgs degraded
>> 5 pgs recovery_wait
>> 85 pgs stuck degraded
>> 452 pgs stuck unclean
>> 77 pgs stuck undersized
>> 77 pgs undersized
>> recovery 196526/3554278 objects degraded (5.529%)
>> recovery 1690392/3554278 objects misplaced (47.559%)
>> mds0: 1 slow requests are blocked > 30 sec
>> monmap e4: 3 mons at
>> {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0,stor-vm3=10.0.15.53:6789/0}
>> election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3
>> fsmap e21: 1/1/1 up {0=stor-vm4=up:active}
>> mgr active: stor-vm1 standbys: stor-vm2
>> osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs
>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>> pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects
>> 11041 GB used, 16901 GB / 27943 GB avail
>> 196526/3554278 objects degraded (5.529%)
>> 1690392/3554278 objects misplaced (47.559%)
>> 975 active+clean
>> 364 active+remapped+backfill_wait
>> 76 active+undersized+degraded+remapped+backfill_wait
>> 3 active+recovery_wait+degraded+remapped
>> 3 active+remapped+backfilling
>> 3 active+degraded+remapped+backfilling
>> 2 active+recovery_wait+degraded
>> 1 active+clean+scrubbing+deep
>> 1 active+undersized+degraded+remapped+backfilling
>> recovery io 112 MB/s, 28 objects/s
>>
>>
>> Seems related to the "corrupted rbd filesystems since jewel" thread.
>>
>>
>> log entries on the MDS server:
>>
>> 2017-05-23 18:27:12.966218 7f95ed6c0700 0 log_channel(cluster) log [WRN] :
>> slow request 243.113407 seconds old, received at 2017-05-23 18:23:09.852729:
>> client_request(client.204100:5 getattr pAsLsXsFs #100000003ec 2017-05-23
>> 17:48:23.770852 RETRY=2 caller_uid=0, caller_gid=0{}) currently failed to
>> rdlock, waiting
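That blocked rdlock can also be examined from the MDS admin socket -- a sketch, assuming the daemon name mds.stor-vm4 used elsewhere in this thread:

    # show the stuck client_request and what it is waiting on
    ceph daemon mds.stor-vm4 dump_ops_in_flight
    # list connected clients, to spot a session from the crashed host
    ceph daemon mds.stor-vm4 session ls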
>>
>>
>> output of ceph daemon mds.stor-vm4 objecter_requests (changes each time I
>> run it)
>
> If that changes each time you run it then it means the OSD requests
> from the MDS are happening.
>
> However, it's possible that you have multiple clients and one of them
> is stuck trying to write something back (to a PG that isn't accepting
> the write yet?), thereby preventing the MDS from granting a lock to
> another client.
Given that the MDS is running a stat on the object, and the mentioned
server crash, I'm assuming it's probing the file size. Daniel, was
that crashed server also mounting CephFS as a client? (Or did you have
another client that went away?)
(Also note that if I'm interpreting it correctly, that object is the
0x3efb9f=4127647th in the file, or about 17TB worth. That doesn't look
like a journal ino to me so it must be a very large user file?)
-Greg
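For reference, the arithmetic behind that estimate (assuming the default 4 MiB CephFS object size; the object name encodes <inode>.<object index>):

    100000003ec.003efb9f  ->  inode 0x100000003ec, object index 0x3efb9f
    0x3efb9f = 4,127,647
    4,127,647 x 4 MiB ~= 15.7 TiB ~= 17.3 TB

which is just under the 16 TiB (17,592,186,044,416-byte) max file size quoted at the top of the thread.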
>
> What clients (+versions) are involved, what's the workload, what
> versions of Ceph?
>
> John
>
>> :
>> root@stor-vm4:/var/log/ceph# ceph daemon mds.stor-vm4 objecter_requests
>> {
>> "ops": [
>> {
>> "tid": 66700,
>> "pg": "1.60e95c32",
>> "osd": 4,
>> "object_id": "100000003ec.003efb9f",
>> "object_locator": "@1",
>> "target_object_id": "100000003ec.003efb9f",
>> "target_object_locator": "@1",
>> "paused": 0,
>> "used_replica": 0,
>> "precalc_pgid": 0,
>> "last_sent": "1.47461e+06s",
>> "attempts": 1,
>> "snapid": "head",
>> "snap_context": "0=[]",
>> "mtime": "1969-12-31 19:00:00.000000s",
>> "osd_ops": [
>> "stat"
>> ]
>> }
>> ],
>> "linger_ops": [],
>> "pool_ops": [],
>> "pool_stat_ops": [],
>> "statfs_ops": [],
>> "command_ops": []
>> }
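Since the one stuck op is a stat against osd.4 in PG 1.60e95c32, the OSD side can be checked directly as well -- a sketch using the IDs from the output above (run the daemon commands on the host carrying osd.4):

    # ops currently in flight / recently completed on that OSD
    ceph daemon osd.4 dump_ops_in_flight
    ceph daemon osd.4 dump_historic_ops
    # state and recovery info for the PG holding the blocked object
    ceph pg 1.60e95c32 query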
>>
>>
>> I've tried restarting the mds daemon (systemctl stop ceph-mds\*.service
>> ceph-mds.target && systemctl start ceph-mds\*.service ceph-mds.target)
>>
>>
>>
>> IO to the file that was being accessed when the host crashed is blocked.
>>
>>
>> Suggestions?
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com