On Wed, May 24, 2017 at 3:15 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Tue, May 23, 2017 at 11:41 PM, Daniel K <sathackr@xxxxxxxxx> wrote:
>> Have a 20-OSD cluster ("my first ceph cluster") that has another 400 OSDs
>> en route.
>>
>> I was "beating up" on the cluster and had been writing to a 6TB file in
>> CephFS for several hours, during which I changed the crushmap to better
>> match my environment, generating a bunch of recovery IO. After about 5.8TB
>> written, one of the OSD hosts (also a MON, soon to be rectified) that had
>> 5 OSDs on it crashed, and after rebooting I have this in ceph -s (the
>> degraded/misplaced warnings are likely because the cluster hasn't
>> completed rebalancing after I changed the crushmap, I assume):
>>
>
> Having a quarter of your OSDs down while simultaneously rebalancing
> after editing your CRUSH map is a brutal thing to do to a Ceph cluster,
> and I would expect it to impact your client IO severely.
>
> I see that you've got 112MB/s of recovery going on, which may or may
> not be saturating some links depending on whether you're using 1gig or
> 10gig networking.
>
>> 2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following dangerous
>> and experimental features are enabled: bluestore
>> 2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following dangerous
>> and experimental features are enabled: bluestore
>>     cluster e92e20ca-0fe6-4012-86cc-aa51e0466661
>>      health HEALTH_WARN
>>             440 pgs backfill_wait
>>             7 pgs backfilling
>>             85 pgs degraded
>>             5 pgs recovery_wait
>>             85 pgs stuck degraded
>>             452 pgs stuck unclean
>>             77 pgs stuck undersized
>>             77 pgs undersized
>>             recovery 196526/3554278 objects degraded (5.529%)
>>             recovery 1690392/3554278 objects misplaced (47.559%)
>>             mds0: 1 slow requests are blocked > 30 sec
>>      monmap e4: 3 mons at
>> {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0,stor-vm3=10.0.15.53:6789/0}
>>             election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3
>>       fsmap e21: 1/1/1 up {0=stor-vm4=up:active}
>>         mgr active: stor-vm1 standbys: stor-vm2
>>      osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs
>>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>>       pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects
>>             11041 GB used, 16901 GB / 27943 GB avail
>>             196526/3554278 objects degraded (5.529%)
>>             1690392/3554278 objects misplaced (47.559%)
>>                  975 active+clean
>>                  364 active+remapped+backfill_wait
>>                   76 active+undersized+degraded+remapped+backfill_wait
>>                    3 active+recovery_wait+degraded+remapped
>>                    3 active+remapped+backfilling
>>                    3 active+degraded+remapped+backfilling
>>                    2 active+recovery_wait+degraded
>>                    1 active+clean+scrubbing+deep
>>                    1 active+undersized+degraded+remapped+backfilling
>>       recovery io 112 MB/s, 28 objects/s
>>
>> Seems related to the "corrupted rbd filesystems since jewel" thread.
>>
>> Log entries on the MDS server:
>>
>> 2017-05-23 18:27:12.966218 7f95ed6c0700 0 log_channel(cluster) log [WRN] :
>> slow request 243.113407 seconds old, received at 2017-05-23 18:23:09.852729:
>> client_request(client.204100:5 getattr pAsLsXsFs #100000003ec 2017-05-23
>> 17:48:23.770852 RETRY=2 caller_uid=0, caller_gid=0{}) currently failed to
>> rdlock, waiting
>>
>> Output of ceph daemon mds.stor-vm4 objecter_requests (changes each time I
>> run it)
>
> If that changes each time you run it then it means the OSD requests
> from the MDS are happening.
>
> However, it's possible that you have multiple clients and one of them
> is stuck trying to write something back (to a PG that is not accepting
> the write (yet?)), thereby preventing the MDS from granting a lock
> to another client.

Given that the MDS is running a stat on the object, and given the server
crash you mentioned, I'm assuming it's probing the file size. Daniel, was
that crashed server also mounting CephFS as a client? (Or did you have
another client that went away?)

(Also note that, if I'm interpreting it correctly, that object is the
0x3efb9f = 4127647th in the file, or about 17TB in. That doesn't look like
a journal ino to me, so it must be a very large user file?)
-Greg

> What clients (+versions) are involved, what's the workload, and what
> versions of Ceph?
>
> John
>
>> :
>> root@stor-vm4:/var/log/ceph# ceph daemon mds.stor-vm4 objecter_requests
>> {
>>     "ops": [
>>         {
>>             "tid": 66700,
>>             "pg": "1.60e95c32",
>>             "osd": 4,
>>             "object_id": "100000003ec.003efb9f",
>>             "object_locator": "@1",
>>             "target_object_id": "100000003ec.003efb9f",
>>             "target_object_locator": "@1",
>>             "paused": 0,
>>             "used_replica": 0,
>>             "precalc_pgid": 0,
>>             "last_sent": "1.47461e+06s",
>>             "attempts": 1,
>>             "snapid": "head",
>>             "snap_context": "0=[]",
>>             "mtime": "1969-12-31 19:00:00.000000s",
>>             "osd_ops": [
>>                 "stat"
>>             ]
>>         }
>>     ],
>>     "linger_ops": [],
>>     "pool_ops": [],
>>     "pool_stat_ops": [],
>>     "statfs_ops": [],
>>     "command_ops": []
>> }
>>
>> I've tried restarting the MDS daemon ( systemctl stop ceph-mds\*.service
>> ceph-mds.target && systemctl start ceph-mds\*.service ceph-mds.target )
>>
>> IO to the file that was being accessed when the host crashed is blocked.
>>
>> Suggestions?
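
Greg's figure can be reproduced from the object name alone: CephFS data
objects are named <inode hex>.<object index hex>, so with the default file
layout the index maps directly to a byte offset. Below is a minimal sketch,
assuming the default layout of 4 MiB objects with stripe_count 1 (a
non-default layout would change the arithmetic); the helper name is just
illustrative.

    #!/usr/bin/env python3
    # Sketch: decode a CephFS data-pool object name such as
    # "100000003ec.003efb9f" into its inode number and object index, and
    # estimate how far into the file that object sits.  Assumes the default
    # file layout (4 MiB object_size, stripe_count 1).

    OBJECT_SIZE = 4 * 1024 * 1024  # default CephFS object size -- an assumption

    def decode_object_name(name):
        ino_hex, index_hex = name.split(".")
        ino = int(ino_hex, 16)        # inode number of the file
        index = int(index_hex, 16)    # zero-based object index within the file
        offset = index * OBJECT_SIZE  # approximate starting byte offset
        return ino, index, offset

    if __name__ == "__main__":
        ino, index, offset = decode_object_name("100000003ec.003efb9f")
        print("inode 0x%x, object #%d, ~%.1f TB into the file"
              % (ino, index, offset / 1e12))

For the object in the objecter dump above this works out to object #4127647
of inode 0x100000003ec, roughly 17 TB into the file, which matches Greg's
estimate.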
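
John's "does it change each time you run it" check can also be scripted:
sample the admin-socket output twice and see whether any request tid survives
both samples. A rough sketch, assuming the ceph CLI and the MDS admin socket
are reachable on the local host and that the daemon is named mds.stor-vm4
(adjust for your deployment):

    #!/usr/bin/env python3
    # Sketch: sample the MDS objecter twice and report any OSD request that
    # is still outstanding in both samples.  Assumes "ceph daemon mds.<name>
    # objecter_requests" works locally; MDS_NAME is an assumption.

    import json
    import subprocess
    import time

    MDS_NAME = "stor-vm4"  # assumption: your MDS daemon name

    def outstanding_ops():
        out = subprocess.check_output(
            ["ceph", "daemon", "mds.%s" % MDS_NAME, "objecter_requests"])
        ops = json.loads(out).get("ops", [])
        # Map tid -> (osd, object_id) so we can see which requests persist.
        return {op["tid"]: (op["osd"], op["object_id"]) for op in ops}

    if __name__ == "__main__":
        first = outstanding_ops()
        time.sleep(5)
        second = outstanding_ops()
        stuck = sorted(set(first) & set(second))
        if stuck:
            for tid in stuck:
                osd, obj = second[tid]
                print("tid %d still outstanding against osd.%d (%s)"
                      % (tid, osd, obj))
        else:
            print("no request survived both samples; the objecter is moving")

A tid that keeps showing up against the same OSD (as with tid 66700 on osd.4
above) points at that OSD, or the PG it serves, as the place the request is
stuck.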