MDS blocked ops; kernel: Workqueue: ceph-pg-invalid ceph_invalidate_work [ceph]

Hi, I encountered a problem with blocked MDS operations and a client becoming unresponsive. I dumped the MDS cache, ops, blocked ops and some further log information here:

https://files.dtu.dk/u/peQSOY1kEja35BI5/2010-09-03-mds-blocked-ops?l

A user of our HPC system was running a job that puts a fairly heavy load on the MDS. This workload tends to trigger MDS warnings like "slow metadata ops" and "client does not respond to caps release", which usually disappear without intervention after a while.

He cancelled the job, and one operation from one of the clients remained stuck in the MDS. We had a health warning about 1 blocked metadata operation and one client failing to respond to caps release. I should mention that we execute "echo 3 > /proc/sys/vm/drop_caches" in the epilogue script that runs after every job, which usually cleans up all unused caps without problems. As a result, by the time I looked at the number of client caps, it was already below 100 for the client in question. It looks like there might be a race condition between the cache drop and MDS requests.
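For reference, the epilogue step mentioned above could be sketched in Python roughly as follows. This is only an illustration, not our actual epilogue script: the function name `drop_caches`, its parameters, and the choice to call `sync` first are assumptions. Only the write of "3" to /proc/sys/vm/drop_caches corresponds to what we really run.

```python
import os


def drop_caches(mode=3, path="/proc/sys/vm/drop_caches", sync_first=True):
    """Ask the kernel to drop clean caches (hypothetical helper).

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    Writing to the real path requires root.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    if sync_first:
        # Flush dirty data first; drop_caches only frees clean objects.
        os.sync()
    with open(path, "w") as f:
        f.write(f"{mode}\n")


if __name__ == "__main__":
    # Equivalent of: echo 3 > /proc/sys/vm/drop_caches
    drop_caches(3)
```

The `path` parameter exists only so the helper can be exercised against an ordinary file without root privileges.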

In addition, backfill was going on while this happened. All PGs were active (in various active+... states), and all storage was accessible for reads and writes.

On the client side, this was in the logs:

Sep  3 09:15:57 sn110 kernel: INFO: task kworker/0:1:79782 blocked for more than 120 seconds.
Sep  3 09:15:57 sn110 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  3 09:15:57 sn110 kernel: kworker/0:1     D ffff995cf4614100     0 79782      2 0x00000000
Sep  3 09:15:57 sn110 kernel: Workqueue: ceph-pg-invalid ceph_invalidate_work [ceph]
Sep  3 09:15:57 sn110 kernel: Call Trace:
[... see link above ...]
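To spot such reports across many clients, a small parser along the following lines might help. It is a minimal sketch that assumes the syslog line format shown above; the function name `parse_hung_tasks` and the dictionary keys are made up for illustration.

```python
import re

# Matches e.g. "INFO: task kworker/0:1:79782 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# Matches e.g. "Workqueue: ceph-pg-invalid ceph_invalidate_work [ceph]"
WORKQUEUE_RE = re.compile(
    r"Workqueue: (?P<wq>\S+) (?P<fn>\S+)(?: \[(?P<module>[^\]]+)\])?"
)


def parse_hung_tasks(lines):
    """Return one dict per hung-task report found in the given log lines."""
    reports = []
    for line in lines:
        m = HUNG_TASK_RE.search(line)
        if m:
            reports.append({
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
                "timeout_secs": int(m.group("secs")),
                "workqueue": None,
                "function": None,
                "module": None,
            })
            continue
        m = WORKQUEUE_RE.search(line)
        if m and reports:
            # Attach the workqueue line to the most recent report.
            reports[-1].update(
                workqueue=m.group("wq"),
                function=m.group("fn"),
                module=m.group("module"),
            )
    return reports


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python3 hung_tasks.py < /var/log/messages
    for report in parse_hung_tasks(sys.stdin):
        print(report)
```

Filtering the result for `module == "ceph"` would then list only the hung tasks that belong to the kernel CephFS client.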

I did not see slow ops on any of the OSDs. All other information is in the link above.

We had to reboot the client to resolve this problem. It seems like the MDS does not clean up blocked requests in certain situations when it ought to be possible. I hope the cache and ops dumps help pinpoint the reason.

Best regards,
Frank


