How to prevent blocked requests?

Hey friends,

A month ago I had an issue with a few blocked requests, during which some of my VMs froze. I guessed the culprit was a spinning disk with a lot of "delayed ECC" errors (48701, shown via smartctl).
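
For reference, this is roughly how those counters can be read with smartctl (just a sketch; /dev/sdX is a placeholder for the disk behind that OSD):

#> smartctl -a /dev/sdX          # full health output; for SAS disks this includes the error counter log with the fast/delayed ECC columns
#> smartctl -l error /dev/sdX    # only the error counter log
#> smartctl -t long /dev/sdX     # start a long self-test; results show up later via: smartctl -l selftest /dev/sdX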

So we decided to take this OSD down/out to do some checks. After that, the blocked requests were gone and we had no more freezes.
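
Roughly the steps for taking a single OSD out and back in (a sketch only; the id 12 is a placeholder, and the systemd unit name assumes Jewel on Ubuntu 16.04):

#> ceph osd out 12              # stop placing data on it, the cluster rebalances
#> systemctl stop ceph-osd@12   # stop the daemon for the offline checks
# ... smartctl / other checks here ...
#> systemctl start ceph-osd@12
#> ceph osd in 12               # take it back in once the checks look ok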

Btw, related to the mentioned blocked requests, *dmesg* on the server produced this (two times):
[4927177.901845] INFO: task filestore_sync:5907 blocked for more than 120 seconds.
[4927177.902147] Tainted: G I 4.4.0-43-generic #63-Ubuntu
[4927177.902416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4927177.902735] filestore_sync D ffff8810073e3e00 0 5907 1 0x00000000
[4927177.902741] ffff8810073e3e00 ffff88102a1f0db8 ffff8810367fb700 ffff8810281b0dc0
[4927177.902745] ffff8810073e4000 ffff88102a1f0de8 ffff88102a1f0a98 ffff8810073e3e8c
[4927177.902748] 00005638fa13e000 ffff8810073e3e18 ffffffff8182d7c5 ffff8810073e3e8c
[4927177.902751] Call Trace:
[4927177.902764]  [<ffffffff8182d7c5>] schedule+0x35/0x80
[4927177.902771]  [<ffffffff812378e8>] wb_wait_for_completion+0x58/0xa0
[4927177.902779] [<ffffffff810c3dd0>] ? wake_atomic_t_function+0x60/0x60
[4927177.902782]  [<ffffffff8123b2d3>] sync_inodes_sb+0xa3/0x1f0
[4927177.902786]  [<ffffffff812418ea>] sync_filesystem+0x5a/0xa0
[4927177.902789]  [<ffffffff81241a7e>] SyS_syncfs+0x3e/0x70
[4927177.902794] [<ffffffff818318b2>] entry_SYSCALL_64_fastpath+0x16/0x71
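
For reference, blocked requests like these can be traced to a specific OSD with something like the following (osd.12 is again just a placeholder; the daemon commands have to run on the host of that OSD):

#> ceph health detail                      # shows the "N requests are blocked > 32 sec" warnings and the OSDs involved
#> ceph daemon osd.12 dump_ops_in_flight   # ops currently stuck on that OSD
#> ceph daemon osd.12 dump_historic_ops    # recently finished slow ops with their durations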

Later (after a smartctl long self-test) we put the mentioned OSD back in and again had no more issues.

Finally, my question :)

Is Ceph able to deal with "problematic" disks, and how can this be tuned? Perhaps via special timeouts? I mean, what happens when Ceph cannot read a shard of a PG because of an I/O error? Or when an OSD takes too long, as in the dmesg output above?
In our setup we use a replication size of 3, so when a read/write request takes too much time, Ceph should be able to use another copy of the shard (at least for reads).
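
These are the options I have found so far that seem related; the values are just the Jewel defaults as far as I can tell, not a recommendation, so please correct me if these are the wrong knobs:

[osd]
# after how many seconds an op is reported as a blocked ("slow") request
osd_op_complaint_time = 30
# how long missing heartbeats are tolerated before peers report the OSD down
osd_heartbeat_grace = 20

[mon]
# how long a "down" OSD may stay in the cluster before it is marked "out"
# and recovery onto the remaining replicas starts
mon_osd_down_out_interval = 600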

This is my Setup (in production):

*Software/OS*
- Jewel
#> ceph tell osd.* version | grep version | uniq
"version": "ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)"

#> ceph tell mon.* version
 [...] ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)

- Ubuntu 16.04.1 LTS on all OSD and MON Servers
#> uname -a
Linux server 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

*Server*
4x OSD Servers, 3x with

- 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 Cores, no Hyper-Threading
- 64GB RAM
- 12x 4TB HGST 7K4000 SAS2 (6Gb/s) Disks as OSDs
- 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for 12 Disks (20G Journal size)
- 1x Samsung SSD 840/850 Pro only for the OS

and 1x OSD Server with

- 1x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (10 Cores 20 Threads)
- 64GB RAM
- 23x 2TB TOSHIBA MK2001TRKB SAS2 (6Gb/s) Disks as OSDs
- 1x SEAGATE ST32000445SS SAS2 (6Gb/s) Disk as OSD
- 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for 24 Disks (15G Journal size)
- 1x Samsung SSD 850 Pro only for the OS

3x MON Servers

- Two of them with 1x Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz (4 Cores, 8 Threads)
- The third one with 2x Intel(R) Xeon(R) CPU L5430 @ 2.66GHz ==> 8 Cores, no Hyper-Threading
- 32 GB RAM
- 1x Raid 10 (4 Disks)

*Network*

- Each Server and Client has 2x 10GbE (LACP).
- We do not use Jumbo Frames yet.
- Public and Cluster-Network related Ceph traffic both go through this one active (LACP) 10GbE interface on each Server.

*ceph.conf*
[global]
fsid = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
public_network = xxx.16.0.0/24
cluster_network = xx.0.0.0/24
mon_initial_members = monserver1, monserver2, monserver3
mon_host = xxx.16.0.2,xxx.16.0.3,xxx.16.0.4
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_crush_initial_weight = 0

mon_osd_full_ratio = 0.90
mon_osd_nearfull_ratio = 0.80

[mon]
mon_allow_pool_delete = false

[osd]
#osd_journal_size = 20480
osd_journal_size = 15360
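
And this is how I would verify what an OSD is actually running with, via the admin socket on the OSD host (osd.0 is just a placeholder):

#> ceph daemon osd.0 config get osd_journal_size
#> ceph daemon osd.0 config get osd_op_complaint_time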

Please ask if you need more information.
Thanks so far.

- Mehmet
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


