The current setup is only for testing functionality with Ceph. My idea is to install a production setup on suitable hardware if all goes well...

The MDS is running on a node with 4 GB RAM, 1 GbE and a 4-core processor. The metadata pool is on 3 OSD servers with 2 OSDs per node: 24 GB RAM, 12 cores, 1 GbE. The monitor runs on the same node as the OSD server. I know this is poor hardware, but I only have one client writing and one client listing.

[cephuser@storage1demo ~]$ ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 1/230761 objects misplaced (0.000%)
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsstor1demo(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 895 secs
OBJECT_MISPLACED 1/230761 objects misplaced (0.000%)

[cephuser@storage1demo ~]$ ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS
 0   hdd 1.81940  0.90002 1.8 TiB 183 GiB 1.6 TiB 9.83 1.07 147
 1   hdd 0.45479  1.00000 466 GiB  43 GiB 423 GiB 9.16 1.00  45
 2   hdd 1.81940  1.00000 1.8 TiB 182 GiB 1.6 TiB 9.80 1.07 109
 3   hdd 1.81940  1.00000 1.8 TiB 143 GiB 1.7 TiB 7.67 0.83 112
 4   hdd 1.81940  1.00000 1.8 TiB 182 GiB 1.6 TiB 9.78 1.06 118
 5   hdd 1.81940  1.00000 1.8 TiB 165 GiB 1.7 TiB 8.87 0.96 109
                   TOTAL  9.6 TiB 899 GiB 8.7 TiB 9.19
MIN/MAX VAR: 0.83/1.07  STDDEV: 0.77

[cephuser@storage1demo ~]$ ceph osd status
+----+--------------+-------+-------+--------+---------+--------+---------+-----------+
| id |     host     |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+--------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | storage1demo |  183G | 1679G |    8   |  20.0M  |    0   |    0    | exists,up |
| 1  | storage1demo | 42.6G |  423G |    4   |  9830k  |    0   |    0    | exists,up |
| 2  | storage2demo |  182G | 1680G |   15   |  24.8M  |    0   |    0    | exists,up |
| 3  | storage2demo |  142G | 1720G |    8   |  9833k  |    0   |    0    | exists,up |
| 4  | storage3demo |  182G | 1680G |   13   |  23.2M  |    0   |    0    | exists,up |
| 5  | storage3demo |  165G | 1697G |    7   |  13.6M  |    0   |    0    | exists,up |
+----+--------------+-------+-------+--------+---------+--------+---------+-----------+

[cephuser@storage1demo ~]$ rados df
POOL_NAME       USED    OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD      WR_OPS  WR
cephfs_data     446 GiB  115328      0 230656                  0       0        0  42754 111 GiB 2527943 5.0 TiB
cephfs_metadata  50 MiB      35      0    105                  0       0        0    166  38 MiB   24857  62 MiB

total_objects    115363
total_used       899 GiB
total_avail      8.7 TiB
total_space      9.6 TiB

While I'm running "ls", the MDS daemon socket shows events of the type:

    "event": "failed to xlock, waiting"
    "event": "failed to rdlock, waiting"
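For reference, these are the admin-socket queries I can run from the MDS host to dig into this (the daemon name stor1demo is taken from the MDS log, so it may need adjusting on another setup); as far as I can tell, dump_ops_in_flight is where those xlock/rdlock events show up:

    # show the blocked requests and their event history ("failed to rdlock, waiting", etc.)
    ceph daemon mds.stor1demo dump_ops_in_flight

    # list client sessions, to map client ids from the warnings back to hostnames/mounts
    ceph daemon mds.stor1demo session ls

    # show outstanding RADOS operations from the MDS against the metadata pool
    ceph daemon mds.stor1demo objecter_requests

session ls should let me map client.4375 from the warnings below back to a hostname, and objecter_requests should show whether the "slow metadata IOs" are actually stuck on the metadata-pool OSDs or whether the MDS is waiting on client capabilities.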
MDS logs:

2019-09-12 10:35:33.582 7fbaebfaf700  1 mds.stor1demo Updating MDS map to version 941 from mon.0
2019-09-12 10:35:36.012 7fbae9521700  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 51.253125 secs
2019-09-12 10:35:37.333 7fbaebfaf700  1 mds.stor1demo Updating MDS map to version 942 from mon.0
2019-09-12 10:35:41.012 7fbae9521700  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 56.253176 secs
2019-09-12 10:35:41.332 7fbaebfaf700  1 mds.stor1demo Updating MDS map to version 943 from mon.0
2019-09-12 10:35:46.012 7fbae9521700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 61.253205 secs
2019-09-12 10:35:46.012 7fbae9521700  0 log_channel(cluster) log [WRN] : slow request 61.253204 seconds old, received at 2019-09-12 10:34:44.760404: client_request(client.4394:5797 getattr pAsLsXsFs #0x10000000f6e 2019-09-12 10:34:44.759327 caller_uid=1000, caller_gid=1000{}) currently failed to rdlock, waiting
2019-09-12 10:35:46.013 7fbae9521700  0 log_channel(cluster) log [WRN] : client.4375 isn't responding to mclientcaps(revoke), ino 0x10000000f6e pending pAsLsXsFsc issued pAsLsXsFscb, sent 61.253848 seconds ago
2019-09-12 10:35:49.333 7fbaebfaf700  1 mds.stor1demo Updating MDS map to version 944 from mon.0
2019-09-12 10:35:51.013 7fbae9521700  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 66.253237 secs

If I run "ls" from the same node that is writing, it works fine. Does this look like an MDS locking problem?

Thanks.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx