With help from the list we recently recovered one of our Jewel-based clusters, which started failing once we reached about 4800 CephFS snapshots. We understand that CephFS snapshots are still marked experimental. We run a single active MDS with two standby MDS daemons, we have only a single file system, we only take snapshots of the top-level directory, and we now plan to limit snapshots to a few hundred. We have since removed all snapshots from this system, using rmdir on each snapshot directory (roughly as sketched after the status output below), and the cluster is reporting that it is healthy:
ceph -s
    cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
     health HEALTH_OK
     monmap e1: 3 mons at {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
            election epoch 202, quorum 0,1,2 mon01,mon02,mon03
      fsmap e18283: 1/1/1 up {0=mds01=up:active}, 2 up:standby
     osdmap e342543: 93 osds: 93 up, 93 in
            flags sortbitwise,require_jewel_osds
      pgmap v38759308: 11336 pgs, 9 pools, 23107 GB data, 12086 kobjects
            73956 GB used, 209 TB / 281 TB avail
               11336 active+clean
      client io 509 kB/s rd, 2548 B/s wr, 0 op/s rd, 1 op/s wr
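For reference, the snapshot removal amounts to a loop along these lines; the mount point is just an example, and the snapshots appear as subdirectories of the hidden .snap directory of the directory that was snapshotted:

# assuming the file system is mounted at /mnt/cephfs (example path);
# each CephFS snapshot shows up as a subdirectory under .snap
cd /mnt/cephfs/.snap || exit 1
for snap in *; do
    rmdir "$snap"
done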
The snapshots were removed several days ago, but as an experiment I queried a few PGs in the cephfs data pool, and they all list:
"purged_snaps": "[2~12cd,12d0~12c9]",
Here is an example:
ceph pg 1.72 query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "epoch": 342540,
    "up": [
        75,
        77,
        82
    ],
    "acting": [
        75,
        77,
        82
    ],
    "actingbackfill": [
        "75",
        "77",
        "82"
    ],
    "info": {
        "pgid": "1.72",
        "last_update": "342540'261039",
        "last_complete": "342540'261039",
        "log_tail": "341080'260697",
        "last_user_version": 261039,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 1,
        "purged_snaps": "[2~12cd,12d0~12c9]",
…
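For completeness, a loop along these lines can be used to check purged_snaps and snap_trimq across every PG in the data pool (the pool name and the column parsing are assumptions and may need adjusting for your release):

# iterate over all PGs of the cephfs data pool (first column of ls-by-pool,
# header row skipped) and grep the two snapshot-related fields from each query
for pg in $(ceph pg ls-by-pool cephfs_data | awk 'NR>1 {print $1}'); do
    echo "== $pg =="
    ceph pg "$pg" query | grep -E '"purged_snaps"|"snap_trimq"'
done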
Is this an issue?
I am not seeing any recent trim activity.
Are there any procedures documented for looking at snapshots to see if there are any issues?
Before posting this, I have reread the cephfs and snapshot pages at:
Looked at the slides:
Watched the video “Ceph Snapshots for Fun and Profit” from the last OpenStack conference.
And I still can’t find much info on debugging snapshots.
Here is some additional information on the cluster:
ceph df
GLOBAL:
    SIZE     AVAIL    RAW USED   %RAW USED
    281T     209T     73955G     25.62
POOLS:
    NAME              ID   USED     %USED   MAX AVAIL   OBJECTS
    rbd               0    16       0       56326G      3
    cephfs_data       1    22922G   28.92   56326G      12279871
    cephfs_metadata   2    89260k   0       56326G      45232
    cinder            9    147G     0.26    56326G      41420
    glance            10   0        0       56326G      0
    cinder-backup     11   0        0       56326G      0
    cinder-ssltest    23   1362M    0       56326G      431
    IDMT-dfgw02       27   2552M    0       56326G      758
    dfbackup          28   33987M   0.06    56326G      8670
Recent tickets and posts on problems with this cluster:
ceph -v
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
Kernel is 4.13.1
uname -a
Linux ss001 4.13.1-041301-generic #201709100232 SMP Sun Sep 10 06:33:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
OS is Ubuntu 16.04
Thanks
Eric