With help from the list we recently recovered one of our Jewel-based clusters, which started failing once we reached about 4800 CephFS snapshots. We understand that CephFS snapshots are still marked experimental. We run a single active MDS with two standby MDS daemons, we have only a single file system, we only take snapshots of the top-level directory, and we now plan to limit snapshots to a few hundred. We have since removed all snapshots from this system, using rmdir on each snapshot directory (roughly as sketched after the status output below), and the cluster is reporting that it is healthy:
ceph -s
    cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
     health HEALTH_OK
     monmap e1: 3 mons at {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
            election epoch 202, quorum 0,1,2 mon01,mon02,mon03
      fsmap e18283: 1/1/1 up {0=mds01=up:active}, 2 up:standby
     osdmap e342543: 93 osds: 93 up, 93 in
            flags sortbitwise,require_jewel_osds
      pgmap v38759308: 11336 pgs, 9 pools, 23107 GB data, 12086 kobjects
            73956 GB used, 209 TB / 281 TB avail
               11336 active+clean
      client io 509 kB/s rd, 2548 B/s wr, 0 op/s rd, 1 op/s wr
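For reference, the snapshot removal amounts to a loop along these lines; the mount point is just an example, and the snapshots appear as subdirectories of the hidden .snap directory of the directory that was snapshotted:

# assuming the file system is mounted at /mnt/cephfs (example path);
# each CephFS snapshot shows up as a subdirectory under .snap
cd /mnt/cephfs/.snap || exit 1
for snap in *; do
    rmdir "$snap"
done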
The snapshots were removed several days ago, but as an experiment I queried a few PGs in the cephfs data pool, and they all list:
"purged_snaps": "[2~12cd,12d0~12c9]",
Here is an example:
ceph pg 1.72 query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "epoch": 342540,
    "up": [
        75,
        77,
        82
    ],
    "acting": [
        75,
        77,
        82
    ],
    "actingbackfill": [
        "75",
        "77",
        "82"
    ],
    "info": {
        "pgid": "1.72",
        "last_update": "342540'261039",
        "last_complete": "342540'261039",
        "log_tail": "341080'260697",
        "last_user_version": 261039,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 1,
        "purged_snaps": "[2~12cd,12d0~12c9]",
…
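For completeness, a loop along these lines can be used to check purged_snaps and snap_trimq across every PG in the data pool (the pool name and the column parsing are assumptions and may need adjusting for your release):

# iterate over all PGs of the cephfs data pool (first column of ls-by-pool,
# header row skipped) and grep the two snapshot-related fields from each query
for pg in $(ceph pg ls-by-pool cephfs_data | awk 'NR>1 {print $1}'); do
    echo "== $pg =="
    ceph pg "$pg" query | grep -E '"purged_snaps"|"snap_trimq"'
done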
Is this an issue?
I am not seeing any recent trim activity.
Are there any procedures documented for looking at snapshots to see if there are any issues?
Before posting this, I have reread the cephfs and snapshot pages at:
Looked at the slides:
Watched the video “Ceph Snapshots for Fun and Profit” from the last OpenStack conference.
And I still can’t find much info on debugging snapshots.
Here is some additional information on the cluster:
ceph df
GLOBAL:
    SIZE     AVAIL    RAW USED   %RAW USED
    281T     209T     73955G     25.62
POOLS:
    NAME              ID   USED     %USED   MAX AVAIL   OBJECTS
    rbd               0    16       0       56326G      3
    cephfs_data       1    22922G   28.92   56326G      12279871
    cephfs_metadata   2    89260k   0       56326G      45232
    cinder            9    147G     0.26    56326G      41420
    glance            10   0        0       56326G      0
    cinder-backup     11   0        0       56326G      0
    cinder-ssltest    23   1362M    0       56326G      431
    IDMT-dfgw02       27   2552M    0       56326G      758
    dfbackup          28   33987M   0.06    56326G      8670
Recent tickets and posts on problems with this cluster:
ceph -v
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
Kernel is 4.13.1
uname -a
Linux ss001 4.13.1-041301-generic #201709100232 SMP Sun Sep 10 06:33:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
OS is Ubuntu 16.04
Thanks
Eric