Re: Looking for help with debugging cephfs snapshots

On Sun, Oct 22, 2017 at 8:05 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
On Mon, Oct 23, 2017 at 9:35 AM, Eric Eastman
<eric.eastman@xxxxxxxxxxxxxx> wrote:
> With help from the list we recently recovered one of our Jewel based
> clusters that started failing when we got to about 4800 cephfs snapshots.
> We understand that cephfs snapshots are still marked experimental. We are
> running a single active MDS with 2 standby MDS. We only have a single file
> system, we are only taking snapshots from the top level directory, and we
> are now planning on limiting snapshots to a few hundred. Currently we have
> removed all snapshots from this system, using rmdir on each snapshot
> directory, and the system is reporting that it is healthy:
>
> ceph -s
>     cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
>      health HEALTH_OK
>      monmap e1: 3 mons at
> {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
>             election epoch 202, quorum 0,1,2 mon01,mon02,mon03
>       fsmap e18283: 1/1/1 up {0=mds01=up:active}, 2 up:standby
>      osdmap e342543: 93 osds: 93 up, 93 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v38759308: 11336 pgs, 9 pools, 23107 GB data, 12086 kobjects
>             73956 GB used, 209 TB / 281 TB avail
>                11336 active+clean
>   client io 509 kB/s rd, 2548 B/s wr, 0 op/s rd, 1 op/s wr
>
> The snapshots were removed several days ago, but just as an experiment I
> decided to query a few PGs in the cephfs data storage pool, and I see that
> they all list:
>
> "purged_snaps": "[2~12cd,12d0~12c9]",

purged_snaps lists the IDs of snapshots whose data have been completely
purged. Currently the purged_snaps set is append-only; the OSD never removes
IDs from it.
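
For illustration, both fields can be read straight out of a pg query:
snap_trimq should drain back to [] once trimming finishes, while
info.purged_snaps only ever grows. A minimal sketch, assuming the ceph CLI
and an admin keyring are available on the host (pg 1.72 is the example
queried below):

import json
import subprocess

# Minimal sketch: print the two snapshot-related fields from a pg query.
# Assumes the ceph CLI and an admin keyring are available on this host.
pg = "1.72"  # example pgid from the cephfs_data pool in this thread
out = subprocess.check_output(["ceph", "pg", pg, "query"])
query = json.loads(out.decode("utf-8"))
print("snap_trimq:   %s" % query["snap_trimq"])
print("purged_snaps: %s" % query["info"]["purged_snaps"])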
 
Thank you for the quick reply.
So it is normal to have "purged_snaps" listed on a system where all snapshots have been deleted. 
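
For reference, the value is an interval set of snapshot IDs, so a rough way
to see how many IDs it covers is to expand the start~length entries. A
minimal sketch, assuming those entries are hexadecimal (which is how snapids
appear to be printed):

def count_purged(interval_set):
    # Expand "[start~length,...]" entries; lengths assumed to be hex.
    total = 0
    for entry in interval_set.strip("[]").split(","):
        if not entry:
            continue
        _start, length = entry.split("~")
        total += int(length, 16)
    return total

print(count_purged("[2~12cd,12d0~12c9]"))  # prints 9622 for the value above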
Eric

 
>
> Here is an example:
>
> ceph pg 1.72 query
> {
>     "state": "active+clean",
>     "snap_trimq": "[]",
>     "epoch": 342540,
>     "up": [
>         75,
>         77,
>         82
>     ],
>     "acting": [
>         75,
>         77,
>         82
>     ],
>     "actingbackfill": [
>         "75",
>         "77",
>         "82"
>     ],
>     "info": {
>         "pgid": "1.72",
>         "last_update": "342540'261039",
>         "last_complete": "342540'261039",
>         "log_tail": "341080'260697",
>         "last_user_version": 261039,
>         "last_backfill": "MAX",
>         "last_backfill_bitwise": 1,
>         "purged_snaps": "[2~12cd,12d0~12c9]",
> …
>
> Is this an issue?
> I am not seeing any recent trim activity.
> Are there any procedures documented for looking at snapshots to see if there
> are any issues?
>
> Before posting this, I have reread the cephfs and snapshot pages at:
> http://docs.ceph.com/docs/master/cephfs/
> http://docs.ceph.com/docs/master/dev/cephfs-snapshots/
>
> Looked at the slides:
> http://events.linuxfoundation.org/sites/events/files/slides/2017-03-23%20Vault%20Snapshots.pdf
>
> Watched the video “Ceph Snapshots for Fun and Profit” given at the last
> OpenStack conference.
>
> And I still can’t find much info on debugging snapshots.
>
> Here is some additional information on the cluster:
>
> ceph df
> GLOBAL:
>     SIZE     AVAIL     RAW USED     %RAW USED
>     281T      209T       73955G         25.62
> POOLS:
>     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
>     rbd                 0          16         0        56326G            3
>     cephfs_data         1      22922G     28.92        56326G     12279871
>     cephfs_metadata     2      89260k         0        56326G        45232
>     cinder              9        147G      0.26        56326G        41420
>     glance              10          0         0        56326G            0
>     cinder-backup       11          0         0        56326G            0
>     cinder-ssltest      23      1362M         0        56326G          431
>     IDMT-dfgw02         27      2552M         0        56326G          758
>     dfbackup            28     33987M      0.06        56326G         8670
>
>
> Recent tickets and posts on problems with this cluster
> http://tracker.ceph.com/issues/21761
> http://tracker.ceph.com/issues/21412
> https://www.spinics.net/lists/ceph-devel/msg38203.html
>
> ceph -v
> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>
> Kernel is 4.13.1
> uname -a
> Linux ss001 4.13.1-041301-generic #201709100232 SMP Sun Sep 10 06:33:36 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> OS is Ubuntu 16.04
>
> Thanks
> Eric
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
