On Mon, Oct 23, 2017 at 9:35 AM, Eric Eastman <eric.eastman@xxxxxxxxxxxxxx> wrote:
> With help from the list we recently recovered one of our Jewel based
> clusters that started failing when we got to about 4800 cephfs snapshots.
> We understand that cephfs snapshots are still marked experimental. We are
> running a single active MDS with 2 standby MDS. We only have a single file
> system, we are only taking snapshots from the top level directory, and we
> are now planning on limiting snapshots to a few hundred. Currently we have
> removed all snapshots from this system, using rmdir on each snapshot
> directory, and the system is reporting that it is healthy:
>
> ceph -s
>     cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
>      health HEALTH_OK
>      monmap e1: 3 mons at
> {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
>             election epoch 202, quorum 0,1,2 mon01,mon02,mon03
>       fsmap e18283: 1/1/1 up {0=mds01=up:active}, 2 up:standby
>      osdmap e342543: 93 osds: 93 up, 93 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v38759308: 11336 pgs, 9 pools, 23107 GB data, 12086 kobjects
>             73956 GB used, 209 TB / 281 TB avail
>                11336 active+clean
>   client io 509 kB/s rd, 2548 B/s wr, 0 op/s rd, 1 op/s wr
>
> The snapshots were removed several days ago, but just as an experiment I
> decided to query a few PGs in the cephfs data storage pool, and I am seeing
> they are all listing:
>
> "purged_snaps": "[2~12cd,12d0~12c9]",

purged_snaps contains the IDs of snapshots whose data have been completely
purged. Currently the purged_snaps set is append-only; the OSD never removes
IDs from it. (A small parsing sketch is appended below the quoted message.)

Regards
Yan, Zheng

>
> Here is an example:
>
> ceph pg 1.72 query
> {
>     "state": "active+clean",
>     "snap_trimq": "[]",
>     "epoch": 342540,
>     "up": [
>         75,
>         77,
>         82
>     ],
>     "acting": [
>         75,
>         77,
>         82
>     ],
>     "actingbackfill": [
>         "75",
>         "77",
>         "82"
>     ],
>     "info": {
>         "pgid": "1.72",
>         "last_update": "342540'261039",
>         "last_complete": "342540'261039",
>         "log_tail": "341080'260697",
>         "last_user_version": 261039,
>         "last_backfill": "MAX",
>         "last_backfill_bitwise": 1,
>         "purged_snaps": "[2~12cd,12d0~12c9]",
> …
>
> Is this an issue?
> I am not seeing any recent trim activity.
> Are there any procedures documented for looking at snapshots to see if
> there are any issues?
>
> Before posting this, I have reread the cephfs and snapshot pages at:
> http://docs.ceph.com/docs/master/cephfs/
> http://docs.ceph.com/docs/master/dev/cephfs-snapshots/
>
> Looked at the slides:
> http://events.linuxfoundation.org/sites/events/files/slides/2017-03-23%20Vault%20Snapshots.pdf
>
> Watched the video “Ceph Snapshots for Fun and Profit” given at the last
> OpenStack conference.
>
> And I still can’t find much info on debugging snapshots.
>
> Here is some additional information on the cluster:
>
> ceph df
> GLOBAL:
>     SIZE     AVAIL     RAW USED     %RAW USED
>     281T     209T      73955G       25.62
> POOLS:
>     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
>     rbd                 0          16         0        56326G            3
>     cephfs_data         1      22922G     28.92        56326G     12279871
>     cephfs_metadata     2      89260k         0        56326G        45232
>     cinder              9        147G      0.26        56326G        41420
>     glance              10          0         0        56326G            0
>     cinder-backup       11          0         0        56326G            0
>     cinder-ssltest      23      1362M         0        56326G          431
>     IDMT-dfgw02         27      2552M         0        56326G          758
>     dfbackup            28     33987M      0.06        56326G         8670
>
> Recent tickets and posts on problems with this cluster:
> http://tracker.ceph.com/issues/21761
> http://tracker.ceph.com/issues/21412
> https://www.spinics.net/lists/ceph-devel/msg38203.html
>
> ceph -v
> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>
> Kernel is 4.13.1
> uname -a
> Linux ss001 4.13.1-041301-generic #201709100232 SMP Sun Sep 10 06:33:36 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> OS is Ubuntu 16.04
>
> Thanks
> Eric
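
For anyone who wants to check the same thing on their own cluster, here is a
minimal Python sketch that pulls "purged_snaps" out of `ceph pg <pgid> query`
and expands it. It assumes the JSON layout shown in the quoted output above
and that purged_snaps uses Ceph's interval-set notation, i.e.
"[start~length,...]" with hexadecimal snap IDs; the script name and the
default pgid are just examples, not part of any Ceph tooling.

#!/usr/bin/env python3
# purged_snaps_report.py (hypothetical name) -- expand a PG's purged_snaps.
import json
import subprocess
import sys

def parse_interval_set(text):
    # "[2~12cd,12d0~12c9]" -> [(0x2, 0x12cd), (0x12d0, 0x12c9)]
    text = text.strip().strip("[]")
    if not text:
        return []
    intervals = []
    for part in text.split(","):
        start, length = part.split("~")
        intervals.append((int(start, 16), int(length, 16)))
    return intervals

def main(pgid):
    # `ceph pg <pgid> query` prints JSON, as in the output quoted above.
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    info = json.loads(out.decode("utf-8"))["info"]
    intervals = parse_interval_set(info.get("purged_snaps", "[]"))
    total = sum(length for _, length in intervals)
    print("pg %s: purged_snaps covers %d snap IDs" % (pgid, total))
    for start, length in intervals:
        print("  0x%x .. 0x%x" % (start, start + length - 1))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "1.72")

Run it as "python3 purged_snaps_report.py 1.72" on a node with admin access
to the cluster. Per the explanation above, a large purged_snaps list is a
record of snapshots that have already been purged, not a backlog of trimming
work (pending trims would show up in snap_trimq).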