Re: HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

Jiri Kanicky <jirik@xxxxxxxxxx> · Sun, 28 Dec 2014 17:41:38 +1100

Hi Christian,

Thank you for the valuable info. As I will use this cluster mainly at
home for my data and for testing (with backups in place), I will
continue to use BTRFS. In production, I would go with XFS as
recommended. ZFS, perhaps, once it is officially supported.

BTW, I fixed the HEALTH of my cluster:
1. I set "ceph osd pool set rbd size 2"
2. I set "ceph osd pool set rbd pg_num 256" and "ceph osd pool set rbd 
pgp_num 256"
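The steps above can be sketched as a small shell helper. This is only an illustration: `fix_pool` and the `CEPH` variable are hypothetical names, not part of Ceph, and the script defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Dry run by default: CEPH expands to "echo ceph", so the commands are only
# printed. Run with CEPH=ceph to execute them against a real cluster.
CEPH=${CEPH:-echo ceph}

# fix_pool POOL SIZE PG_COUNT (hypothetical helper, not part of Ceph)
fix_pool() {
    pool=$1
    size=$2
    pgs=$3
    # Set the replica count for the pool.
    $CEPH osd pool set "$pool" size "$size"
    # Raise pg_num first; pgp_num cannot be set higher than pg_num.
    $CEPH osd pool set "$pool" pg_num "$pgs"
    $CEPH osd pool set "$pool" pgp_num "$pgs"
}

fix_pool rbd 2 256
```

With the default dry run this prints the same three `ceph osd pool set` commands listed in steps 1 and 2.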

5 pgs remained stuck unclean (stuck unclean since forever, current state
active, last acting). I fixed this by restarting ceph -a; I think the
OSD restart is what cleared them. I guess there might be a more elegant
solution, but I was not able to figure it out. I tried "pg repair", but
that didn't do the trick.

Anyway, it seems to be healthy now :).
cephadmin@ceph1:~$ sudo ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_OK
     monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 10, quorum 0,1 ceph1,ceph2
     osdmap e59: 4 osds: 4 up, 4 in
      pgmap v179: 256 pgs, 1 pools, 0 bytes data, 0 objects
            16924 kB used, 11154 GB / 11158 GB avail
                 256 active+clean

Thanks for the help!
Jiri

On 28/12/2014 16:59, Christian Balzer wrote:
Hello Jiri,

On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote:

Hi Christian.

Thank you for your comments again. Very helpful.

I will try to fix the current pool and see how it goes. It's good to
learn some troubleshooting skills.

Indeed, knowing what to do when things break is where it's at.

Regarding the BTRFS vs XFS, not sure if the documentation is old. My
decision was based on this:

http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/

It's dated for sure, and a bit of wishful thinking on behalf of the Ceph
developers, who understandably didn't want to re-invent the wheel inside
Ceph when the underlying file system could provide it (checksums,
snapshots, etc.).

ZFS has all the features BTRFS is aspiring to (and is much better
tested), and if kept below 80% utilization it doesn't fragment itself to
death.

At the end of that page they mention deduplication, which of course (as I
wrote recently in the "use ZFS for OSDs" thread) is unlikely to do
anything worthwhile at all.

Simply put, some things _need_ to be done in Ceph to work properly and
can't be delegated to the underlying FS or other storage backend.

Christian

Note

We currently recommend XFS for production deployments. We
recommend btrfs for testing, development, and any non-critical
deployments. We believe that btrfs has the correct feature set
and roadmap to serve Ceph in the long-term, but XFS and ext4 provide the
necessary stability for today’s deployments. btrfs development is
proceeding rapidly: users should be comfortable installing the latest
released upstream kernels and be able to track development activity for
critical bug fixes.

Thanks
Jiri

On 28/12/2014 16:01, Christian Balzer wrote:
Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 jirik@xxxxxxxxxx wrote:

Hi Christian.

Thank you for your suggestions.

I will set the "osd pool default size" to 2 as you recommended. As
mentioned the documentation is talking about OSDs, not nodes, so that
must have confused me.

Note that changing this will only affect new pools of course. So to
sort out your current state either start over with this value set
before creating/starting anything or reduce the current size (ceph osd
pool set <poolname> size).

Have a look at the crushmap example or even better your own, current
one and you will see where by default the host is the failure domain.
Which of course makes a lot of sense.
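One way to see the failure domain is to look at the choose steps of the rules in a decompiled CRUSH map. The sketch below is a hypothetical helper (`crush_failure_domains` is not a Ceph command), and the sample rule text is only an illustration of the usual shape; your own map may differ.

```shell
#!/bin/sh
# crush_failure_domains: hypothetical helper that reads a decompiled CRUSH map
# on stdin and prints the bucket type named in each choose/chooseleaf step.
crush_failure_domains() {
    awk '$1 == "step" && ($2 == "choose" || $2 == "chooseleaf") { print $NF }'
}

# Illustrative rule text (an assumption, not taken from this cluster).
sample_rule='rule replicated_ruleset {
	ruleset 0
	type replicated
	step take default
	step chooseleaf firstn 0 type host
	step emit
}'

printf '%s\n' "$sample_rule" | crush_failure_domains   # prints "host"
```

On a live cluster the input would come from `ceph osd getcrushmap -o crush.bin` followed by `crushtool -d crush.bin -o crush.txt`, with `crush.txt` fed to the function on stdin.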

Regarding BTRFS, I thought that btrfs was the better option for the
future, providing more features. I know that XFS might be more stable,
but again my impression was that btrfs is the focus for future
development. Is that correct?

I'm not a developer, but if you scour the ML archives you will find a
number of threads about BTRFS (and ZFS).
The biggest issues with BTRFS are not just stability but also the fact
that it degrades rather quickly (fragmentation) due to the COW nature
of it and less smarts than ZFS in that area.
So development on the Ceph side is not the issue per se.

IMHO BTRFS looks more and more stillborn and with regard to Ceph ZFS
might become the better choice (in the future), with KV store backends
being an alternative for some use cases (also far from production
ready at this time).

Regards,

Christian
You are right with the round up. I forgot about that.

Thanks for your help. Much appreciated.
Jiri

----- Reply message -----
From: "Christian Balzer" <chibi@xxxxxxx>
To: <ceph-users@xxxxxxxx>
Cc: "Jiri Kanicky" <jirik@xxxxxxxxxx>
Subject: HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs
stuck unclean; 29 pgs stuck undersized;
Date: Sun, Dec 28, 2014 03:29

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:

Hi,

I just built my Ceph cluster but am having problems with the health of
the cluster.

You're not telling us the version, but it's clearly 0.87 or beyond.

Here are a few details:
- I followed the ceph documentation.
Outdated, unfortunately.

- I used btrfs filesystem for all OSDs
Big mistake number 1, do some research (google, ML archives).
Though not related to your problems.

- I did not set "osd pool default size = 2 " as I thought that if I
have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
was right.
Big mistake (assumption) number 2: with the default CRUSH rule,
replication is determined by hosts, not OSDs, so each replica goes to a
different host. That's your main issue here.
Either set it to 2 or use 3 hosts.
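The host-versus-OSD point can be checked against the `ceph osd tree` output posted in this thread. The sketch below uses hypothetical helpers (`count_hosts`, `check_size`) and assumes the column layout shown here (id, weight, type, name); other Ceph releases may print different columns.

```shell
#!/bin/sh
# count_hosts: count "host" buckets in `ceph osd tree` output read from stdin.
# Assumes the column layout shown in this thread (id, weight, type, name).
count_hosts() {
    awk '$3 == "host" { n++ } END { print n + 0 }'
}

# check_size SIZE: compare a pool's replica count with the number of hosts.
check_size() {
    size=$1
    hosts=$(count_hosts)
    if [ "$size" -le "$hosts" ]; then
        echo "size $size fits on $hosts hosts"
    else
        echo "size $size needs more than $hosts hosts"
    fi
}

# The tree from this thread: 2 hosts with 2 OSDs each.
tree='-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   up      1
1       2.72                    osd.1   up      1
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1'

printf '%s\n' "$tree" | check_size 3   # size 3 needs more than 2 hosts
printf '%s\n' "$tree" | check_size 2   # size 2 fits on 2 hosts
```

On a live cluster, `sudo ceph osd tree | check_size 3` would run the same check, provided the output columns match.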

- I noticed that default pools "data,metadata" were not created. Only
"rbd" pool was created.
See outdated docs above. The majority of use cases is with RBD, so
since Giant the cephfs pools are not created by default.

- As it was complaining that the pg_num is too low, I increased the
pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num
133 > pgp_num 64".

Re-read the (in this case correct) documentation.
It clearly states to round up to nearest power of 2, in your case 256.
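The rounding rule can be sketched in a few lines of shell. The function names are hypothetical, and the factor of 100 PGs per OSD is an assumption inferred from the "400/3" figure quoted above for 4 OSDs.

```shell
#!/bin/sh
# next_pow2 N: smallest power of two that is >= N.
next_pow2() {
    n=$1
    p=1
    while [ "$p" -lt "$n" ]; do
        p=$((p * 2))
    done
    echo "$p"
}

# suggest_pg_num OSDS SIZE: (OSDS * 100) / SIZE, rounded up to a power of two.
# The factor of 100 PGs per OSD is an assumption inferred from "400/3" above.
suggest_pg_num() {
    osds=$1
    size=$2
    next_pow2 $(( osds * 100 / size ))
}

suggest_pg_num 4 3   # 400/3 = 133, rounded up to 256
suggest_pg_num 4 2   # 400/2 = 200, rounded up to 256
```

Either replica count gives 256 for this 4-OSD cluster, which matches the value mentioned above.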

Regards.

Christian

Would you give me a hint as to where I made the mistake? (I can remove
the OSDs and start over if needed.)

cephadmin@ceph1:/etc/ceph$ sudo ceph health
HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64
cephadmin@ceph1:/etc/ceph$ sudo ceph status
       cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
        health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64
        monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 8, quorum 0,1 ceph1,ceph2
        osdmap e42: 4 osds: 4 up, 4 in
         pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
               11704 kB used, 11154 GB / 11158 GB avail
                     29 active+undersized+degraded
                    104 active+remapped

cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   up      1
1       2.72                    osd.1   up      1
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1

cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
0 rbd,

cephadmin@ceph1:/etc/ceph$ cat ceph.conf
[global]
fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
public_network = 192.168.30.0/24
cluster_network = 10.1.1.0/24
mon_initial_members = ceph1, ceph2
mon_host = 192.168.30.21,192.168.30.22
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

Thank you
Jiri

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com