Re: Full OSD with 29% free

Bryan Stillwell <bstillwell@xxxxxxxxxxxxxxx> · Thu, 31 Oct 2013 11:36:48 -0600

Shain,

I investigated the segfault a little more since I sent this message
and found this email thread:

http://oss.sgi.com/archives/xfs/2012-06/msg00066.html

After reading that I did the following:

[root@den2ceph001 ~]# xfs_db -r "-c freesp -s" /dev/sdb1
Segmentation fault (core dumped)
[root@den2ceph001 ~]# service ceph stop osd.0
=== osd.0 ===
Stopping Ceph osd.0 on den2ceph001...kill 2407...kill 2407...done
[root@den2ceph001 ~]# umount /dev/sdb1
[root@den2ceph001 ~]# xfs_db -r "-c freesp -s" /dev/sdb1
   from      to extents  blocks    pct
      1       1   44510   44510   0.05
      2       3   60341  142274   0.16
      4       7   68836  355735   0.39
      8      15  274122 3212122   3.50
     16      31 1429274 37611619  41.02
     32      63   43225 1945740   2.12
     64     127   39480 3585579   3.91
    128     255   36046 6544005   7.14
    256     511   30946 10899979  11.89
    512    1023   14119 9907129  10.80
   1024    2047    5727 7998938   8.72
   2048    4095    2647 6811258   7.43
   4096    8191     362 1940622   2.12
   8192   16383      59  603690   0.66
  16384   32767       5   90464   0.10
total free extents 2049699
total free blocks 91693664
average free extent size 44.7352

That gives me a little more confidence in using 2K block sizes now.  :)
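
For anyone who wants to compare histograms like the one above across
OSDs, here is a rough Python sketch that parses the freesp rows and
summarizes them. The function names (parse_freesp, summarize) and the
16-block threshold for a "large" extent are my own illustrative
choices, not anything provided by xfs_db or Ceph.

```python
# Rows copied from the freesp output above: from, to, extents, blocks, pct.
FREESP_OUTPUT = """\
   from      to extents  blocks    pct
      1       1   44510   44510   0.05
      2       3   60341  142274   0.16
      4       7   68836  355735   0.39
      8      15  274122 3212122   3.50
     16      31 1429274 37611619  41.02
     32      63   43225 1945740   2.12
     64     127   39480 3585579   3.91
    128     255   36046 6544005   7.14
    256     511   30946 10899979  11.89
    512    1023   14119 9907129  10.80
   1024    2047    5727 7998938   8.72
   2048    4095    2647 6811258   7.43
   4096    8191     362 1940622   2.12
   8192   16383      59  603690   0.66
  16384   32767       5   90464   0.10
total free extents 2049699
total free blocks 91693664
average free extent size 44.7352
"""


def parse_freesp(text):
    """Return (from, to, extents, blocks) tuples for each histogram row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Histogram rows have exactly five numeric fields; this skips the
        # header line and the three summary lines at the bottom.
        if len(fields) == 5 and all(
            f.replace(".", "", 1).isdigit() for f in fields
        ):
            rows.append(tuple(int(f) for f in fields[:4]))
    return rows


def summarize(rows, min_extent=16):
    """Summarize free space; min_extent is an arbitrary illustrative cutoff."""
    extents = sum(r[2] for r in rows)
    blocks = sum(r[3] for r in rows)
    # Free blocks that live in buckets starting at min_extent blocks or more.
    large = sum(r[3] for r in rows if r[0] >= min_extent)
    return {
        "extents": extents,
        "blocks": blocks,
        "avg_extent": blocks / extents,
        "large_fraction": large / blocks,
    }


summary = summarize(parse_freesp(FREESP_OUTPUT))
print(summary)
```

The extent and block totals it computes should agree with the "total
free extents" and "total free blocks" lines that xfs_db prints, which
is a quick sanity check that the rows were parsed correctly.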

Bryan

On Thu, Oct 31, 2013 at 11:02 AM, Bryan Stillwell
<bstillwell@xxxxxxxxxxxxxxx> wrote:
> Shain,
>
> After getting the segfaults when running 'xfs_db -r "-c freesp -s"' on
> a couple partitions, I'm concerned that 2K block sizes aren't nearly
> as well tested as 4K block sizes.  This could just be a problem with
> RHEL/CentOS 6.4 though, so if you're using a newer kernel the problem
> might already be fixed.  There also appears to be more overhead with
> 2K block sizes which I believe manifests as high CPU usage by the
> xfsalloc processes.  However, my cluster has been running in a clean
> state for over 24 hours and none of the scrubs have found a problem
> yet.
>
> According to 'ceph -s' my cluster has the following stats:
>
>      osdmap e16882: 40 osds: 40 up, 40 in
>       pgmap v3520420: 2808 pgs, 13 pools, 5694 GB data, 72705 kobjects
>             18095 GB used, 13499 GB / 31595 GB avail
>
> That's about 78 kB per object on average (5694 GB / 72705 kobjects), so
> if your files aren't that small I would stay with 4K block sizes to
> avoid headaches.
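
The per-object average can be reproduced with a few lines of Python.
This is only a sketch: whether 'ceph -s' means 10^9 or 2^30 bytes by
"GB", and 1000 objects by "kobjects", is an assumption here, so both
byte interpretations are shown. The function name avg_object_bytes is
made up for the example.

```python
# Figures taken from the 'ceph -s' output quoted above.
DATA_GB = 5694      # "5694 GB data"
KOBJECTS = 72705    # "72705 kobjects"


def avg_object_bytes(data_gb, kobjects, bytes_per_gb=10**9):
    """Average object size in bytes, assuming 1 kobject = 1000 objects."""
    return data_gb * bytes_per_gb / (kobjects * 1000)


# Assumption 1: GB means 10^9 bytes.
decimal_avg = avg_object_bytes(DATA_GB, KOBJECTS)
# Assumption 2: GB means 2^30 bytes.
binary_avg = avg_object_bytes(DATA_GB, KOBJECTS, bytes_per_gb=2**30)

print(f"decimal GB: {decimal_avg / 1000:.1f} kB per object")
print(f"binary GB:  {binary_avg / 1024:.1f} KiB per object")
```

Either way the average lands around 80 KB, so the conclusion about
small objects does not depend on which unit convention applies.
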
>
> Bryan
>
>
> On Thu, Oct 31, 2013 at 6:43 AM, Shain Miley <SMiley@xxxxxxx> wrote:
>>
>> Bryan,
>>
>> We are setting up a cluster using xfs and have been a bit concerned about running into similar issues to the ones you described below.
>>
>> I am just wondering if you came across any potential downsides to using a 2K block size with xfs on your OSDs.
>>
>> Thanks,
>>
>> Shain
>>
>> Shain Miley | Manager of Systems and Infrastructure, Digital Media | smiley@xxxxxxx | 202.513.3649
>>
>> ________________________________________
>> From: ceph-users-bounces@xxxxxxxxxxxxxx [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Bryan Stillwell [bstillwell@xxxxxxxxxxxxxxx]
>> Sent: Wednesday, October 30, 2013 2:18 PM
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  Full OSD with 29% free
>>
>> I wanted to report back on this since I've made some progress on
>> fixing this issue.
>>
>> After converting every OSD on a single server to use a 2K block size,
>> I've been able to cross 90% utilization without running into the 'No
>> space left on device' problem.  They're currently between 51% and 75%,
>> but I hit 90% over the weekend after a couple OSDs died during
>> recovery.
>>
>> This conversion was pretty rough though, with OSDs randomly dying
>> multiple times during the process (logs point at suicide timeouts).
>> When looking at top I would frequently see xfsalloc pegging multiple
>> cores, so I wonder if that has something to do with it.  I also had
>> the 'xfs_db -r "-c freesp -s"' command segfault on me a few times
>> which was fixed by running xfs_repair on those partitions.  This has
>> me wondering how well XFS is tested with non-default block sizes on
>> CentOS 6.4...
>>
>> Anyways, after about a week I was finally able to get the cluster to
>> fully recover today.  Now I need to repeat the process on 7 more
>> servers before I can finish populating my cluster...
>>
>> In case anyone is wondering how I switched to a 2K block size, this is
>> what I added to my ceph.conf:
>>
>> [osd]
>> osd_mount_options_xfs = "rw,noatime,inode64"
>> osd_mkfs_options_xfs = "-f -b size=2048"
>>
>>
>> The cluster is currently running the 0.71 release.
>>
>> Bryan
>>
>> On Mon, Oct 21, 2013 at 2:39 PM, Bryan Stillwell
>> <bstillwell@xxxxxxxxxxxxxxx> wrote:
>> > So I'm running into this issue again and after spending a bit of time
>> > reading the XFS mailing lists, I believe the free space is too
>> > fragmented:
>> >
>> > [root@den2ceph001 ceph-0]# xfs_db -r "-c freesp -s" /dev/sdb1
>> >    from      to extents  blocks    pct
>> >       1       1   85773   85773   0.24
>> >       2       3  176891  444356   1.27
>> >       4       7  430854 2410929   6.87
>> >       8      15 2327527 30337352  86.46
>> >      16      31   75871 1809577   5.16
>> > total free extents 3096916
>> > total free blocks 35087987
>> > average free extent size 11.33
>> >
>> >
>> > Compared to a drive which isn't reporting 'No space left on device':
>> >
>> > [root@den2ceph008 ~]# xfs_db -r "-c freesp -s" /dev/sdc1
>> >    from      to extents  blocks    pct
>> >       1       1  133148  133148   0.15
>> >       2       3  320737  808506   0.94
>> >       4       7  809748 4532573   5.27
>> >       8      15 4536681 59305608  68.96
>> >      16      31   31531  751285   0.87
>> >      32      63     364   16367   0.02
>> >      64     127      90    9174   0.01
>> >     128     255       9    2072   0.00
>> >     256     511      48   18018   0.02
>> >     512    1023     128  102422   0.12
>> >    1024    2047     290  451017   0.52
>> >    2048    4095     538 1649408   1.92
>> >    4096    8191     851 5066070   5.89
>> >    8192   16383     746 8436029   9.81
>> >   16384   32767     194 4042573   4.70
>> >   32768   65535      15  614301   0.71
>> >   65536  131071       1   66630   0.08
>> > total free extents 5835119
>> > total free blocks 86005201
>> > average free extent size 14.7392
>> >
>> >
>> > What I'm wondering is if reducing the block size from 4K to 2K (or 1K)
>> > would help?  I'm pretty sure this would require re-running
>> > mkfs.xfs on every OSD to fix if that's the case...
>> >
>> > Thanks,
>> > Bryan
>> >
>> >
>> > On Mon, Oct 14, 2013 at 5:28 PM, Bryan Stillwell
>> > <bstillwell@xxxxxxxxxxxxxxx> wrote:
>> >>
>> >> The filesystem isn't as full now, but the fragmentation is pretty low:
>> >>
>> >> [root@den2ceph001 ~]# df /dev/sdc1
>> >> Filesystem           1K-blocks      Used Available Use% Mounted on
>> >> /dev/sdc1            486562672 270845628 215717044  56% /var/lib/ceph/osd/ceph-1
>> >> [root@den2ceph001 ~]# xfs_db -c frag -r /dev/sdc1
>> >> actual 3481543, ideal 3447443, fragmentation factor 0.98%
>> >>
>> >> Bryan
>> >>
>> >> On Mon, Oct 14, 2013 at 4:35 PM, Michael Lowe <j.michael.lowe@xxxxxxxxx> wrote:
>> >> >
>> >> > How fragmented is that file system?
>> >> >
>> >> > > On Oct 14, 2013, at 5:44 PM, Bryan Stillwell <bstillwell@xxxxxxxxxxxxxxx> wrote:
>> >> > >
>> >> > > This appears to be more of an XFS issue than a ceph issue, but I've
>> >> > > run into a problem where some of my OSDs failed because the filesystem
>> >> > > was reported as full even though there was 29% free:
>> >> > >
>> >> > > [root@den2ceph001 ceph-1]# touch blah
>> >> > > touch: cannot touch `blah': No space left on device
>> >> > > [root@den2ceph001 ceph-1]# df .
>> >> > > Filesystem           1K-blocks      Used Available Use% Mounted on
>> >> > > /dev/sdc1            486562672 342139340 144423332  71% /var/lib/ceph/osd/ceph-1
>> >> > > [root@den2ceph001 ceph-1]# df -i .
>> >> > > Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>> >> > > /dev/sdc1            60849984 4097408 56752576    7% /var/lib/ceph/osd/ceph-1
>> >> > > [root@den2ceph001 ceph-1]#
>> >> > >
>> >> > > I've tried remounting the filesystem with the inode64 option like a
>> >> > > few people recommended, but that didn't help (probably because it
>> >> > > doesn't appear to be running out of inodes).
>> >> > >
>> >> > > This happened while I was on vacation and I'm pretty sure it was
>> >> > > caused by another OSD failing on the same node.  I've been able to
>> >> > > recover from the situation by bringing the failed OSD back online, but
>> >> > > it's only a matter of time until I'll be running into this issue again
>> >> > > since my cluster is still being populated.
>> >> > >
>> >> > > Any ideas on things I can try the next time this happens?
>> >> > >
>> >> > > Thanks,
>> >> > > Bryan