Sam? This looks to be the HashIndex::SUBDIR_ATTR, but I don't know
exactly what it's for, nor why it would be getting constantly created
and removed on a pure read workload...

On Thu, May 7, 2015 at 2:55 PM, Erik Logtenberg <erik@xxxxxxxxxxxxx> wrote:
> It does sound contradictory: why would read operations in cephfs result
> in writes to disk? But they do. I upgraded to Hammer last week and I am
> still seeing this.
>
> The setup is as follows:
>
> EC pool on hdd's for data
> replicated pool on ssd's for data cache
> replicated pool on ssd's for metadata
>
> Now whenever I start doing heavy reads on cephfs, I see intense bursts
> of write operations on the hdd's. The reads I'm doing are things like
> reading a large file (streaming a video), or running a big rsync job
> with --dry-run (so it only checks metadata). No clue why that would
> have any effect on the hdd's, but it does.
>
> To figure out what's going on, I tried lsof, atop and iotop, but those
> tools don't provide the necessary information. In lsof I just see a
> whole bunch of open files at any given time, and that doesn't change
> much during these tests. In atop and iotop I can clearly see that the
> hdd's are doing a lot of writes while I'm reading in cephfs, but those
> tools can't tell me what the writes actually are.
>
> So I tried strace, which can trace file operations and attach to
> running processes:
>
> # strace -f -e trace=file -p 5076
>
> This gave me an idea of what was going on. 5076 is the process id of
> the osd for one of the hdd's. I saw mostly stat's and open's, but those
> are all reads, not writes. Of course btrfs can cause writes when doing
> reads (atime), but I have the osd mounted with noatime. The only write
> operations that I saw a lot of are these:
>
> [pid 5350] getxattr("/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3", "user.cephos.phash.contents", "\1Q\0\0\0\0\0\0\0\0\0\0\0\4\0\0", 1024) = 17
> [pid 5350] setxattr("/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3", "user.cephos.phash.contents", "\1R\0\0\0\0\0\0\0\0\0\0\0\4\0\0", 17, 0) = 0
> [pid 5350] removexattr("/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3", "user.cephos.phash.contents@1") = -1 ENODATA (No data available)
>
> So it appears that the osd's aren't writing actual data to disk, but
> metadata in the form of xattr's. Can anyone explain what this setting
> and removing of xattr's could be for?
>
> Kind regards,
>
> Erik.
>
>
> On 03/16/2015 10:44 PM, Gregory Farnum wrote:
>> The information you're giving sounds a little contradictory, but my
>> guess is that you're seeing the impact of object promotion and
>> flushing. You can sample the operations the OSDs are doing at any
>> given time by running the ops_in_progress (or similar, I forget the
>> exact phrasing) command on the OSD admin socket. I'm not sure if
>> "rados df" is going to report cache movement activity or not.
>>
>> That, though, would mostly be written to the SSDs, not the hard drives,
>> although the hard drives could still get metadata updates written when
>> objects are flushed. What data exactly are you seeing that leads you
>> to believe writes are happening against these drives? What is the
>> exact CephFS and cache pool configuration?
>> -Greg
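
(As an aside, the admin socket command Greg is half-remembering is, I
believe, dump_ops_in_flight; the exact set of commands varies a bit
between releases, so check "ceph daemon osd.N help" on your own build.
A rough sketch, using osd.10 from the trace above purely as an example:

  # ceph daemon osd.10 dump_ops_in_flight   # client/replication ops currently in progress
  # ceph daemon osd.10 dump_historic_ops    # a short history of recently completed ops

The second one is handy when the write bursts are too brief to catch
live. If the ceph CLI can't find the socket, the same commands can be
pointed at it directly, e.g. "ceph --admin-daemon
/var/run/ceph/ceph-osd.10.asok dump_ops_in_flight", assuming the default
socket path.)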
>>
>> On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg <erik@xxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> I forgot to mention: while I am seeing these writes in iotop and
>>> /proc/diskstats for the hdd's, I am -not- seeing any writes in
>>> "rados df" for the pool residing on these disks. There is only one
>>> pool active on the hdd's, and according to rados df it is getting
>>> zero writes when I'm just reading big files from cephfs.
>>>
>>> So apparently the osd's are doing some non-trivial amount of writing
>>> on their own behalf. What could it be?
>>>
>>> Thanks,
>>>
>>> Erik.
>>>
>>>
>>> On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
>>>> Hi,
>>>>
>>>> I am getting relatively bad performance from cephfs. I use a
>>>> replicated cache pool on ssd in front of an erasure coded pool on
>>>> rotating media.
>>>>
>>>> When reading big files (streaming video), I see a lot of disk i/o,
>>>> especially writes. I have no clue what could cause these writes. The
>>>> writes are going to the hdd's and they stop when I stop reading.
>>>>
>>>> I mounted everything with noatime and nodiratime, so it shouldn't be
>>>> that. On a related note, the CephFS metadata is stored on ssd too, so
>>>> metadata-related changes shouldn't hit the hdd's anyway, I think.
>>>>
>>>> Any thoughts? How can I get more information about what ceph is
>>>> doing? Using iotop I only see that the osd processes are busy, but
>>>> it doesn't give many hints as to what they are doing.
>>>>
>>>> Thanks,
>>>>
>>>> Erik.
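
(For anyone who wants to reproduce Erik's observation further up the
thread: narrowing the trace to the xattr write syscalls and dumping the
attribute by hand makes the churn much easier to watch. A rough sketch,
reusing the OSD pid and PG directory path from Erik's trace; substitute
your own:

  # only the xattr write calls, with timestamps
  strace -f -tt -e trace=setxattr,fsetxattr,removexattr,fremovexattr -p 5076

  # dump the current value of the attribute on one PG subdirectory
  getfattr -n user.cephos.phash.contents -e hex \
      /var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3

Per Greg's note at the top, this attribute would be the
HashIndex::SUBDIR_ATTR bookkeeping that FileStore keeps per directory,
so this only narrows down what is being rewritten, not yet why.)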