Hi Greg,
     Thanks. Can you explain more about "Ceph *does* export locations so
the follow-up jobs can be scheduled appropriately"?

Thanks,
Johnu

On 9/8/14, 12:51 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:

>On Thu, Sep 4, 2014 at 12:16 AM, Johnu George (johnugeo)
><johnugeo@xxxxxxxxx> wrote:
>> Hi All,
>>          I was reading more on Hadoop over Ceph. I heard from Noah that
>> tuning of Hadoop on Ceph is going on. I am just curious to know if there
>> is any reason to keep the default object size as 64MB. Is it because it
>> becomes difficult to encode getBlockLocations if blocks are divided into
>> objects, and to choose the best location for tasks if no node in the
>> system has a complete block?
>
>We used 64MB because it's the HDFS default and in some *very* stupid
>tests it seemed to be about the fastest. You could certainly make it
>smaller if you wanted, and it would probably work to multiply it by
>2-4x, but then you're using bigger objects than most people do.
>
>> I see that Ceph doesn't place objects considering the client location or
>> the distance between the client and the OSDs where data is stored
>> (data locality), while data locality is the key idea behind HDFS block
>> placement and retrieval for maximum throughput. So, how does Ceph plan
>> to perform better than HDFS, given that Ceph relies on random placement
>> using hashing unlike HDFS block placement? Can someone also point out
>> some performance results comparing Ceph's random placement vs HDFS's
>> locality-aware placement?
>
>I don't think we have any serious performance results; there hasn't
>been enough focus on productizing it for that kind of work.
>Anecdotally I've seen people on social media claim that it's as fast
>or even many times faster than HDFS (I suspect if it's many times
>faster they had a misconfiguration somewhere in HDFS, though!).
>In any case, Ceph has two plans for being faster than HDFS:
>1) big users indicate that always writing locally is often a mistake
>and it tends to overfill certain nodes within your cluster. Plus,
>networks are much faster now so it doesn't cost as much to write over
>it, and Ceph *does* export locations so the follow-up jobs can be
>scheduled appropriately.
>
>>
>> Also, Sage wrote about a way to specify a node to be primary for
>> Hadoop-like environments.
>> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/1548) Is
>> this through the primary affinity configuration?
>
>That mechanism ("preferred" PGs) is dead. Primary affinity is a
>completely different thing.
>
>
>On Thu, Sep 4, 2014 at 8:59 AM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>> QFS, unlike Ceph, places the erasure coding logic inside the client,
>> so it's not an apples-to-apples comparison. But I think you get my
>> point, and it would be possible to implement a rich Ceph
>> (filesystem/hadoop) client like this as well.
>>
>> In summary, if Hadoop on Ceph is a major priority, I think it would be
>> best to "borrow" the good ideas from QFS and implement them in the
>> Hadoop Ceph filesystem and Ceph itself (letting a smart client get
>> chunks directly, write chunks directly). I don't doubt that it's a lot
>> of work, but the results might be worth it in terms of the performance
>> you get for the cost.
>
>Unfortunately, implementing CephFS on top of RADOS' EC pools is going
>to be a major project which we haven't done anything to scope out yet,
>so it's going to be a while before that's really an option. But it is
>a "real" filesystem, so we still have that going for us. ;)
>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com
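
To make the "Ceph *does* export locations" point above concrete, here is a
minimal sketch (not from the thread) of how a Hadoop client or job scheduler
could ask a Ceph-backed Hadoop FileSystem where a file's blocks live, using
only the standard org.apache.hadoop.fs API. The ceph:// default filesystem
and the assumption that the CephFS Hadoop bindings answer
getFileBlockLocations() with OSD hostnames are illustrative assumptions, not
something the thread confirms.

    // Hypothetical sketch: query block locations and the reported block
    // (object) size from whatever FileSystem fs.defaultFS points at,
    // which is assumed here to be a CephFS-backed implementation.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes core-site.xml sets fs.defaultFS to a ceph:// URI
            // served by the CephFS Hadoop plugin.
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path(args[0]);
            FileStatus status = fs.getFileStatus(file);

            // Ties back to the 64MB discussion: the block size the
            // FileSystem reports for this file.
            System.out.printf("reported block size: %d%n",
                status.getBlockSize());

            // The locations the FileSystem exports for each block; a
            // Ceph-backed implementation can fill these with the hosts
            // holding the underlying objects.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d len=%d hosts=%s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()));
            }
        }
    }

MapReduce's input-split machinery uses this same getFileBlockLocations() call
to attach host hints to splits, which is why a filesystem that exports
locations lets follow-up jobs be scheduled near their data even without
HDFS-style always-local writes.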