Spark and Hadoop Are Friends, Not Foes

June was an exciting month for Spark. At the Hadoop Summit in San Jose, Spark was a frequent topic of conversation and the subject of many talks. On June 15, IBM also announced that it would invest heavily in Spark-related technology.

The announcement helped kick off the Spark Summit in San Francisco, where attendees could see a growing number of engineers learning Spark and a growing number of companies experimenting with and adopting it.

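Investment in Spark and its adoption form a virtuous cycle that is rapidly advancing the maturity and capabilities of this important technology, to the benefit of the entire big data community. However, the growing attention paid to Spark has also produced a strange and stubborn misconception: that Spark can replace Hadoop rather than complement it. The misconception shows up in headlines such as "Companies Move On From Big Data Technology Hadoop."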

As a long-time big data practitioner and now the CEO of a big-data-as-a-service company, I would like to address this misconception and offer some clarity.

Spark and Hadoop work well together.

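Hadoop is increasingly the enterprise platform of choice for processing big data. Spark is an in-memory processing solution that runs on top of Hadoop. The largest Hadoop users, including eBay and Yahoo, run Spark inside their Hadoop clusters. Cloudera and Hortonworks include Spark in their Hadoop distributions. Our customers at Altiscale have been running Spark on Hadoop since we first launched.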

Positioning Spark against Hadoop is like saying your new electric car is so cool that it does not need electricity at all. In reality, electric cars will drive demand for more electricity.

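Why the confusion? Today's Hadoop consists of two main parts. The first is a large-scale storage system called the Hadoop Distributed File System (HDFS), which stores data efficiently and at low cost and is optimized for the volume, variety, and velocity of big data. The second is a computation engine called YARN, which can run massively parallel programs on the data stored in HDFS.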

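YARN can host any number of programming frameworks. The original framework was MapReduce, invented at Google to help process massive web crawl data. Spark is another such framework, and there is also a newer one called Tez. When people talk about a "showdown" between Spark and Hadoop, what they really mean is that programmers now prefer Spark to the older MapReduce framework.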
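
To make the idea of a programming framework concrete, here is a minimal single-machine sketch of the MapReduce model in plain Python. It is an illustration only: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and are not Hadoop or Spark APIs, and a real framework would distribute each phase across the machines of a cluster rather than run it in one process.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Chain the three phases into a complete word-count job."""
    return reduce_phase(shuffle_phase(map_phase(documents)))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    sample = ["Spark runs on Hadoop", "Hadoop stores data in HDFS"]
    print(word_count(sample))
```

The programmer supplies only the map and reduce logic; the grouping step in between is what the framework takes care of. Frameworks such as Spark offer a different programming interface for the same kind of job while running on the same cluster.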

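However, MapReduce should not be equated with Hadoop. MapReduce is just one of many ways a Hadoop cluster can process data, and Spark can be used in its place. Business analysts avoid both frameworks, which are low-level toolkits meant for programmers. Instead, they use high-level languages such as SQL to work with Hadoop more easily.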
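
As a rough illustration of why a high-level language is more approachable, the sketch below expresses a simple word count as a single declarative SQL query. It rests on a loud assumption: SQLite from the Python standard library is used purely as a stand-in so the example can run anywhere, and it is not a Hadoop component. The table name `words` and the function `word_count_sql` are hypothetical.

```python
import sqlite3


def word_count_sql(words):
    """Count occurrences of each word using one GROUP BY query."""
    # An in-memory SQLite database stands in for a SQL engine on a cluster.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE words (word TEXT)")
        conn.executemany(
            "INSERT INTO words (word) VALUES (?)",
            [(w,) for w in words],
        )
        # The analyst states what result is wanted, not how to compute it.
        rows = conn.execute(
            "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    print(word_count_sql(["hadoop", "spark", "hadoop"]))
```

The query describes the desired result and leaves the execution strategy to the engine, which is the accessibility argument made above for analysts who are not programmers.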

Over the past four years, Hadoop-based big data technology has seen a dizzying amount of innovation. Hadoop has evolved from batch SQL to interactive queries, and from a single framework (MapReduce) to multiple frameworks (such as MapReduce and Spark).

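HDFS performance and security have also improved enormously, and numerous tools have emerged on top of these technologies, such as Datameer, H2O, and Tableau. These tools greatly broaden the user base of big data infrastructure, making it usable by data scientists and business users as well.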

Spark will not replace Hadoop. Rather, Hadoop is the foundation for Spark. As organizations look for the broadest and most robust platform for turning their data assets into actionable business insight, their adoption of both Hadoop and Spark will continue to grow.

English original:

June was an exciting month for Apache Spark. At Hadoop Summit San Jose, it was a frequent topic of conversation, as well as the subject of many session presentations. On June 15, IBM announced plans to make a massive investment in Spark-related technology.

This announcement helped kick off the Spark Summit in San Francisco, where one could witness the increasing number of engineers learning about Spark — and the increasing number of companies experimenting with and adopting Spark.

The virtuous cycle of Spark investment and adoption is rapidly driving the maturity and capabilities of this important technology, to the benefit of the entire big data community. However, the growing attention directed toward Spark has also given rise to a strange and stubborn misconception: that Spark is somehow an alternative to Apache Hadoop, instead of a complement to it. This misconception can be seen in headlines like “Newer Software Aims to Crunch Hadoop’s Numbers” and “Companies Move On From Big Data Technology Hadoop.”

As a long-time big data practitioner, an early advocate for investment in Hadoop by Yahoo! and now CEO of a company that provides big data as a service for the enterprise, I’d like to bring some perspective and clarity to this conversation.

Spark and Hadoop work together.

Hadoop is increasingly the enterprise platform of choice for big data. Spark is an in-memory processing solution that runs on top of Hadoop. The largest users of Hadoop — including eBay and Yahoo! — both run Spark inside their Hadoop clusters. Cloudera and Hortonworks ship Spark as part of their Hadoop distributions. And our own customers here at Altiscale have been using Spark on Hadoop since we launched.

To position Spark in opposition to Hadoop is like saying that your new electric car is so cool that you won’t need electricity anymore. If anything, electric cars will drive demand for more electricity.

Why the confusion? Modern-day Hadoop consists of two main components. The first is a large-scale storage system called the Hadoop Distributed File System (HDFS), which stores data in a low-cost, high-performance manner optimized for the volume, variety and velocity of big data. The second component is a computation engine called YARN, which can run massively parallel programs on top of the data stored in HDFS.

YARN can host any number of programming frameworks. The original such framework was MapReduce, invented at Google to help process massive web crawls. Spark is another such framework, as is another new one called Tez. When people talk about Spark “crushing” Hadoop, what they really mean is that programmers now prefer using Spark to the older MapReduce framework.

However, MapReduce should not be equated with Hadoop. MapReduce is just one of many ways to process your data in a Hadoop cluster. Spark can be used as an alternative. Looking more broadly, business analysts — a growing base of big data practitioners — avoid both of these frameworks, which are low-level toolkits meant for programmers. Instead, they use high-level languages like SQL that make Hadoop more accessible.

In the last four years, Hadoop-based big data technology has seen an unprecedented level of innovation. We’ve gone from batch SQL to interactive; from one framework (MapReduce) to multiple frameworks (e.g., MapReduce, Spark and many others).

We’ve seen enormous performance and security improvements in HDFS, and we’ve seen an explosion of tools that sit on top of all of this, such as Datameer, H2O and Tableau, that make all of this big data infrastructure usable by a far broader range of data scientists and business users.

Spark isn’t a challenger that’s going to replace Hadoop. Rather, Hadoop is a foundation that makes Spark possible. We expect to see increasing adoption of both as organizations seek the broadest and most robust platform possible for turning their data assets into actionable business insight.

Translation: 1thinc0, via TechCrunch


2015-07-15