A Study Guide to Open-Source Components in the Hadoop Ecosystem

Compute Engines

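Apache Hadoop core: the base components Hadoop runs on, chiefly HDFS
Apache MapReduce: a batch compute engine that runs on top of HDFS
Apache Hama: a distributed compute engine based on the Bulk Synchronous Parallel (BSP) model http://hama.apache.org/
Apache Spark: a newer, in-memory distributed compute engine that fills the same role as MapReduce https://spark.apache.org/
Apache Ignite: a distributed compute engine built on an in-memory data grid, donated by GridGain https://ignite.apache.org/
Apache Tez: a newer distributed compute engine that fills the same role as MapReduce https://tez.apache.org/
Apache VXQuery: a big data processing solution for XML data http://vxquery.apache.org/
Apache Giraph: an iterative graph processing system https://giraph.apache.org/
Apache Storm: a stream processing engine http://storm.apache.org/
Apache Samza: a distributed stream processing engine http://samza.apache.org/
Apache Apex: a platform for real-time stream analytics jobs https://apex.apache.org/
Apache Flink: a compute engine that handles both stream and batch processing http://flink.apache.org/
IBM Streams: IBM's stream processing engine http://www-03.ibm.com/software/products/zh/ibm-streams
Apache REEF: a framework for building applications on YARN, originally aimed at machine learning workloads, open-sourced by Microsoft http://reef.apache.org/
Apache Falcon: a platform for feed management and data processing http://falcon.apache.org/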
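
Several entries above are described by comparison with MapReduce, so a small illustration of that programming model may help. The following is a minimal single-process sketch in plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real engine would distribute each phase across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (word, 1) pair per word, as a mapper would.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group values by key; in a real engine the framework does this
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Combine all values seen for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop spark flink", "spark storm spark"]))
```

Engines such as Spark and Tez generalize this fixed map-then-reduce pipeline into longer chains of stages, which is one reason they are listed as successors to MapReduce.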

SQL on Hadoop

Apache Hive: a component for building data warehouses on Hadoop http://hive.apache.org/
Apache Drill: an interactive query tool geared toward ad hoc queries, based on Google's Dremel paper http://drill.apache.org/
Apache Impala: an interactive query tool geared toward ad hoc queries http://impala.apache.org/
Apache Pig: a scripting language for working with Hadoop; now largely in decline https://pig.apache.org/
Apache Kylin: an OLAP query engine for big data http://kylin.apache.org/
Apache Tajo: a data warehouse that runs on Hadoop http://tajo.apache.org/
Apache Calcite: a SQL parsing and query planning framework used by SQL-on-Hadoop systems, with support for OLAP and streaming queries http://calcite.apache.org/
Apache HAWQ: a SQL-on-Hadoop compute engine based on the MPP model http://hawq.incubator.apache.org/
Hortonworks STINGER.next: the SQL-on-Hadoop solution from Hortonworks http://zh.hortonworks.com/products/data-center/hdf/
Facebook Presto: a SQL query engine for data on Hadoop https://prestodb.io/

Ecosystem Tools

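Apache Mahout: a library of general-purpose machine learning algorithms http://mahout.apache.org/
Apache Avro: an efficient data serialization format http://avro.apache.org/
Apache ZooKeeper: a distributed coordination service providing unified naming, state synchronization, cluster management, and configuration management for distributed applications https://zookeeper.apache.org/
Apache Sentry: an authorization management tool https://sentry.apache.org/
Apache Parquet: a columnar storage format, driven by Cloudera http://parquet.apache.org
Apache ORC: a self-describing, type-aware columnar storage format http://orc.apache.org/
Apache Knox: a gateway that exposes Hadoop services through a REST API https://knox.apache.org/
Apache DirectMemory: a multi-layered cache system whose off-heap memory management supports large numbers of Java objects without hurting JVM garbage collector performance http://directmemory.apache.org/
Apache DataFu (Incubating): a collection of libraries for Hadoop http://datafu.incubator.apache.org/
Apache Crunch: a library that simplifies writing and running MapReduce jobs https://crunch.apache.org/
Apache MetaModel: a uniform access layer that lets Hadoop jobs work across different underlying data formats http://metamodel.apache.org/
Apache Solr: a distributed search platform http://lucene.apache.org/solr/
Apache Eagle: a security monitoring component for Hadoop clusters http://eagle.apache.org/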
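
Parquet and ORC are both listed above as columnar storage formats. As a rough illustration of what "columnar" means, the sketch below pivots row records into per-column lists in plain Python. It is a toy model of the layout only, not the Parquet or ORC API; the helper to_columns and the sample data are invented for this example, and it assumes every row has the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row carries the same keys (hypothetical sample data).
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "c", "clicks": 2},
]

# Column-oriented layout: each field's values are stored together.
columns = to_columns(rows)

# An aggregate over one field only needs to read that one column.
total_clicks = sum(columns["clicks"])

if __name__ == "__main__":
    print(columns)
    print(total_clicks)
```

Storing each column contiguously is what lets analytical queries skip the fields they do not need; real formats add encoding, compression, and metadata on top of this basic idea.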

Data Transfer

Apache Sqoop: an efficient bulk data import tool http://sqoop.apache.org/
Apache Chukwa: a big data transfer component http://chukwa.apache.org/
Apache Flume: a data transfer tool for getting raw data into HDFS http://flume.apache.org/
Fluentd: a data transfer component http://www.fluentd.org/
Scribe: a data transfer component https://www.scribesoft.com/
Hortonworks DataFlow: a data collection tool http://zh.hortonworks.com/products/data-center/hdf/

Cluster Resource Management

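Apache YARN: a unified resource management and allocation tool
Apache Airavata: a framework for composing, managing, executing, and monitoring distributed jobs, comparable to YARN http://airavata.apache.org/
Apache Helix: a cluster resource management framework, comparable to YARN http://helix.apache.org/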

Management Platforms

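Apache Bigtop: an integrated packaging and testing tool for Hadoop http://bigtop.apache.org/
Hue: a Hadoop management tool with a good UI that integrates essentially all of the big data components; it originated as the desktop in Cloudera's CDH http://gethue.com/
Apache Ambari: a Hadoop platform setup tool for provisioning, managing, and monitoring Hadoop clusters, contributed by Hortonworks https://ambari.apache.org/
Apache Zeppelin: an integrated UI for data analysis http://zeppelin.apache.org/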

Job Scheduling

Apache Oozie: a job scheduling tool for Hadoop http://oozie.apache.org/
Azkaban: another job scheduling tool, open-sourced by LinkedIn http://azkaban.github.io/azkaban/docs/latest/
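
Workflow schedulers of this kind typically model a job as a graph of tasks with dependencies and run each task only after its prerequisites finish. The sketch below shows that idea with Python's standard graphlib module (Python 3.9+). It is not Oozie or Azkaban configuration; the task names and the execution_order helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "daily_stats": {"clean_logs"},
    "user_sessions": {"clean_logs"},
    "build_report": {"daily_stats", "user_sessions"},
}


def execution_order(dependencies):
    # Return one valid order in which every task runs after its prerequisites.
    # Raises graphlib.CycleError if the dependencies contain a cycle.
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for task in execution_order(workflow):
        print("run", task)
```

Tasks with no dependency between them, such as daily_stats and user_sessions here, may appear in either order, and a real scheduler could run them in parallel.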

Databases

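Apache Phoenix: a SQL layer over HBase for big data storage and low-latency queries, aimed at OLTP and operational analytics http://phoenix.apache.org/
Apache CouchDB: a NoSQL database that uses JSON as its storage format, JavaScript as its query language, and MapReduce and HTTP as its API http://couchdb.apache.org/
Apache Kudu: a storage engine for fast queries and analytics on fast-changing data, with a design that draws on ideas from Google's Spanner paper; it addresses the slowness of querying data directly on HDFS http://kudu.apache.org/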

Message Queues

Apache Kafka: a distributed message queue http://kafka.apache.org/
Apache BookKeeper: a distributed, replicated log storage service, comparable to Kafka's log http://bookkeeper.apache.org/

Note: This article belongs to its author and may not be reproduced without the author's permission.
