HDFS

SAS Viya (5)

Tutorial Objective

Gain an initial understanding of how to configure the SAS Viya platform to read CSV files from HDFS.

Key Concepts

Today, under the patient guidance of our architect, I learned that documentation and system settings must be double- and triple-checked; otherwise, when an error occurs, a great deal of time is wasted on debugging. This post explains how to resolve the situation in which SAS Viya cannot read CSV files from HDFS after installation. Everything described below is covered in detail in the SAS Viya 3.2 Deployment Guide.

When we first tried to read from HDFS through SAS Visual Analytics, we received the error "Cannot connect to the Hadoop cluster." This happens mainly because the Hadoop configuration has not been completed. In that case, follow the steps in Appendix E of the SAS Viya 3.2 Deployment Guide, "Configure the Existing Cloudera Hadoop Cluster to Interoperate with the CAS Server." The main steps are:

Verify the Key Files

First, confirm that the following eight files exist in the /opt/sas/viya/home/SASFoundation/hdatplugins directory:

  1. /opt/sas/viya/home/SASFoundation/hdatplugins/SAS_VERSION
  2. /opt/sas/viya/home/SASFoundation/hdatplugins/sas.cas.hadoop.jar
  3. /opt/sas/viya/home/SASFoundation/hdatplugins/sas.grid.provider.yarn.jar
  4. /opt/sas/viya/home/SASFoundation/hdatplugins/sas.lasr.hadoop.jar
  5. /opt/sas/viya/home/SASFoundation/hdatplugins/sascasfd
  6. /opt/sas/viya/home/SASFoundation/hdatplugins/sashdfsfd
  7. /opt/sas/viya/home/SASFoundation/hdatplugins/start-namenode-cas-hadoop.sh
  8. /opt/sas/viya/home/SASFoundation/hdatplugins/start-datanode-cas-hadoop.sh

If the files do not exist, install them with sudo rpm -i /sas-hdatplugins-03.00.00-20160315.083831547133.x86_64.rpm.
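
For reference, here is a minimal shell sketch of this check. It uses only the paths and the RPM name quoted above; adjust them if your deployment differs.

#!/bin/bash
# Check that the eight hdatplugins files listed above exist.
PLUGIN_DIR=/opt/sas/viya/home/SASFoundation/hdatplugins
FILES="SAS_VERSION sas.cas.hadoop.jar sas.grid.provider.yarn.jar sas.lasr.hadoop.jar \
       sascasfd sashdfsfd start-namenode-cas-hadoop.sh start-datanode-cas-hadoop.sh"

missing=0
for f in $FILES; do
  if [ ! -e "$PLUGIN_DIR/$f" ]; then
    echo "MISSING: $PLUGIN_DIR/$f"
    missing=1
  fi
done

# If anything is missing, install the plugin RPM named in the text above.
if [ "$missing" -eq 1 ]; then
  sudo rpm -i /sas-hdatplugins-03.00.00-20160315.083831547133.x86_64.rpm
fi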

Verify the Hadoop Home Directory

Confirm that the Hadoop home directory is /opt/cloudera/parcels/CDH-version/lib/hadoop, and check that the setting in /opt/sas/viya/home/SASFoundation/cas.settings is correct.
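
A quick way to cross-check both values from the CAS controller is sketched below. It simply greps for Hadoop-related lines in cas.settings rather than assuming a specific variable name, and the CDH parcel path uses a wildcard.

# Show any Hadoop-related setting in cas.settings and the installed CDH parcel path.
grep -i hadoop /opt/sas/viya/home/SASFoundation/cas.settings
ls -d /opt/cloudera/parcels/CDH-*/lib/hadoop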

Place the JAR Files on Every Hadoop Machine

Place the following three JAR files in the /opt/cloudera/parcels/CDH-version/lib/hadoop/lib directory on every Hadoop machine (a consolidated copy sketch for this and the next two placement steps appears after the SAS_VERSION step). Note that if these files are missing, an error will occur when the HDFS service is restarted from the Cloudera Manager web console.

  1. sas.cas.hadoop.jar
  2. sas.lasr.hadoop.jar
  3. sas.grid.provider.yarn.jar

Place the Executable Files on Every Hadoop Machine

Place the following four executable files in the /opt/cloudera/parcels/CDH-version/lib/hadoop/bin directory on every Hadoop machine. Note that if the sascasfd file is missing, an "HDFS file is missing data blocks" error will occur.

  1. sashdfsfd
  2. sascasfd
  3. start-namenode-cas-hadoop.sh
  4. start-datanode-cas-hadoop.sh

Place the SAS_VERSION File on Every Hadoop Machine

Place the SAS_VERSION file in the /opt/cloudera/parcels/CDH-version/lib/hadoop directory on every Hadoop machine.
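
The following shell sketch consolidates the three placement steps above (JARs, executables, and SAS_VERSION). The host names and the CDH-<version> placeholder are assumptions; replace them with the nodes and parcel version of your cluster, and use whatever copy mechanism you normally rely on instead of scp if you prefer.

#!/bin/bash
# Copy the SAS plug-in files to every Hadoop node (sketch only).
PLUGIN_DIR=/opt/sas/viya/home/SASFoundation/hdatplugins
HADOOP_HOME="/opt/cloudera/parcels/CDH-<version>/lib/hadoop"   # fill in the real parcel version
NODES="hadoop01 hadoop02 hadoop03"                             # hypothetical host list

for node in $NODES; do
  # JAR files go to $HADOOP_HOME/lib
  scp "$PLUGIN_DIR"/sas.cas.hadoop.jar \
      "$PLUGIN_DIR"/sas.lasr.hadoop.jar \
      "$PLUGIN_DIR"/sas.grid.provider.yarn.jar \
      "$node:$HADOOP_HOME/lib/"
  # Executables go to $HADOOP_HOME/bin
  scp "$PLUGIN_DIR"/sashdfsfd "$PLUGIN_DIR"/sascasfd \
      "$PLUGIN_DIR"/start-namenode-cas-hadoop.sh \
      "$PLUGIN_DIR"/start-datanode-cas-hadoop.sh \
      "$node:$HADOOP_HOME/bin/"
  # SAS_VERSION goes to $HADOOP_HOME
  scp "$PLUGIN_DIR"/SAS_VERSION "$node:$HADOOP_HOME/"
done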

Configure the HDFS Parameters

Open the Cloudera Manager web console, log in with the administrator account, and set the following HDFS parameters:

Parameter Name           Value
dfs.namenode.plugins     com.sas.cas.hadoop.NameNodeService
dfs.datanode.plugins     com.sas.cas.hadoop.DataNodeService

Next, for the Service-Wide group, under Advanced, set the XML for the HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property; you can click "View as XML" to edit it.

<property>
  <name>com.sas.cas.service.allow.put</name>
  <value>true</value>
</property>
<property>
  <name>com.sas.cas.hadoop.service.namenode.port</name>
  <value>15452</value>
</property>
<property>
  <name>com.sas.cas.hadoop.service.datanode.port</name>
  <value>15453</value>
</property>
<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <value>0</value>
</property>

Then, for the Gateway Default group, under Advanced, set the same XML for the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property.

<property>
  <name>com.sas.cas.service.allow.put</name>
  <value>true</value>
</property>
<property>
  <name>com.sas.cas.hadoop.service.namenode.port</name>
  <value>15452</value>
</property>
<property>
  <name>com.sas.cas.hadoop.service.datanode.port</name>
  <value>15453</value>
</property>
<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <value>0</value>
</property>

Finally, for the Gateway Default group, under Advanced, set the JAVA_HOME variable in the HDFS Client Environment Advanced Configuration Snippet for hadoop-env.sh (Safety Valve) property. The deployment guide uses JAVA_HOME=/usr/lib/java/jdk1.7.0_07, but if your nodes run JDK 8 it would instead be something like JAVA_HOME=/usr/lib/java/jdk1.8.0, so pay particular attention to whether the Java path is correct. Once the HDFS parameter settings are confirmed, click "Save" to store them.
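
Concretely, the value placed in that safety-valve field is just a JAVA_HOME assignment. The two paths below are the examples given in the text; use whichever JDK path actually exists on your Hadoop nodes.

# HDFS Client Environment Advanced Configuration Snippet for hadoop-env.sh (Safety Valve)
JAVA_HOME=/usr/lib/java/jdk1.7.0_07    # example from the deployment guide (JDK 7)
# JAVA_HOME=/usr/lib/java/jdk1.8.0     # example if the nodes run JDK 8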

Restart the HDFS Service

Open the Cloudera Manager web console, log in with the administrator account, and restart the HDFS service.

Verify CAS SASHDAT Access to HDFS

Create a test directory in HDFS and set its permissions to 777.

hadoop fs -mkdir /test
hadoop fs -chmod 777 /test

Open SAS Studio and run the following program to verify that CAS SASHDAT can access HDFS.

cas mysession;
caslib testhdat datasource=(srctype="hdfs") path="/test";
proc casutil;
load data=sashelp.zipcode;
save casdata="zipcode" replace;
run;

If the following message appears, the operation succeeded.

NOTE: Cloud Analytic Services saved the file zipcode.sashdat to HDFS in caslib
TESTHDAT.

Verify Reading CSV Files from HDFS

Open SAS Visual Analytics and click "Data" in the menu; the page switches to SAS Environment Manager, where you can select a CSV file from the HDFS data source in the data library and load it into memory. Note that if an "HDFS file is missing data blocks" error occurs during loading, check whether the sascasfd file is missing from the /opt/cloudera/parcels/CDH-version/lib/hadoop/bin directory.
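
If you hit that error, one quick way to rule out genuine HDFS corruption is to run fsck against the file from a Hadoop node (the path below is a hypothetical example); if the blocks report as healthy, the cause is more likely the missing sascasfd plug-in than HDFS itself.

# Check block health of the CSV file in HDFS (replace the path with your own file).
hdfs fsck /test/sample.csv -files -blocks -locations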

The usage scenario in this post is also described in the documentation: CAS SASHDAT can access HDFS, the SASHDAT file format supports SAS formats, and CAS optimizes in-memory processing while allowing users to save and read data in HDFS. Note that this scenario requires the SAS plug-in files for Hadoop to be configured on the Hadoop node machines, which is the focus of this post.

In summary, whether placing the plug-in files or configuring the HDFS parameters, be sure to check all nine steps at least three times. Because I checked only once and missed key details in the deployment guide, errors occurred later and everyone's time was wasted; fortunately, the architect's patient guidance solved the problem, and I will not make the same mistake again. If the configuration is correct, SAS Viya should be able to read CSV files and other formats from HDFS, enabling big data management and advanced analytics applications.

Related Resources

SAS Basic Introduction (3)

Basic Introduction

Tutorial Objective

Gain an initial understanding of how the SAS LASR Analytic Server performs data analysis in memory.

Key Concepts

The SAS LASR Analytic Server is an analytics platform that lets multiple users securely and concurrently access data loaded into memory. Its greatest strength is its distributed computing environment, where the load spread across multiple machines is processed with a high degree of parallelism. It provides two main ways to analyze both small data sets and big data; during analysis, tables are read into memory for high-performance processing.

  1. Read data from tables and data sets.
  2. Read data from a co-located data provider or HDFS.

In addition, it can integrate data warehouses from different data sources, such as Teradata; the data is first converted into SAS data sets and then loaded into memory, and the related SAS data sets in HDFS can be refreshed (HDFS does not support APPEND). If the LASR Analytic Server crashes, it can be restored from HDFS or from SAS data sets after a restart. For the high-level architecture related to SAS Visual Analytics, see the figure below.

Figure: SAS Visual Analytics high-level architecture

Finally, the SAS LASR Monitor service (Grid Monitor) can be used to monitor the server's status. If the server behaves abnormally or is corrupted, follow the six steps below to restore the SAS LASR Analytic Server (steps 2 and 3 are sketched in shell form after the list).

  1. Stop the SAS LASR Analytic Server and the SAS LASR Monitor service.
  2. Find the TKGrid sessions (ps -ef | grep TKGrid).
  3. Kill the TKGrid sessions (kill -9 [pid]).
  4. Restart the SAS LASR Monitor service.
  5. Restart the SAS LASR Analytic Server.
  6. Reload the data into memory.
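
Steps 2 and 3 can be scripted as below. This is only a sketch of the TKGrid cleanup; the commands for stopping and restarting the LASR and Monitor services in steps 1 and 4 to 5 depend on how your deployment manages them, so they are not shown.

#!/bin/bash
# Find leftover TKGrid sessions (step 2).
ps -ef | grep TKGrid | grep -v grep

# Kill them (step 3); the awk column assumes standard ps -ef output.
ps -ef | grep TKGrid | grep -v grep | awk '{print $2}' | xargs -r kill -9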

Related Resources

Data Analysis Big Data (2)

Basic Introduction

Tutorial Objective

Gain an initial understanding of the key papers in the field of big data technologies and applications.

Key Concepts

Hadoop

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat
Google, Inc.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

HDFS

The Google File System

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
Google, Inc.

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

HBase

Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

Pig

Pig Latin: A Not-So-Foreign Language for Data Processing
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research

There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.

Hive

Hive: A Petabyte Scale Data Warehouse using Hadoop.

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, Raghotham Murthy
Facebook Inc.

The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation which is being used in companies like Yahoo, Facebook etc. to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into mapreduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug in custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog - Metastore – that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. In Facebook, the Hive warehouse contains tens of thousands of tables and stores over 700TB of data and is being used extensively for both reporting and ad-hoc analyses by more than 200 users per month.

Giraph

Pregel: A System for Large-Scale Graph Processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski
Google, Inc.

Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs—in some cases billions of vertices, trillions of edges—poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.

Mahout

Map-Reduce for Machine Learning on Multicore

Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, Kunle Olukotun
Stanford University

We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.

Spark

Spark: Cluster Computing with Working Sets

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

Mesos

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
University of California, Berkeley

We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today’s frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.

YARN

Apache Hadoop YARN: Yet Another Resource Negotiator

Vinod Kumar Vavilapalli and Arun C Murthy (Hortonworks), Chris Douglas (Microsoft), Sharad Agarwal (Inmobi), Mahadev Konar (Hortonworks), Robert Evans, Thomas Graves, and Jason Lowe (Yahoo!), Hitesh Shah, Siddharth Seth, and Bikas Saha (Hortonworks), Carlo Curino (Microsoft), Owen O’Malley and Sanjay Radia (Hortonworks), Benjamin Reed (Facebook), and Eric Baldeschwieler (Hortonworks)

The initial design of Apache Hadoop was tightly focused on running massive, MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agora—the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs’ control flow, which resulted in endless scalability concerns for the scheduler. In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop’s compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental evidence demonstrating the improvements we made, confirm improved efficiency by reporting the experience of running YARN on production environments (including 100% of Yahoo! grids), and confirm the flexibility claims by discussing the porting of several programming frameworks onto YARN.

BigQuery

Dremel: Interactive Analysis of Web-Scale Datasets

Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Google, Inc.

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.

Related Resources