Apache Spark

Data Analysis: Big Data (1)



A first look at the new trends that Big Data is shaping.


Analytics & Machine Learning

Rapid Insights Providing Business Impact

  • “Just-in-time” analytics that can be embedded directly into business processes to compare business outcomes.
  • Analytical solutions available at the point of decision.
  • New solutions must dynamically mix & analyze data, from real-time to historical, to deliver continuous business results, with machine learning leveraged.

Best Practice: Apache Spark

Lambda Data Management

Lambda Data - a new lens on data systems, designed to tame growing complexity.

  • Defines a set of principles for how batch & stream processing can work together.
    • Human fault tolerance.
    • Immutability - keep data immutable across the range of business contexts.
    • Pre-computation & re-computation.
  • Data Handling Layers
    • Batch Layer - stores the master data set. (e.g. Hadoop, HDFS)
    • Serving Layer - indexes & offers precomputed views for ad hoc, low-latency queries.
    • Speed Layer - real-time views are incremental (“complexity isolation”); it handles only transient additions until the next batch recomputation.
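
The three layers above can be illustrated with a minimal in-memory sketch. It is not tied to Hadoop, Spark, or any other real system: the page-view events, the `SpeedLayer` class, and the `query` helper are all hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical master data set: immutable (timestamp, page) records.
master_dataset = [(1, "home"), (2, "pricing"), (3, "home")]


def recompute_batch_view(events):
    """Batch layer: recompute the view from scratch over the whole master data set."""
    return Counter(page for _, page in events)


class SpeedLayer:
    """Speed layer: incrementally counts only events the batch view has not absorbed yet."""

    def __init__(self):
        self.realtime_view = Counter()
        self.pending = []

    def ingest(self, event):
        self.pending.append(event)
        self.realtime_view[event[1]] += 1

    def reset(self):
        # Transient state is thrown away once a batch run has caught up.
        self.pending.clear()
        self.realtime_view.clear()


def query(batch_view, realtime_view, page):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view[page] + realtime_view[page]


batch_view = recompute_batch_view(master_dataset)
speed = SpeedLayer()
speed.ingest((4, "home"))
speed.ingest((5, "docs"))
home_before = query(batch_view, speed.realtime_view, "home")

# Next batch run: append the pending events to the master data set,
# recompute the batch view, then drop the transient real-time state.
master_dataset = master_dataset + speed.pending
batch_view = recompute_batch_view(master_dataset)
speed.reset()
home_after = query(batch_view, speed.realtime_view, "home")
```

The query result is the same before and after the batch run; only the place where the recent events are counted changes, which is the sense in which the speed layer's complexity stays isolated and temporary.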

Best Practice: Google BigQuery

Application Development & Business Integration

Notebook IDEs becoming all the rage

  • OSS innovation around a web-based, interactive approach to collaborating on new solutions is rising fast - one unified place for a team to share insights, business results, notes, etc.
  • Notebook-as-a-service - microservices are “good enough” for some analytics-based solutions until business leaders need / expect real-time speeds.

Implications For Future Applications

  • Answering open-ended business questions - the velocity, variety & volume of big data set a new stage.
  • Businesses can act on close approximations sooner, rather than wait for higher analytical accuracy in hindsight.
  • Innovations in data handling & analytics are starting to address a new class of business applications:
    time-to-value - launch product -> continuously analyze business impact -> learn & refine, then repeat.
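
One way to picture the trade-off between a close approximation now and an exact answer later is sampling. The sketch below is only an illustration under assumed data: the `orders` values are randomly generated, and `approximate_mean` is a hypothetical helper, not part of any analytics product.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical order values standing in for a large historical data set.
orders = [random.uniform(5.0, 200.0) for _ in range(100_000)]


def approximate_mean(values, sample_size):
    """Estimate the mean from a random sample instead of scanning every value."""
    sample = random.sample(values, sample_size)
    return statistics.fmean(sample)


exact = statistics.fmean(orders)             # full scan over every record
estimate = approximate_mean(orders, 1_000)   # reads only 1% of the records
relative_error = abs(estimate - exact) / exact
print(f"exact={exact:.2f} estimate={estimate:.2f} relative error={relative_error:.2%}")
```

The estimate touches a small fraction of the data, so it can be available earlier in the launch, analyze, refine loop; the exact figure can still be computed afterwards.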

Best Practice: The IPython Notebook


Cloud Services: Amazon Web Services (2)



Learn how to choose the most suitable Amazon cloud service for each stage of big data processing.


Big Data Portfolio of Services

Use the table below to pick the most suitable Amazon cloud service for each stage of big data processing.

Processing stage      Cloud services
Collect               AWS Direct Connect, AWS Import/Export, Amazon Kinesis
Store                 Amazon S3, Amazon DynamoDB, Amazon Glacier
Process & analyze     Amazon Redshift, Amazon EMR, Amazon EC2

(Reference: AWS re:Invent 2014 | (BDT303) Construct ETL Pipeline w/ AWS Data Pipeline, Amazon EMR & Redshift)
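
The table above can also be written as a small lookup, which may help when scripting a first pass at service selection. This is a sketch only: the stage keys (`collect`, `store`, `process_and_analyze`) and the `candidate_services` helper are hypothetical names, and the service lists are copied from the table rather than from any AWS API.

```python
# Mapping copied from the table above; the keys are illustrative stage labels.
SERVICES_BY_STAGE = {
    "collect": ["AWS Direct Connect", "AWS Import/Export", "Amazon Kinesis"],
    "store": ["Amazon S3", "Amazon DynamoDB", "Amazon Glacier"],
    "process_and_analyze": ["Amazon Redshift", "Amazon EMR", "Amazon EC2"],
}


def candidate_services(stage):
    """Return the candidate services for a stage, or raise if the stage is unknown."""
    try:
        return SERVICES_BY_STAGE[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None


for stage in SERVICES_BY_STAGE:
    print(stage, "->", ", ".join(candidate_services(stage)))
```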

Batch Processing

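Use the Amazon EMR cloud service together with the open-source Apache Mahout project for data analysis and processing.

Use the Amazon EMR cloud service together with the open-source Apache Spark project for data analysis and processing.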

Real-time Processing

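Use the Amazon Kinesis cloud service together with the Amazon Redshift cloud service for data analysis and processing.

Use the Amazon Kinesis cloud service together with the open-source Storm project for data analysis and processing.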