Big Data and Cloud Computing

By Tsai Li Ming
8 March 2016 @ SMU

Source: https://github.com/tsailiming/smu-talk-8mar2016

About Me

  • Software Engineer | Solution Architect
  • Focus on Big Data and Cloud Computing
  • Python User Group Singapore (Submit a talk for PyCon SG 2016!)

What is Cloud Computing

Cloud computing, also known as on-demand computing, is a kind of Internet-based computing that provides shared processing resources and data to computers and other devices on demand. (Wikipedia)

  • Servers
  • Networking
  • Storage
  • Software or Applications
  • Virtualized or Containerized
  • Managed

Who are the players?

Public and Private Cloud? Hybrid?

3 Vs of Big Data?

Volume

The quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can actually be considered big data or not. (TB -> PB -> EB -> ZB -> YB)
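To make the unit ladder above concrete, here is a minimal Python sketch that formats a byte count with the largest fitting unit. It assumes decimal (SI) units, where each step is a factor of 1000; the function name `human_readable` is made up for this illustration.

```python
# Decimal (SI) storage units; each one is 1000x the previous (an assumption,
# binary units such as TiB would use 1024 instead).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: int) -> str:
    """Format a byte count using the largest unit that keeps the value >= 1."""
    index = 0
    scale = 1
    # Integer comparisons keep the unit selection exact for very large values.
    while num_bytes >= scale * 1000 and index < len(UNITS) - 1:
        scale *= 1000
        index += 1
    return f"{num_bytes / scale:.1f} {UNITS[index]}"

print(human_readable(5 * 10**12))  # 5.0 TB
print(human_readable(2 * 10**18))  # 2.0 EB
```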

Variety

The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

Velocity

In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Variability

Inconsistency in the data set can hamper the processes that handle and manage it.

Veracity

The quality of captured data can vary greatly, affecting accurate analysis.

Where is the Data Coming From?

Computers and Humans
  • Sensors
  • Images
  • Videos
  • M2M
  • Social Network
  • Wearables
  • Emails
  • Documents

Digital Data Explosion

Making Sense of the Data

Use Case: Aviation

Benefits of Analytics

  • Analyzing data =>
  • Predictive Maintenance =>
  • Preventive Maintenance =>
  • Increase customer satisfaction, better fuel efficiency, etc

Hadoop Ecosystem

Source: Hortonworks

More Info: https://hadoopecosystemtable.github.io/

Scaling Up vs Out

  • CPU? Memory? Storage? Network?
  • Vertically up
  • Horizontally across
  • Issues? Limitations?

Scaling Your Code

  • Parallel Programming
  • Distributed Computing
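As a rough illustration of the two bullets above, the sketch below splits a word count into a "map" step that runs on a pool of workers and a "reduce" step that merges the partial results. It is only a single-machine stand-in: the threads play the role that separate processes or cluster nodes would play in a real distributed job, and the names `count_words` and `parallel_word_count` are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk: str) -> Counter:
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.lower().split())

def parallel_word_count(chunks, workers: int = 4) -> Counter:
    """Run the map step on a pool of workers, then reduce by summing."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is handled independently, so the work can be spread out.
        for partial in pool.map(count_words, chunks):
            total.update(partial)  # Reduce step: merge partial counts.
    return total

# Hypothetical input, standing in for blocks of a much larger data set.
chunks = ["big data big cloud", "cloud computing", "big data"]
counts = parallel_word_count(chunks)
print(counts["big"])    # 3
print(counts["cloud"])  # 2
```

The same split into independent map tasks and a final merge is what lets frameworks such as MapReduce spread work horizontally across many machines.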

Amazon Elastic MapReduce (EMR)

  • Managed Platform
  • MapReduce, Apache Spark, Hive, etc
  • Launch a cluster in minutes
  • Open source distribution or MapR
  • Elasticity of EC2 (Pay by the hour or Spot instances)

https://aws.amazon.com/elasticmapreduce/
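To see why paying by the hour matters, here is a small back-of-the-envelope sketch in Python. Every number in it is hypothetical: the price per node-hour is not a real AWS rate, and the demand curve is invented to show the arithmetic only.

```python
HOURLY_PRICE = 0.25  # hypothetical price per node-hour, NOT a real AWS rate

# Hypothetical number of nodes needed in each hour of one day:
# 8 quiet hours, a 4-hour peak, then 12 moderate hours.
demand = [2] * 8 + [10] * 4 + [4] * 12

def fixed_cost(demand, price):
    """Cost of provisioning for the peak and keeping it running all day."""
    return max(demand) * len(demand) * price

def elastic_cost(demand, price):
    """Cost of paying only for the node-hours actually needed."""
    return sum(demand) * price

print(fixed_cost(demand, HOURLY_PRICE))    # 60.0
print(elastic_cost(demand, HOURLY_PRICE))  # 26.0
```

Under these made-up numbers the elastic cluster costs less than half of the fixed one, because the peak capacity is only paid for during the four peak hours.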

Benefits of Elasticity

Source: Amazon

Hadoop Applications on EMR

  • Apache Hadoop
  • Apache Hive
  • Apache Mahout
  • Apache Pig
  • Apache Spark
  • Hue
  • Ganglia
  • And many more

There are commercial Hadoop distributions too.

Questions? Break for 5-10 mins?