Definition: “Extremely scalable analytics – analyzing petabytes of structured and unstructured data at high velocity.”
Definition: “Big data is data that exceeds the processing capacity of conventional database systems.”
Big Data has three characteristics:
Variety – Structured and unstructured data
Velocity – Time-sensitive data that must be processed alongside its enterprise data counterparts in order to maximize its value
Volume – Data whose size exceeds the nominal storage capacity of the enterprise.
Statistics:
– In 2011, the global output of data was estimated to be 1.8 zettabytes (10^21 bytes)
– 90% of the world's data was created in the last 2 years.
– We create 2.5 quintillion (10^18) bytes of data per day (from sensors, social media posts, digital pictures, etc.)
– The digital world will increase in capacity 44-fold between 2009 and 2020.
– Only 5% of data is created in structured form; the remaining 95% is largely unstructured.
– 80% of the effort involved in dealing with unstructured data goes into reconditioning ill-formed data into well-formed data (cleaning it up).
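To make that last point concrete, here is a minimal sketch (in Python; the field names and sample records are hypothetical) of the kind of reconditioning step that dominates unstructured-data work: coercing ragged, inconsistently formatted records into a uniform, well-formed shape.

```python
from datetime import datetime
import re

# Hypothetical ill-formed input: mixed date formats, inconsistent casing,
# stray whitespace, and missing fields -- typical of raw feeds.
raw_records = [
    "  2011-06-29 , JOHN.DOE , 42 ",
    "06/29/2011,jane doe,",
    "bad line with no commas",
]

def clean_record(line):
    """Recondition one raw line into a well-formed dict, or None if unrecoverable."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return None  # discard records we cannot parse at all
    date_str, name, value = parts
    date = None
    # Normalize the two date formats we expect to encounter.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            date = datetime.strptime(date_str, fmt).date()
            break
        except ValueError:
            continue
    if date is None:
        return None
    return {
        "date": date.isoformat(),                              # ISO 8601 everywhere
        "name": re.sub(r"[.\s]+", " ", name).strip().lower(),  # one canonical form
        "value": int(value) if value.isdigit() else None,      # explicit missing value
    }

cleaned = [r for r in (clean_record(line) for line in raw_records) if r]
print(cleaned)
# [{'date': '2011-06-29', 'name': 'john doe', 'value': 42},
#  {'date': '2011-06-29', 'name': 'jane doe', 'value': None}]
```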
Performance Statistics (I will start tracking these more closely):
– Traditional data storage costs approximately $5/GB, while storing the same data on Hadoop costs only $0.25/GB – yep, 25 cents/GB (a quick back-of-the-envelope comparison follows after this list).
– Facebook stores more than 20 petabytes of data across 23,000 cores, with 50 terabytes of raw data generated per day.
– eBay uses over 2,600 clustered Hadoop servers.
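Here is that cost comparison worked out, using only the two quoted per-GB prices (the 1 PB figure is just an illustrative workload, not from the source):

```python
# Back-of-the-envelope storage cost comparison using the quoted per-GB prices.
TRADITIONAL_PER_GB = 5.00  # $/GB, traditional enterprise storage
HADOOP_PER_GB = 0.25       # $/GB, commodity Hadoop storage

PB_IN_GB = 1_000_000  # 1 PB = 10^6 GB (decimal units)

for label, rate in (("traditional", TRADITIONAL_PER_GB), ("Hadoop", HADOOP_PER_GB)):
    print(f"1 PB on {label} storage: ${rate * PB_IN_GB:,.0f}")

# 1 PB on traditional storage: $5,000,000
# 1 PB on Hadoop storage:      $250,000   (a 20x difference)
```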