A Very Short Introduction
By Dawn E. Holmes
What is Big Data?
Holmes explains that the size of a dataset cannot, in itself, be a standalone metric for whether something falls under the banner of ‘Big Data’.
Instead, we can use the three V’s — Volume, Variety and Velocity:
- Volume — “Refers to the amount of electronic data that is now collected and stored… In 2012, IBM and the University of Oxford reported the findings of their Big Data Work survey. In this international survey of 1,144 professionals working in ninety-five different countries, over half judged datasets of between 1 TB and 1 PB to be big… The survey asked respondents to choose either one or two defining characteristics of big data from a choice of eight; only 10% voted for ‘big volumes of data’ with the top choice being ‘a greater scope of information’”
- Variety — “Once we are connected to the Web, we have access to a chaotic collection of data, from sources both reliable and suspect, prone to repetition and error.”
- Velocity — “Data is now streaming continuously from sources such as the Web, smartphones and sensors… Velocity also refers to the speed at which data is electronically processed”
Storing Structured Data
Holmes writes that structured data is normally stored in Relational Database Management Systems (RDBMS). A big advantage of Relational Databases is “a process called normalization which includes reducing data duplication to a minimum and hence reduces storage requirements.”
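To make the normalization idea in the quote concrete, here is a minimal sketch in plain Python rather than a real RDBMS. The order records, the field names and the `normalize` helper are all hypothetical, invented for illustration. It splits one repetitive table into a customers table and an orders table, so each customer's details are stored once.

```python
# Hypothetical, denormalized order records: the customer's name and city
# are repeated on every order that customer places.
denormalized = [
    {"order_id": 1, "customer": "Ada", "city": "London", "item": "keyboard"},
    {"order_id": 2, "customer": "Ada", "city": "London", "item": "mouse"},
    {"order_id": 3, "customer": "Grace", "city": "New York", "item": "monitor"},
]


def normalize(rows):
    """Split rows into a customers table and an orders table linked by customer_id."""
    customer_ids = {}  # (name, city) -> customer_id
    orders = []
    for row in rows:
        key = (row["customer"], row["city"])
        # Reuse the existing id for a known customer, otherwise assign the next one.
        customer_id = customer_ids.setdefault(key, len(customer_ids) + 1)
        orders.append(
            {"order_id": row["order_id"], "customer_id": customer_id, "item": row["item"]}
        )
    customers = [
        {"customer_id": cid, "name": name, "city": city}
        for (name, city), cid in customer_ids.items()
    ]
    return customers, orders


customer_table, order_table = normalize(denormalized)
```

After this step "London" is stored once in `customer_table`, and each order carries only a small `customer_id` reference instead of the full customer details.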
The issue with Relational Databases comes down to scalability. They were traditionally designed to run on a single server, so the only way to scale them is to add more computing power to that server, which is known as vertical scalability.
Relational Databases do have the advantage of conforming to ACID:
- Atomicity — Guarantees that either all of a transaction succeeds or none of it does.
- Consistency — All data will be valid according to all defined rules, including any constraints, cascades, and triggers that have been applied on the database.
- Isolation — Guarantees that all transactions will occur in isolation. No transaction will be affected by any other transaction. So a transaction cannot read data from any other transaction that has not yet completed.
- Durability — Durability means that, once a transaction is committed, it will remain in the system — even if there’s a system crash immediately following the transaction. Any changes from the transaction must be stored permanently. If the system tells the user that the transaction has succeeded, the transaction must have, in fact, succeeded.
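As an illustration of atomicity and consistency from the list above, the following is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names and the `transfer` helper are assumptions made up for the example. A `CHECK` constraint forbids negative balances, and a transfer that would break it is rolled back as a whole, including the credit that had already been applied.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # Credit first, so a later failure has real work to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Only the first transfer takes effect: `balances` ends up as alice 70 and bob 80, with no trace of the rejected 500 credit. Isolation and durability are not exercised here, since the sketch uses a single connection and an in-memory database.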
Storing Unstructured Data
An RDBMS is ill-suited to storing Unstructured Data for several reasons, not least that a schema is difficult to change once it has been constructed, and Unstructured Data cannot be conveniently organized into rows and columns.
A distributed file system (DFS) provides efficient and reliable storage for Big Data, with the Hadoop Distributed File System (HDFS) being one of the most popular. In Hadoop's design, the DFS is made up of a master NameNode, which deals with requests from clients, and slave DataNodes, which are responsible for storing data and for CRUD operations. DataNodes scale horizontally, which makes elasticity far easier.
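The split between a NameNode and its DataNodes can be sketched with a toy in-memory model. This is not Hadoop's API: the class names mirror the roles described above, but the methods, the tiny `block_size`, the `replication` factor and the round-robin placement are all assumptions chosen for illustration. For brevity the NameNode here also forwards the data, whereas a real system would have clients talk to DataNodes directly.

```python
class DataNode:
    """Stores raw blocks, keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Keeps metadata only: which blocks make up a file and where each one lives."""

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = list(datanodes)
        self.block_size = block_size    # bytes per block (tiny, for illustration)
        self.replication = replication  # copies kept of every block
        self.files = {}                 # filename -> [(block_id, [datanode names])]
        self._cursor = 0                # round-robin placement cursor

    def add_datanode(self, node):
        """Horizontal scaling: more storage is just another DataNode."""
        self.datanodes.append(node)

    def write(self, filename, data):
        placements = []
        for offset in range(0, len(data), self.block_size):
            block_id = f"{filename}#{offset // self.block_size}"
            # Pick `replication` distinct nodes, rotating the starting point.
            targets = [
                self.datanodes[(self._cursor + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            self._cursor += 1
            for node in targets:
                node.blocks[block_id] = data[offset:offset + self.block_size]
            placements.append((block_id, [node.name for node in targets]))
        self.files[filename] = placements

    def read(self, filename):
        by_name = {node.name: node for node in self.datanodes}
        # Read each block from the first replica listed in the metadata.
        return b"".join(
            by_name[names[0]].blocks[block_id]
            for block_id, names in self.files[filename]
        )


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(nodes)
namenode.write("log.txt", b"hello big data")
restored = namenode.read("log.txt")
```

The 14-byte file is cut into four blocks, each stored on two different DataNodes, and `restored` equals the original bytes. Calling `add_datanode` adds capacity without touching the existing nodes, which is the horizontal scaling the paragraph describes.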
Google’s Big Data Flu Failure
Google had a famous failure when it applied its Big Data to flu prediction in a project called “Flu Trends”. Google tracked the most searched flu-related terms in order to predict the number of flu cases, and its predictions were checked against the Centers for Disease Control and Prevention’s own figures. Google’s algorithm over-predicted the number of flu cases by at least 50%.