Hey guys, in this article I am going to explain what Big Data is in the simplest possible way, with some examples of how big MNCs like Google, Facebook, Amazon, etc. manage huge amounts of data. So let's begin.
What is Big Data?
If you are a beginner reading this article, you have probably heard the term "Big Data" multiple times and thought of it as the name of some technology. But in reality, Big Data isn't the name of a technology; it isn't even a technology at all. It is an umbrella term for a set of problems that big MNCs like Google, Facebook, Amazon, etc. are facing.
So, what exactly is the problem? Before I explain the problems that come under Big Data, let's first understand what Big Data is.
Big Data is a term that simply means very large amounts of data. Today's world is all about data. Every day, hundreds of thousands of terabytes of data are generated from various sources like social media, online shopping, web searches, etc. Handling, storing, analyzing, and making use of that huge amount of data is becoming a very difficult and big issue. But how, and what exactly are the problems?
Let's understand two major problems caused by Big Data.
Note:- There are many problems caused by Big Data, but I am only explaining two to keep the article short and understandable from a beginner's perspective. You might have heard about the 5V model, which describes the problems caused by Big Data; I am explaining two of those Vs here.
1. Volume :-
Volume is all about storing the data generated from various sources. Large amounts of data need to be stored somewhere permanently, or in a persistent manner, and to store data we need storage devices.
Storing 1000 terabytes, or petabytes, of data requires huge storage capacity, which is obviously very costly, and in today's digital era that amount of data is generated every day from various sources. According to some sources, Facebook alone generates 4 petabytes of data per day.
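To get a feel for the scale, here is a rough back-of-the-envelope calculation based on that 4 PB/day figure. The 16 TB drive size below is just an assumption for illustration, not a figure from any company:

```python
# Back-of-the-envelope arithmetic for the volume problem: how many
# drives would a year of ~4 PB/day add up to? The 16 TB drive size
# is an illustrative assumption.
PB = 10**15                         # bytes in a petabyte (decimal)
TB = 10**12                         # bytes in a terabyte

daily_bytes = 4 * PB                # figure quoted in the article
yearly_bytes = daily_bytes * 365

drive_capacity = 16 * TB            # assume 16 TB drives
drives_needed = -(-yearly_bytes // drive_capacity)  # ceiling division

print(f"Data per year: {yearly_bytes / 10**18:.2f} exabytes")
print(f"16 TB drives needed (before any replication): {drives_needed:,}")
```

That is over ninety thousand large drives per year, before counting any backup copies, which is why storage at this scale is a genuine problem and not just a shopping trip.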
Now you might say that Google, Facebook, and Amazon are big MNCs generating billions of dollars of revenue per year, so they can easily buy these storage devices. But that leads to two new problems:
A. Single point of failure:- Storage devices are hardware, and hardware needs maintenance from time to time. Devices also get damaged sometimes due to various reasons, and no hardware manufacturer in the world can give a lifetime guarantee on its products. If our data sits on a single device and that device fails, well, bye-bye to our data.
According to some sources, 70% of small firms go out of business after a large data loss incident, and big MNCs don't want the same thing to happen to them.
B. Speed:- Huge storage also leads to another issue: the speed of I/O (input and output) operations, or in simple terms, reading and writing data. You could also call this issue velocity, if you have ever studied physics. Velocity is another big issue that comes under Big Data.
2. Velocity :-
When we write data to a hard disk it takes considerable time, and when we read it back it takes time too. You might have seen disk benchmark images showing read and write speeds.
Now, as we know, a huge amount of data is generated every day. Writing data to a hard disk and reading it back consume a considerable amount of time, and today's world is all about speed. Big tech giants like Facebook, Google, and Amazon not only have to store data, they also have to make it available to users and make use of it, or in slightly more technical terms, process and analyze it for business purposes.
To solve both of the above challenges, a concept called "distributed storage" was introduced. Hadoop is one of the products (software) based on distributed storage.
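A minimal sketch of the idea, with assumed figures (100 MB/s per disk, a 100 TB dataset, a 1,000-node cluster — all illustrative, not real numbers from any company), shows why splitting data across many machines speeds up reads so dramatically:

```python
# Illustrative arithmetic: reading a 100 TB dataset sequentially
# from one disk vs. in parallel from 1,000 disks, each holding a
# 1/1,000 slice. All figures are assumptions for illustration.
DATA_MB = 100 * 10**6          # 100 TB expressed in megabytes
DISK_SPEED_MB_S = 100          # assumed speed of one spinning disk
NODES = 1000                   # assumed cluster size

single_disk_seconds = DATA_MB / DISK_SPEED_MB_S
parallel_seconds = single_disk_seconds / NODES   # ideal case, no overhead

print(f"One disk:    {single_disk_seconds / 86400:.1f} days")
print(f"1000 disks:  {parallel_seconds / 60:.1f} minutes")
```

Roughly eleven and a half days on one disk versus under twenty minutes across the cluster: that speedup, not just extra capacity, is what makes distributed storage attractive.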
Likewise, to process that huge amount of data we need great computing power, so these MNCs could either buy supercomputers or instead use the concept of "distributed computing", and Hadoop is also based on distributed computing. In Hadoop, MapReduce is used for Big Data processing.
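To make MapReduce less abstract, here is a toy word count in plain Python that mimics its three phases: map (emit key/value pairs), shuffle (group values by key), and reduce (aggregate each group). Hadoop runs this same pattern across many machines; this single-process sketch only shows the shape of the computation, and the function names are my own, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in a line of input.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group all emitted values by key, as Hadoop does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # -> {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster, each machine runs `map_phase` on its own slice of the data and only the small grouped results travel over the network, which is why the pattern scales.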
Note:- By the above statement I am not saying that big MNCs don't have supercomputers.
Let’s have a look at some facts about data generation.
Worldwide, people are already generating 2.5 quintillion bytes of data each day.
At the beginning of 2020, the digital universe was estimated to consist of 44 zettabytes of data.
By 2025, approximately 463 exabytes will be created every 24 hours worldwide.
Facebook records 2.7 billion likes and 300 million photo uploads per day.
Google, Facebook, Microsoft, and Amazon are custodians of at least 1,200 petabytes of people’s information.
On Twitter, about 6,000 tweets are posted per second (generating about 12 terabytes of data), about 500 million per day (about 84 terabytes), and about 200 billion per year (about 4.3 petabytes).
SOME CASE STUDIES
The above facts and concepts bring us to the question of how big MNCs like Google, Facebook, Amazon, etc. handle that huge amount of data, or Big Data, because data is the secret ingredient that fuels recommendations, predictions, and decisions.
Well, who doesn't know about Netflix? Netflix is one of the most popular on-demand online video streaming platforms, used by people around the globe. According to Netflix, over 75% of viewer activity is driven by personalized recommendations. It collects customer data to understand the specific needs, preferences, and taste patterns of users, then uses this data to predict what individual users will like and to create personalized recommendation lists for them.
Now you might say, "I already know about Netflix, but I want to know how Netflix processes its data."
How does Netflix handle Big Data?
Well, Netflix uses data processing software and traditional business intelligence tools such as Hadoop and Teradata, as well as its own open-source solutions such as Lipstick and Genie (Genie is a completely open-source tool developed by Netflix to handle increasingly massive amounts of data), to gather, store, and process massive amounts of information.
Instead of using the traditional data warehouse approach, Netflix uses Amazon S3 to warehouse (in much simpler terms, store) its data. Netflix spins up multiple Hadoop clusters for different purposes, all using the same data stored in Amazon S3. Within the Hadoop ecosystem, Netflix uses Hive (for ad hoc queries, analytics, etc.) and Pig (for ETL).
Just like Netflix, Twitter is a very well-known social media platform, where users can post short messages (originally limited to 140 characters) known as tweets. If you have read the facts above, you already know the amount of data Twitter generates. Now a big question arises: how does Twitter manage that huge amount of data, or Big Data?
How does Twitter handle Big Data?
To manage its data, Twitter uses Hadoop for both storage and compute power. Twitter runs many clusters, the biggest of which is considered to be a cluster with around 10,000 nodes. Twitter also uses Gizzard, an open-source framework developed by Twitter for creating distributed datastores. Tweets are stored in T-bird, which is built on top of Gizzard. Twitter also developed FlockDB, built on top of MySQL, to store graph information. Now you might say, "What on earth is graph information?" Twitter is about more than just tweets: it stores relationships between users and tweets so that it can generate trends, events, related people, etc. Snowflake is a service Twitter built to generate a unique ID for each tweet.
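Snowflake's published layout packs a millisecond timestamp, a machine ID, and a per-millisecond sequence number into a single 64-bit integer. Here is a minimal Python sketch of that scheme; the function is illustrative, not Twitter's actual code, and the epoch constant is the one Twitter published for Snowflake:

```python
import time

# Minimal Snowflake-style ID sketch: 41 bits of milliseconds since
# a custom epoch, 10 bits of machine ID, 12 bits of per-millisecond
# sequence, packed into one 64-bit integer.
EPOCH_MS = 1288834974657  # Twitter's published Snowflake epoch

def snowflake_id(machine_id, sequence, timestamp_ms=None):
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return ((timestamp_ms - EPOCH_MS) << 22) | (machine_id << 12) | sequence

# The timestamp occupies the most significant bits, so IDs created
# later always sort higher, no matter which machine issued them.
a = snowflake_id(machine_id=1, sequence=0, timestamp_ms=EPOCH_MS + 1000)
b = snowflake_id(machine_id=1, sequence=0, timestamp_ms=EPOCH_MS + 2000)
print(a < b)  # -> True
```

Because each machine stamps its own IDs independently, no central database counter is needed, which is exactly the kind of single point of failure this article discussed earlier.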
Well, who doesn't know Facebook? Or I should say, who doesn't use Facebook, because according to some sources almost half the human population uses it. With numbers like that, it is quite clear what a huge amount of data Facebook has to handle; if you would still like the figures, you can read the facts mentioned above. I can safely say Facebook is a goldmine of data.
How does Facebook handle Big Data?
Unlike some big names like Netflix and Dropbox, Facebook has its own infrastructure, which it uses to store and manage its ocean of data. Facebook uses Hadoop, and according to Jay Parikh, Facebook's Vice President of Infrastructure Engineering, "Facebook runs the world's largest Hadoop cluster." Facebook collects a lot of unstructured data, and to speed up analysis it developed Scuba. Facebook also uses Cassandra, Hive, Prism, etc.
Want to know More?
If you would like to read some more amazing case studies about Big Data, I recommend visiting the links below:-
Data is everywhere, and its amount will increase exponentially in the near future. To store and process data you don't need a supercomputer, as I mentioned above; you just need to know the right concepts and some tools, just like Twitter did. Instead of building a database from scratch, they built FlockDB on top of MySQL to store more complex data than MySQL alone could handle.
For further queries or suggestions, feel free to connect with me on LinkedIn or comment below.
Hi, this is Ayush Garg. I believe in simplicity. Life and stuff are already complicated, so why make them more complicated and complex than they already are? I try to make things as simple as I can.
If you liked it, then please clap and share.
Thank you everyone for reading!