Blog on Big Data

Buddhiprakash Jain
9 min read · Mar 18, 2021

What is Big Data?

Big data is a field that deals with ways to analyze, systematically extract information from, or otherwise handle data sets that are too large or complex for traditional data-processing application software. Data with many records (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sourcing. Big data was originally associated with three key concepts: volume, variety, and velocity. The analysis of big data also presents challenges in sampling, where previously only observations and samples of smaller data sets could be worked with. In short, big data often means data whose size exceeds the capacity of traditional software to process within an acceptable time.

Big Data is a collection of data that is huge in volume, yet grows exponentially with time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently. Put simply, Big Data is still data, just at an enormous scale.

The history of Big Data

Although the concept of big data itself is relatively new, the origins of large data sets go back to the 1960s and ’70s when the world of data was just getting started with the first data centers and the development of the relational database.

Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. Hadoop (an open-source framework created specifically to store and analyze big data sets) was developed that same year. NoSQL also began to gain popularity during this time.

The development of open-source frameworks such as Hadoop (and more recently, Spark) was essential for the growth of big data because they make big data easier to work with and cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are still generating huge amounts of data — but it’s not just humans anymore: devices and sensors connected through the Internet of Things (IoT) generate data as well.

How Big Data works

Big data gives you new insights that open up new opportunities and business models. Getting started involves three key actions:

1. Integrate

Big data brings together data from many disparate sources and applications. Traditional data integration mechanisms, such as ETL (extract, transform, and load), generally aren’t up to the task. Analyzing big data sets at terabyte, or even petabyte, scale requires new strategies and technologies.

During integration, you need to bring in the data, process it, and make sure it’s formatted and available in a form that your business analysts can get started with.
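
To make the integration step concrete, here is a minimal ETL sketch in Python. The file names, the customer_id column, and the SQLite target are all assumptions made for illustration; at terabyte or petabyte scale this work would be done with distributed tools such as Spark rather than a single script.

```python
# Minimal ETL sketch (assumed CSV inputs and a local SQLite target,
# purely illustrative; real big-data pipelines use distributed tools).
import sqlite3
import pandas as pd

def extract(paths):
    # Pull raw records from several source files into one frame.
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

def transform(df):
    # Normalize column names and drop obviously broken or duplicate rows.
    df.columns = [c.strip().lower() for c in df.columns]
    return df.dropna(subset=["customer_id"]).drop_duplicates()

def load(df, db_path="analytics.db"):
    # Land the cleaned data where analysts can query it.
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract(["crm_export.csv", "web_orders.csv"])))
```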

2. Manage

Big data requires storage. Your storage solution can be in the cloud, on premises, or both. You can store your data in any form you want and bring your desired processing requirements and necessary process engines to those data sets on an on-demand basis. Many people choose their storage solution according to where their data is currently residing. The cloud is gradually gaining popularity because it supports your current compute requirements and enables you to spin up resources as needed.

3. Analyze

Your investment in big data pays off when you analyze and act on your data. Get new clarity with a visual analysis of your varied data sets. Explore the data further to make new discoveries. Share your findings with others. Build data models with machine learning and artificial intelligence. Put your data to work.
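
As a toy illustration of the analyze step, the sketch below explores an already-integrated data set and fits a simple model. The churn.csv file and its column names are hypothetical placeholders; the point is only that once data is integrated and managed, familiar tooling can summarize it and train models on it.

```python
# Toy analysis sketch: explore integrated data and fit a simple model.
# The file name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                    # integrated, cleaned data
print(df.describe())                             # quick exploratory summary

X = df[["monthly_spend", "support_tickets"]]     # assumed feature columns
y = df["churned"]                                # assumed binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```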

Characteristics Of Big Data

Big data can be described by the following characteristics:

  • Volume
  • Variety
  • Velocity
  • Variability

(i) Volume — The name Big Data itself refers to an enormous size. The size of data plays a crucial role in determining its value, and whether a particular data set can be considered Big Data at all depends on its volume. Hence, ‘Volume’ is one characteristic that must be considered when dealing with Big Data.

(ii) Variety — The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources most applications considered. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, and so on is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

(iii) Velocity — The term ‘velocity’ refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines its real potential.

Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks and social media sites, sensors, and mobile devices. The flow of data is massive and continuous.

(iv) Variability — This refers to the inconsistency the data can show at times, which hampers the ability to handle and manage the data effectively.

Benefits of Big Data

The ability to process Big Data brings multiple benefits, such as:

  • Businesses can utilize outside intelligence while making decisions

Access to social data from search engines and sites like Facebook and Twitter enables organizations to fine-tune their business strategies.

  • Improved customer service

Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.

  • Early identification of risks to products and services, if any
  • Better operational efficiency

Big Data technologies can be used to create a staging area or landing zone for new data before deciding which data should be moved to the data warehouse. In addition, this integration of Big Data technologies with the data warehouse helps an organization offload infrequently accessed data, as sketched below.
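
One way to picture this landing-zone idea: new records arrive in a cheap staging store, the subset the business queries often moves on to the warehouse, and the rest stays in low-cost storage. The file paths, field names, and 90-day rule below are invented purely for illustration.

```python
# Illustrative routing of newly landed records: recent data heads to the
# warehouse, older data stays in cheap archival storage. Paths, field
# names, and the 90-day rule are assumptions, not any product's behavior.
import json
from datetime import datetime, timedelta

CUTOFF = datetime.now() - timedelta(days=90)

def destination(record):
    # Records carry a naive ISO-8601 "event_time" in this toy example.
    event_time = datetime.fromisoformat(record["event_time"])
    return "warehouse" if event_time >= CUTOFF else "archive"

with open("landing_zone.jsonl") as src, \
     open("to_warehouse.jsonl", "w") as hot, \
     open("to_archive.jsonl", "w") as cold:
    for line in src:
        record = json.loads(line)
        (hot if destination(record) == "warehouse" else cold).write(line)
```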

What Is Big Data Management?

The term “big data” usually refers to data stores characterized by the “3 Vs”: high volume, high velocity and wide variety.

Big data management is a broad concept that encompasses the policies, procedures and technology used for the collection, storage, governance, organization, administration and delivery of large repositories of data. It can include data cleansing, migration, integration and preparation for use in reporting and analytics.

Big data management is closely related to the idea of data lifecycle management (DLM). This is a policy-based approach for determining which information should be stored where within an organization’s IT environment, as well as when data can safely be deleted.

How Google Manages Its Big Data

Google, like any other company that generates a huge amount of data, uses the cloud to store its data.

Let’s look at some of Google’s server types and the tasks they are responsible for carrying out.

1. Web Servers

Google’s web servers are those that will probably resonate most with the common user, as they are responsible for handling the queries that we enter into Google Search. When a user enters a query, web servers carry out the process of interacting with other server types (e.g. index, spelling, ad, etc.) and returning results/serving ads in HTML format. Web servers are the ‘results-gathering’ servers, if you will.

2. Data-Gathering Servers

Data-gathering servers do the work of collecting and organizing information for Google. These servers “spider” or crawl the internet via Googlebot (Google’s web crawler), searching for newly-added and existing content. These servers have the responsibility of indexing content, updating the index and ranking pages based on Google’s search algorithms.

3. Index Servers

Google’s index servers are where a lot of the “magic” behind Google Search happens. These servers are responsible for returning lists of document IDs that correspond to “documents” (or indexed web pages) wherein the user’s query is present.
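
The core data structure behind any index server is an inverted index: a mapping from each term to the IDs of the documents that contain it. A toy version in Python (nothing like Google’s actual implementation, which is distributed, sharded, and heavily compressed) looks like this:

```python
# Toy inverted index: term -> set of document IDs containing the term.
from collections import defaultdict

docs = {
    1: "big data storage and analysis",
    2: "distributed storage systems",
    3: "data analysis at scale",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    # Return IDs of documents that contain every query term.
    result = None
    for term in query.split():
        ids = index.get(term, set())
        result = ids if result is None else result & ids
    return sorted(result or [])

print(search("data analysis"))   # -> [1, 3]
```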

4. Document Servers

Document servers store the document version of web page content. Each page’s content is saved in the form of JPEG files, PDF files, and more, all of which is stored across several servers depending on the type of information. Document servers provide snippets of information to users based on the search terms entered and are capable of returning entire documents as well.

The document IDs returned by index servers correspond to documents housed by these servers. Due to the influx of indexed documents each and every day, these servers require more disk space than others. If we were to answer the question — Where does Google store its data? — with one server type, it’d most certainly be the document server.

5. Ad Servers

Ad servers are vital to both Google’s revenue stream and the livelihood of thousands of businesses. These servers are responsible for managing the advertisements that are integral to Google’s AdWords and AdSense services. Web servers interact with these ad servers when deciding which ads (if any) should be displayed for a particular query.

6. Spelling Servers

We didn’t all get A’s in spelling during school and some of us need a little help when searching. If you have ever searched for something in Google and the results came up with the phrase, “Did you mean correctspelling,” know that a spelling server was at work. No matter how search terms are entered, spelling servers work to perform the search anyway, taking advantage of the opportunity to learn, correct and better locate what users seek.
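
A crude way to picture what a spelling server does is string similarity: suggest the known word that looks closest to what the user typed. The tiny dictionary below is made up, and Google’s real system learns corrections from query logs rather than a fixed word list.

```python
# Toy "Did you mean" suggestion based on string similarity against a
# tiny, made-up dictionary; real spelling correction learns from logs.
from difflib import get_close_matches

DICTIONARY = ["restaurant", "weather", "calendar", "translate"]

def did_you_mean(query):
    match = get_close_matches(query, DICTIONARY, n=1, cutoff=0.7)
    return match[0] if match else None

print(did_you_mean("restaurnat"))   # -> "restaurant"
```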

How Facebook Manages Its Big Data

RAM — Facebook has a well-known partitioning issue: it is basically impossible for them to divide-and-conquer the problem by splitting users into clusters that communicate only among themselves locally. Instead, users communicate with other users in an unpredictable fashion. Facebook solved the problem by creating a monster RAM layer, kept in memcached, with minimal latency, where constant updates from a single user to many others can be performed efficiently and in a timely manner.
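
The pattern at work here is a look-aside cache: check the RAM layer first and fall back to the database only on a miss. Below is a minimal sketch with a plain dict standing in for the memcached fleet and a stub function standing in for MySQL; it illustrates the idea, not Facebook’s actual code.

```python
# Look-aside cache sketch: a dict stands in for memcached and a stub
# function stands in for the MySQL lookup. Illustrative only.
cache = {}

def fetch_from_db(user_id):
    # Placeholder for the slow, authoritative database query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    if user_id in cache:              # fast path: served from RAM
        return cache[user_id]
    user = fetch_from_db(user_id)     # slow path: hit the database
    cache[user_id] = user             # populate the cache for next time
    return user

def update_user(user_id, fields):
    # On writes, update the database, then invalidate the cached copy
    # so the next read repopulates it with fresh data.
    cache.pop(user_id, None)
```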

Linear algorithms — because everything is in RAM, Facebook can assemble the news feed for every user on the fly. But the key is also that the assembled data is processed in linear time, since their algorithms, starting with EdgeRank, are linear. Were that not the case, the problem would have been far more difficult and challenging. A sketch of this kind of scoring pass follows.
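
A linear feed-ranking pass simply scores each candidate story once and sorts, so ranking N stories costs a single O(N) scoring sweep. The sketch below uses signals in the spirit of EdgeRank (affinity, content-type weight, time decay); the field names, numbers, and decay constant are invented for illustration.

```python
# Sketch of a linear-time feed scoring pass in the spirit of EdgeRank:
# each candidate story is scored once from a few signals. Signal values
# and the decay constant are invented for illustration.
import math

def score(story, now):
    # `now` and `created_at` are Unix timestamps in seconds.
    affinity = story["affinity"]          # how close viewer and author are
    weight = story["type_weight"]         # e.g. a photo outranks a plain like
    age_hours = (now - story["created_at"]) / 3600
    decay = math.exp(-age_hours / 24)     # older stories count for less
    return affinity * weight * decay

def rank_feed(stories, now):
    # One pass to score every story, then one sort: no pairwise comparisons.
    return sorted(stories, key=lambda s: score(s, now), reverse=True)
```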

Haystack (Facebook’s photo-storage system) is also periodically backed up to Hadoop, so that if Haystack goes down it can start back up from where it left off.

They also have something called Scribe, which is a log aggregator. They use many Scribe nodes to aggregate logs and back them up to Hadoop at intervals for processing.

Hadoop is used at Facebook for many tasks because of its processing power and its fault-tolerance capability.

They use MySQL heavily for daily operations, and many of their components run on MySQL.

MySQL at Facebook is automated to the point that it pretty much drives itself, with hardly any manual maintenance. But because the data is huge, they periodically move it to Hadoop and use Hive to process it. It is also worth remembering that Facebook is a major contributor to Apache Hive.

If you liked my blog on Big Data, please clap and share it.

In case of any suggestions, queries, or feedback, DM me on LinkedIn.

Thank you!!
