What is Big Data? Definition, Characteristics, Examples, and Benefits

Learn the definition of Big Data, its main characteristics (the 5Vs), examples of its application in everyday life, and its benefits for businesses and organizations.

What is Big Data

Big Data is a term we often hear in the context of data and information processing. But what exactly is Big Data? Big Data refers to collections of data that are extremely large in volume. It is typically gathered from many data sources and processed to produce the important information an individual or organization needs to make sound decisions.

Big Data has also developed into a field of study in its own right, focused on turning large, complex data into information that even laypeople can understand, for example through data visualization. Working with Big Data typically involves several stages before the data yields insight: the data is retrieved with specific queries, cleaned, and processed according to its type; it is then transformed, for example by computing new columns or filtering rows that meet certain conditions; and finally it is visualized so that conclusions or information can be obtained.
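To make these stages concrete, here is a minimal sketch in Python using pandas and matplotlib. The file name (sales.csv) and its columns (date, quantity, unit_price) are invented for illustration only and are not tied to any particular dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# 1. Retrieval: load the raw data (here from a CSV file rather than a database query).
df = pd.read_csv("sales.csv", parse_dates=["date"])

# 2. Cleaning: drop rows with missing values and remove duplicates.
df = df.dropna().drop_duplicates()

# 3. Processing: compute a new column and filter rows with a condition.
df["revenue"] = df["quantity"] * df["unit_price"]
recent = df[df["date"] >= "2024-01-01"]

# 4. Visualization: summarize and plot to draw a conclusion.
monthly = recent.groupby(recent["date"].dt.to_period("M"))["revenue"].sum()
monthly.plot(kind="bar", title="Monthly revenue")
plt.show()
```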


The benefits of Big Data include greater productivity and efficiency when formulating business strategies. For business owners and companies, Big Data makes data management and analysis considerably easier.

Data Variation in Big Data

In Big Data, “Variety” refers to the different forms data can take. These forms are usually divided into structured and unstructured data. Here are the differences, with examples.

Structured Data

Structured data is data with a well-organized format, typically stored in tables or relational databases with a known schema. This type of data is easy to search and process because it follows a consistent model with clear rows and columns.
Examples of Structured Data:

  • Transaction Data: Information about transactions made in a business, such as daily sales, purchases, or other financial transactions that are recorded systematically.
  • Financial Records: Financial data such as income statements, balance sheets, and cash flow statements structured in a tabular format that can be easily accessed and analyzed.
  • Customer Database: Information about customers such as names, addresses, phone numbers, and purchase history is stored in a relational database.
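As a small illustration of how structured data behaves, the sketch below stores a customer table in a relational database (SQLite) and queries it. The table and column names are purely illustrative.

```python
import sqlite3

# A customer table with fixed columns, stored in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT, total_spent REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, city, total_spent) VALUES (?, ?, ?)",
    [("Andi", "Bandung", 1250000.0), ("Budi", "Jakarta", 870000.0)],
)

# Because rows and columns are consistent, querying is straightforward.
for row in conn.execute(
    "SELECT name, total_spent FROM customers WHERE total_spent > 1000000"
):
    print(row)
```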

Unstructured Data

Unstructured data is data that has no fixed form or model. It tends to be raw and often takes the form of text, images, or video. This type of data is more difficult to process and analyze because it lacks a clear structure.
Examples of Unstructured Data:

  • Text: Emails, articles, blog posts, and other communications that are free text.
  • Social Media Engagement Data: Comments, likes, shares, and other interactions on social media platforms that often take the form of text, images, or videos without a consistent structure.
  • Multimedia Documents and Files: Word documents, PDFs, images, audio, and video that do not have structured metadata or a uniform format.
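By contrast, unstructured data must first have some structure extracted from it before it can be analyzed. The following sketch, using invented review texts, pulls simple word frequencies out of free-form text.

```python
from collections import Counter
import re

# Free-form text has no fixed columns, so we extract minimal structure first.
reviews = [
    "Great product, arrived quickly and works as expected!",
    "The packaging was damaged but customer service replaced it.",
]

tokens = []
for text in reviews:
    # Lowercase and keep only alphabetic words: a very small "cleaning" step.
    tokens.extend(re.findall(r"[a-z]+", text.lower()))

# Word frequencies are one simple structured view of unstructured text.
print(Counter(tokens).most_common(5))
```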

Big Data Examples


Big Data is a topic that is, and will continue to be, widely discussed in technology. Almost every activity we carry out in everyday life touches Big Data in some way. The following are examples of Big Data in everyday life.

Government agencies in Indonesia use Big Data to manage population databases efficiently. By collecting and analyzing population data from sources such as birth certificates, identity cards, and death certificates, an agency can ensure the integrity and accuracy of its population records.

The next example is the application of Big Data in e-commerce. E-commerce platforms use Big Data to drive product recommendations: analysis of shopping history, purchasing patterns, customer preferences, and browsing behavior determines which products are suggested to each consumer.
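A very simplified version of this idea can be sketched with item co-occurrence: recommend products that are frequently bought together with items in a customer's history. The order data below is invented, and real recommendation engines use far richer signals and models.

```python
from collections import Counter
from itertools import combinations

# Invented purchase history: each set is one customer's order.
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

# Count how often each pair of items appears in the same order.
pair_counts = Counter()
for order in orders:
    for a, b in combinations(sorted(order), 2):
        pair_counts[(a, b)] += 1

def recommend(item, top_n=2):
    # Score other items by how often they co-occur with the given item.
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if item == a:
            scores[b] += n
        elif item == b:
            scores[a] += n
    return [i for i, _ in scores.most_common(top_n)]

print(recommend("phone"))  # e.g. ['case', 'charger']
```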

Another example is the Internet of Things (IoT). Electronic devices such as connected sensors continuously produce output data, and the sheer volume of that data is what places it in the realm of Big Data.


Some other examples of Big Data that we often find include:

  • A movie recommendation system on a video streaming platform that will recommend a movie or show based on the user’s viewing patterns and history.
  • Sales analysis in large-scale retail stores that use large volumes of transaction data to optimize inventory and promotions.
  • Social media data monitoring by Digital marketing agencies to understand consumer sentiment towards their products or brands.
  • Traffic pattern analysis in mapping apps to predict travel times and the best routes.


Characteristics of Big Data


Big Data is commonly characterized by five elements, known as the 5Vs: Volume, Velocity, Variety, Veracity, and Value. We will take the case study at https://analysis.netray.id/meneropong-tahun-politik-2024-secara-pemberitaan-kicauan-warganet/ and examine how each characteristic of Big Data appears in it.


Volume (Amount)


Refers to the huge amount of data generated from various data sources such as social media, IoT sensors, and online transactions.


For example, Netray collected 700.5 thousand tweets from 79.8 thousand Twitter accounts using the keywords “2024 Presidential Election” and “2024 Presidential Candidate and 2024 Vice Presidential Candidate”, generating 515.6 million impressions and reaching up to 318.8 million accounts.


The challenge is how to store and manage data at scale.


Variety (Diversity)


Describes the various types of data available, whether structured, semi-structured, or unstructured.


Examples:
  • Structured data: database tables.
  • Semi-structured data: JSON, XML.
  • Unstructured data: images, videos, text.


In the Netray case, the collected data spans different types, including text, images, and videos.


The challenge: how to integrate and process all these types of data.


Velocity (Speed)


Refers to the speed of data flow or how quickly data is generated, processed, and manipulated.


For example, Twitter data regarding discussions on “2024 Presidential Election”, “2024 Presidential Candidates and 2024 Vice Presidential Candidates” continues to grow in real time.


The challenge is how to ensure this fast-moving data remains relevant and accurate when analyzed.


Value (Worth)


Refers to the benefit or value that can be derived from the data.


For example, from large and complex data, Netray produced visualizations that help make sense of the discourse around the 2024 Election, particularly the 2024 Presidential Election. The results take the form of insights into the issues currently trending for the keywords “2024 Presidential Election” and “2024 Presidential Candidate and 2024 Vice Presidential Candidate”.


The challenge is how to extract maximum value from big data.


Veracity (Truth/Accuracy)


Refers to the accuracy and reliability of data.


For example, the data taken from Twitter includes customer-support chats, social media comments, and reviews, so Netray applies sentiment analysis techniques to classify the text as positive, negative, or neutral.


The challenge is ensuring that the data used is credible and relevant for analysis.

Big Data Life Cycle

The Big Data Life Cycle is a series of systematic stages for working with Big Data, from data collection through to the use of the results to produce valuable new insights. It is applied in many sectors, such as health, finance, business, education, and technology, to support data-driven decision making. According to [1], the Big Data Life Cycle is generally divided into the following stages:

  1. Business Case Evaluation
  2. Data Identification
  3. Data Acquisition & Filtering
  4. Data Extraction
  5. Data Validation & Cleansing
  6. Data Aggregation & Representation
  7. Data Analysis
  8. Data Visualization
  9. Utilization of Analysis Results

The following is a brief explanation of the stages in the Big Data Life Cycle.

  • Business Case Evaluation: determining the actual purpose of the data analysis to be carried out.
  • Data Identification: identifying the various data sources needed to achieve the stated goals.
  • Data Acquisition & Filtering: collecting data from those sources, then filtering it so that only relevant data remains.
  • Data Extraction: extracting data from the various sources into a specific format that can be analyzed further.
  • Data Validation & Cleansing: ensuring the data obtained is valid and cleaning it of errors and inconsistencies.
  • Data Aggregation & Representation: combining data from different sources and presenting it in a format suitable for analysis.
  • Data Analysis: the core of the Big Data Life Cycle, in which the prepared data is analyzed to obtain new insights.
  • Data Visualization: displaying the analysis results as graphs or other visuals so they are easy to interpret and understand.
  • Utilization of Analysis Results: using the Big Data analysis results to support decision making or business strategy.
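As a rough illustration, the life cycle can be imagined as a pipeline of functions, each standing in for one stage. The sketch below assumes pandas and an invented transactions.csv source; each step is only a placeholder for the much richer work the real stage involves.

```python
import pandas as pd

def acquire() -> pd.DataFrame:
    # Data Acquisition & Filtering: collect raw data from a source.
    return pd.read_csv("transactions.csv")

def validate_and_clean(df: pd.DataFrame) -> pd.DataFrame:
    # Data Validation & Cleansing: drop missing or duplicated rows.
    return df.dropna().drop_duplicates()

def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    # Data Aggregation & Representation: reshape the data for analysis.
    return df.groupby("customer_id", as_index=False)["amount"].sum()

def analyze(df: pd.DataFrame) -> pd.DataFrame:
    # Data Analysis: derive an insight, here the top-spending customers.
    return df.sort_values("amount", ascending=False).head(10)

if __name__ == "__main__":
    result = analyze(aggregate(validate_and_clean(acquire())))
    print(result)  # Utilization of Analysis Results: feed into decisions or charts.
```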

Big Data Life Cycle Implementation Examples

The author applies the Big Data Life Cycle to the study in [2], Analysis of Patterns and Trends in COVID-19 Research, and examines which stages of the life cycle are carried out in that research.

Business Case Evaluation

Techniques such as BERT, TF-IDF, and LDA are used to analyze scientific literature related to COVID-19. The results of this analysis provide insight into research trends, the key topics being studied, patterns and trends in COVID-19-related scientific articles, and the general sentiment associated with these topics. This information can help organizations identify opportunities for innovation, new strategies, or further research in support of their response to the pandemic.

Data Identification

This study utilized a database of research abstracts on COVID-19. The primary data source was Kaggle, where the dataset is provided by several leading institutions and organizations. Data processing and pre-processing were then carried out with various techniques and tools, such as stop-word removal and natural language processing libraries such as NLTK.

Using this dataset and appropriate analysis techniques, this study aims to uncover patterns and trends in scientific articles related to COVID-19, as well as to conduct sentiment analysis on emerging research topics.

Data Acquisition

Data acquisition is the process of collecting raw data from its source (data collection).

  1. Data was accessed through Kaggle, a platform that provides open datasets from a range of sources, including AI2, CZI, MSR, Georgetown, NIH, and The White House.
  2. The relevant dataset for this project is a collection of metadata from scientific articles related to COVID-19, SARS-CoV-2, and other related coronaviruses, comprising approximately 253,545 articles, with the majority published in the last two years and spanning 17 different languages.
  3. Data is downloaded or accessed via the Kaggle API as per project requirements.

Data Filtering

This stage falls under the “pre-processing” category and involves the process of checking data for errors, incompleteness, or inconsistencies.

  1. Data downloaded from Kaggle will most likely be in CSV format or in another format that can be processed.
  2. The data filtering process is carried out to remove irrelevant data and prepare the data for further analysis. The steps taken include:
    1. Removal of Irrelevant Articles: Articles with topics unrelated to COVID-19, SARS-CoV-2, or other coronaviruses were removed from the dataset. Articles that did not include words such as “corona,” “sars,” and “covid” in the title were also removed.
    2. Abstract Selection: Article abstracts were used as the source text for the analysis. Therefore, articles that did not have an abstract were removed from the dataset.
    3. Removal of Non-English Articles: Articles in languages other than English were removed to facilitate further analysis and processing.
    4. Removal of Articles with Publication Years Before 2019: The focus of the analysis was on research related to the COVID-19 pandemic. Therefore, articles with publication years before 2019 were deemed irrelevant and removed from the dataset.
    5. Abstract Text Cleaning: The abstract text was converted to lowercase; punctuation, numbers, and special characters were removed; and stop words were stripped out to clean the data and prepare it for NLP analysis.
    6. Processing for TF-IDF and LDA Analysis: The abstract text was tokenized, stemmed, and lemmatized. Very frequent words were removed to avoid biasing the resulting topics, keeping the focus on words that are specific and relevant to the research topic.

This pre-processing yielded a final set of 57,921 articles that were highly relevant to COVID-19.
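A rough sketch of filtering steps along these lines is shown below, using pandas and NLTK stop words. The column names (title, abstract, language, publish_year) are assumptions about the metadata layout, not the exact schema of the Kaggle dataset.

```python
import re
import pandas as pd
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

df = pd.read_csv("metadata.csv")

# Keep only articles whose title mentions the topic keywords.
keywords = df["title"].str.contains("corona|sars|covid", case=False, na=False)
df = df[keywords]

# Drop articles without an abstract, non-English articles, and pre-2019 articles.
df = df.dropna(subset=["abstract"])
df = df[(df["language"] == "en") & (df["publish_year"] >= 2019)]

# Clean the abstract text: lowercase, strip punctuation/numbers, remove stop words.
stop = set(stopwords.words("english"))

def clean(text: str) -> str:
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in stop)

df["clean_abstract"] = df["abstract"].apply(clean)
print(len(df), "articles remain after filtering")
```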

Data Extraction

This stage includes Exploratory Data Analysis to understand the main characteristics of the dataset. An exploratory analysis was conducted on the abstracts of the 57,921 articles. Descriptive statistics show that the average abstract is 198 words and 1,164.3 characters long, with an average word length of 5.9 characters.

  1. Data extraction was performed using Term Frequency-Inverse Document Frequency (TF-IDF), alongside pre-trained models, specifically BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT. TF-IDF was used to:
    • Identify the most relevant words in the data. For example, some words with high TF-IDF scores include ‘patient’, ‘infect’, ‘disease’, and ‘pandemic’.
    • Distinguish frequency bands: high-frequency words appear in 25%-50% of articles, medium-frequency words in 10%-25%, and low-frequency words in 0.01%-1% of articles.
  2. Data extraction was also performed using an LDA (Latent Dirichlet Allocation) model:
    • Identifying important themes in the literature related to health and medical research.
    • Two LDA models were compared, one with four topics and one with five.
    • The five-topic LDA model provides more differentiated insights and appears more useful than the four-topic model, because it can distinguish more distinct subject areas. In the five-topic model, for example, the topics focus on health, research, treatment, genetics, and health statistics.
  3. BERT is a pre-trained language model capable of generating vector representations of input text. Because it is trained on large amounts of unlabeled data, it can capture relationships between words in a broader context:
    • The text is tokenized and transformed into a numeric vector with 768 elements.
    • To overcome the challenge of high dimensionality, the Uniform Manifold Approximation and Projection (UMAP) algorithm is used to reduce the data from 768 dimensions to just 10.
    • A clustering step then applies the HDBSCAN algorithm to derive topics from the large collection of text documents.
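The extraction pipeline described above (TF-IDF scores, transformer embeddings, UMAP reduction, HDBSCAN clustering) can be sketched roughly as follows. The library and model choices here, including the sentence-transformers model used as a stand-in for BERT/DistilBERT and the hypothetical cleaned_metadata.csv file, are illustrative assumptions rather than the study's exact configuration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# The cleaned abstracts from the filtering step; the file name is hypothetical.
abstracts = pd.read_csv("cleaned_metadata.csv")["clean_abstract"].tolist()

# TF-IDF: score words by how distinctive they are for each abstract.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
tfidf_matrix = tfidf.fit_transform(abstracts)

# Words with the highest average TF-IDF score across the corpus.
mean_scores = np.asarray(tfidf_matrix.mean(axis=0)).ravel()
top_idx = mean_scores.argsort()[-10:][::-1]
print([tfidf.get_feature_names_out()[i] for i in top_idx])

# Transformer embeddings: one 768-dimensional vector per abstract.
# "all-mpnet-base-v2" is only a stand-in for the BERT/DistilBERT models in the paper.
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(abstracts, show_progress_bar=True)

# UMAP: reduce the 768 dimensions to 10 before clustering.
reduced = umap.UMAP(n_components=10, random_state=42).fit_transform(embeddings)

# HDBSCAN: density-based clustering of the reduced vectors into topics.
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```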

Data Validation & Cleansing

Two medical students performed qualitative validation, testing three cluster sizes (60, 30, and 15) to find the optimal size for the HDBSCAN algorithm. The results showed that a cluster size of 15 provided the highest interpretability and usability, although there was more disagreement about the usability of the larger clusters. This underlines the importance of tuning cluster size to balance analytical depth against interpretative consensus.

Data Analysis

  1. Text Processing and Natural Language Processing (NLP)
    • NLP techniques are used to analyze the abstract text of scientific articles about COVID-19.
    • This includes processes such as tokenization, text cleaning, and feature extraction such as TF-IDF.
  2. Use of Pre-trained Language Models (BERT and DistilBERT)
    • Pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT are used to generate vector representations of the abstract text.
    • These vector representations are used for further analysis such as clustering and topic identification.
  3. Use of Clustering and Topic Modeling (LDA) Algorithms
    • Clustering methods such as HDBSCAN are applied to group articles based on similarities in their abstracts.
    • Topic modeling with LDA (Latent Dirichlet Allocation) is used to identify key topics in the scientific literature.
  4. Sentiment Analysis and Data Visualization
    • Sentiment analysis techniques are used to understand the opinions and sentiment contained in the scientific articles.
    • Data visualization is used to present the insights found in the COVID-19 dataset.
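As a small illustration of the topic modeling step (point 3), the sketch below fits a five-topic LDA model with scikit-learn on a few invented abstracts; the five-topic setting mirrors the model size discussed earlier, while the texts and other parameters are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented mini-corpus standing in for the cleaned abstracts.
abstracts = [
    "patients hospitalized with severe respiratory disease",
    "vaccine trials and immune response in adults",
    "genome sequencing of the virus and its variants",
    "public health statistics and mortality reporting",
    "treatment outcomes for infected patients in intensive care",
]

# Bag-of-words counts are the usual input representation for LDA.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(counts)

# Print the top words per topic to interpret what each topic is about.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```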

Data Visualization

  1. BERT cluster visualization: a UMAP projection is used to visualize the relationships between clusters in the data, providing a visual sense of how topics are distributed.
  2. Sentiment Analysis: TextBlob is used to analyze the sentiment polarity of the abstracts. The data shows that neutral sentiment dominates across the corpus and within specific clusters.
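A minimal sketch of sentiment polarity scoring with TextBlob, of the kind used on the abstracts, might look like this; the sample sentences are invented.

```python
from textblob import TextBlob

texts = [
    "The treatment showed promising results in most patients.",
    "Severe complications were reported in several cases.",
]

for text in texts:
    # Polarity ranges from -1 (negative) to +1 (positive); 0 is neutral.
    polarity = TextBlob(text).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{label:8s} ({polarity:+.2f}) {text}")
```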


The Relationship Between Business Process Management (BPM) and Big Data


Business Process Management (BPM) is a systematic approach to designing, running, monitoring, and optimizing an organization's business processes with the aim of increasing efficiency. Big Data Analytics, by processing large volumes of data, can make it easier to manage the business processes currently running in a company.


Big Data Analytics plays an important role in finding patterns and correlations within the data an organization owns, producing new insights from that data. It can reveal consumer behavior, predict market trends, and detect misuse or fraud in information processing.


The use of big data analytics for Business Process Management (BPM) can support a company's core business in areas such as:

  • Process optimization
  • Business result projections
  • Real-time monitoring
  • Customer insight
  • Risk management



Big Data Analytics in Marketing Business

The use of big data analytics in marketing service businesses can open up various opportunities to optimize existing processes. Some processes that can be optimized through big data analysis in this type of business include:


Sentiment Analysis

Big data analytics can be used to monitor and analyze customer sentiment towards your brand on social media and other online platforms. This helps you understand customer perceptions of your brand and respond quickly to the feedback given.

Content Performance

Through big data analysis, you can understand the performance of your marketing content across different channels and platforms. This allows you to optimize your marketing budget allocation and focus on the channels that deliver the best results.

Campaign Performance Measurement

Big data analytics allows you to track and measure the performance of your marketing campaigns in greater detail. You can view metrics such as conversion rates, engagement rates, and ROI to evaluate the effectiveness of your campaigns and make adjustments as needed.

Predictive Modeling

By leveraging big data analytics, you can perform predictive modeling to predict future customer behavior. This allows you to identify new sales opportunities, estimate customer value, and optimize your marketing strategy.
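As a toy illustration of this kind of predictive modeling, the sketch below fits a logistic regression that estimates whether a customer will purchase again, using invented features and data; a real model would draw on far more behavioral signals.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per customer: [orders in the last year, days since last purchase]
X = np.array([[12, 5], [1, 300], [8, 20], [2, 180], [15, 3], [0, 400]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = purchased again, 0 = did not

model = LogisticRegression().fit(X, y)

# Score a new customer: estimated probability of a repeat purchase.
new_customer = np.array([[5, 45]])
print(model.predict_proba(new_customer)[0, 1])
```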


Conclusion


Big Data is a large and complex collection of data that enables more effective, informed decision-making. Through stages of analysis such as data cleaning, processing, and visualization, Big Data helps individuals and organizations understand data and generate valuable insights. With broad applications, from e-commerce to the Internet of Things (IoT), Big Data is a key pillar in digital transformation.

Optimize your business decisions with Big Data-based solutions. Learn more about how this technology can help your business thrive in the digital era with Telkom University.

References

[1] Erl, T., Khattak, W., & Buhler, P. (2016). Big data fundamentals: Concepts, Drivers & Techniques. Pearson.

[2] Dornick, C., Kumar, A., Seidenberger, S., Seidle, E., & Mukherjee, P. (2021). Analysis of patterns and trends in COVID-19 research. Procedia Computer Science, 185, 302–310. https://doi.org/10.1016/j.procs.2021.05.032

Author: Meilina Eka A


Meilina Eka Ayuningtyas is building her career in Information Technology, Digital Marketing, and Data Analytics. With an educational background in Telecommunication Technology, Meilina combines technical expertise with digital marketing strategies to support business growth and enhance online visibility across various industries.

