This article discusses big data characteristics, tools, and challenges. Big data is a term used to describe complex and huge volumes of data, both structured and unstructured, that are beyond the capabilities of traditional computing and data-processing tools and techniques. The rate of data generation has increased sharply, particularly with the advent of smart handheld devices integrated with Internet of Things (IoT) environments and social media platforms: the number of data-processing devices has grown beyond 20 billion, and this number is expected to reach 50 billion by the year 2030. Effective analysis of this big data, to derive meaningful and useful relationships within it and create real value for enterprises, has therefore become a challenge. This useful information can give organizations a competitive edge and help them re-strategize their business actions and decisions.
Big Data Characteristics
The data being generated today is not of a single type; it belongs to diverse categories, for example text, images, video, and audio, integrated with data streams from different sources such as business applications, finance, education, industry, weather, stock markets, healthcare, banking, social media, and the Web. Interestingly, the scientific and engineering research community has also treated these rapidly growing volumes of data as an opportunity to develop novel methodologies. These not only enable efficient processing but also provide deeper insights through advanced analytics, uncovering patterns and meaningful relationships in more innovative ways.
Big data analytics supports better decision making with the help of data mining, machine learning, and artificial intelligence techniques. These techniques have been applied effectively in many domains. For example, businesses use big data to increase sales by understanding consumer behaviour, while credit card companies use analytics to detect credit card fraud. Similarly, in the healthcare domain, big data is used for disease diagnosis and for understanding the relationships among multiple contributing disease variables. The characteristics of the term “big data”, also called the “Big Data Vs”, are explained below.
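The fraud-detection idea mentioned above can be sketched very simply: score each new transaction against an account's past behaviour. This is a minimal illustrative sketch using a z-score rule; the amounts, threshold, and function names are assumptions for illustration, not an actual industry method.

```python
# Minimal anomaly-detection sketch for fraud screening: flag a transaction
# whose amount deviates strongly from the account's historical mean.
# Baseline amounts and the 3-sigma threshold are illustrative assumptions.
from statistics import mean, stdev

def is_suspicious(amount, baseline, threshold=3.0):
    """Flag an amount more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(amount - mu) > threshold * sigma

# Illustrative past transaction amounts for one account
baseline = [25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 27.0, 33.0]
print(is_suspicious(2500.0, baseline), is_suspicious(36.0, baseline))  # True False
```

Real systems combine many such signals with machine learning models, but the principle is the same: learn normal behaviour from historical data, then flag deviations.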
Volume refers to the large amount of data generated from different sources, such as social media platforms, meteorological data, and industrial data. For example, Facebook generates around four petabytes of data daily. That is the volume generated by just one social network; if we add the data generated by all other social media networks, Web applications, and other sources, the total volume becomes far too large to be processed by conventional tools and methodologies.
Variety refers to the different types of data, for example text, audio, video, images, log files, and GPS data, collected or generated through different sources, including sensors, social networks, smartphones, and other devices.
Velocity refers to the speed of data generation: information flowing into and out of systems, and real-time and streamed data arriving from multiple sources. The rate at which data is generated also demands efficient tools and methods to process this data in time.
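High-velocity data is typically handled incrementally, one element at a time, rather than being stored in full before processing. The sketch below, with illustrative sensor readings and window size, shows a sliding-window average computed as each value arrives:

```python
# Sketch of incremental stream processing: a sliding-window average,
# updated per arriving value instead of after collecting the whole dataset.
from collections import deque

def sliding_average(stream, window=3):
    """Yield the mean of the last `window` values for each incoming value."""
    buf = deque(maxlen=window)  # old values fall off automatically
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

readings = [10, 20, 30, 40, 50]  # illustrative sensor stream
print(list(sliding_average(readings)))  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

Because the window has a fixed size, memory use stays constant no matter how long the stream runs, which is the property velocity-oriented tools are built around.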
Veracity refers to the quality and truthfulness of the data. Veracity, like the other big data characteristics of volume, variety, and velocity, is significantly important because if data quality is not high, it is difficult to make accurate inferences from the data.
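In practice, veracity is often enforced by validating records against basic quality rules before they enter analysis. The record fields and rules below are illustrative assumptions, not a standard:

```python
# Hedged sketch of a veracity check: keep only records whose required fields
# are present and whose values fall in plausible ranges.
def is_valid(record):
    """Accept a patient-reading record only if it passes basic quality rules."""
    required = ("id", "age", "temperature")
    if any(field not in record for field in required):
        return False
    return 0 <= record["age"] <= 120 and 30.0 <= record["temperature"] <= 45.0

records = [
    {"id": 1, "age": 34, "temperature": 36.8},   # plausible
    {"id": 2, "age": -5, "temperature": 36.6},   # impossible age
    {"id": 3, "temperature": 37.1},              # missing field
]
clean = [r for r in records if is_valid(r)]
print(len(clean))  # only the first record survives
```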
Value refers to realizing the true potential of big data analytics to create real benefits and value for organizations.
Big Data Challenges
The processing, analysis, and storage of the huge volumes of data constantly generated from diverse sources entail serious challenges for enterprises. These challenges include:
Security of the generated data is a principal concern and, despite continuous preventive measures, remains an open issue. Since it is impractical for enterprises to store large volumes of big data locally, they choose to store the data on third-party cloud servers. However, organizations then have limited control over and visibility of their data in the cloud, which gives rise to critical security threats and apprehensions for the consumer organizations.
Due to the rapid increase in data, storing big data on physical media has also emerged as an issue. Since big data tools emphasize processing large amounts of data in reasonable time, utilizing cloud computing and other scalable storage systems is highly recommended. Another problem associated with storage is cost. Handling enormous amounts of data in-house is difficult for organizations, as many cannot afford to establish their own data centres. They may instead opt for external storage from a provider, but that too carries a recurring cost. The cost of storing big data therefore remains an issue that needs the attention of experts.
Appropriate model selection for analysing large-scale data is critical. Efficient analysis algorithms that produce timely results are required to obtain useful information from huge volumes of data. In terms of data analysis, the capabilities of current tools and approaches to integrate data streams from multiple sources are still limited. The reason is the heterogeneity of the data: the integration and subsequent curation of data become complicated, which ultimately makes the analysis difficult.
Data Curation Issues
Since data from multiple sources grows continuously, it gives rise to the issue of data heterogeneity, because data originating from different sources has different representations. For example, data from social networks and other streaming sources arrives in different formats and is mostly unstructured. Transformation, annotation, and curation of this data, to make it effective and usable for further analysis, is therefore a problem that needs attention from the research community.
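The curation step described above amounts to mapping source-specific representations onto one common schema before analysis. The source formats and field names in this sketch are assumptions made for illustration:

```python
# Minimal curation sketch: records about the same entity arrive from different
# sources in different shapes and are normalized to a common schema.
def normalize(record, source):
    """Map a source-specific record to a common {user, text, timestamp} schema."""
    if source == "social":  # hypothetical structured feed
        return {"user": record["handle"], "text": record["post"], "timestamp": record["ts"]}
    if source == "log":     # hypothetical pipe-delimited server log line
        user, ts, text = record.split("|", 2)
        return {"user": user, "text": text, "timestamp": int(ts)}
    raise ValueError(f"unknown source: {source}")

a = normalize({"handle": "alice", "post": "hello", "ts": 1710000000}, "social")
b = normalize("alice|1710000001|login ok", "log")
print(a["user"] == b["user"])  # True: both sources now share one schema
```

Once every source is reduced to the same schema, downstream integration and analysis can treat the combined stream uniformly.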
Big Data Tools
There are several tools for managing big data. A few popular tools are listed below.
- Apache Spark
- Apache Storm
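Tools such as Apache Spark parallelize map-and-reduce style transformations over data partitioned across a cluster. The model itself can be illustrated in plain single-machine Python; this is a conceptual sketch of the programming model only, not the Spark API, and the input lines are invented:

```python
# Conceptual word-count sketch in the map/reduce style that cluster
# frameworks like Spark distribute across many machines.
from collections import Counter
from itertools import chain

lines = ["big data tools", "big data challenges", "data value"]

# Map phase: emit a (word, 1) pair for every word in every line.
pairs = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# Reduce phase: sum the counts per word (done per-key in parallel on a cluster).
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["data"])  # 3 — "data" appears in every line
```

The appeal of such frameworks is that the same two-phase program scales from one laptop to thousands of nodes without changing the logic.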
Given the inclination enterprises have shown toward big data tools and techniques to add value to their businesses, it is anticipated that big data will soon revolutionize society by providing data-driven support for strategic decisions.