Big data is a field concerned with ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be handled by traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data sources. Big data was originally associated with three key concepts: volume, variety and velocity. When handling big data, organizations may not sample but simply observe and track what happens; as a result, big data often includes data sets whose size exceeds the capacity of traditional software to process within an acceptable time and at an acceptable value.
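To make the false-discovery point concrete, the short sketch below (an illustration using numpy and scipy, not part of the article or of any standard big data toolkit) tests many pure-noise attributes against a pure-noise target. At a 0.05 significance threshold, roughly 5% of the attributes will appear related to the outcome by chance alone, and the absolute number of such spurious findings grows as more columns are added.

```python
# Illustrative sketch: why adding columns raises the false discovery rate.
# Every attribute here is pure noise, yet at a 0.05 significance threshold
# about 5% of them will still look "related" to the target by chance.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=0)
n_rows, n_cols, alpha = 1_000, 500, 0.05

target = rng.normal(size=n_rows)                # noise-only outcome
attributes = rng.normal(size=(n_rows, n_cols))  # noise-only columns

false_hits = 0
for j in range(n_cols):
    _, p_value = pearsonr(attributes[:, j], target)
    if p_value < alpha:
        false_hits += 1                         # "significant" purely by chance

print(f"{false_hits} of {n_cols} noise attributes passed the 0.05 test "
      f"(expected about {int(alpha * n_cols)})")
```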
These three characteristics were first identified by Doug Laney, then an analyst at Meta Group Inc., in a report published in 2001; Gartner further popularized them after it acquired Meta Group in 2005. More recently, several other Vs have been added to descriptions of big data, including veracity, value and variability.
Although big data doesn’t equate to any specific volume of data, big data deployments often involve terabytes (TB), petabytes (PB) and even exabytes (EB) of data captured over time.
Big data comes from myriad sources, such as business transaction systems, customer databases, medical records, internet clickstream logs, mobile applications, social networks, scientific research repositories, machine-generated data and real-time data from sensors used in internet of things (IoT) environments. The data may be left in its raw form in big data systems or preprocessed with data mining tools or data preparation software so that it is ready for particular analytics uses.
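As a hedged illustration of that preprocessing step, the sketch below uses pandas to turn a hypothetical raw clickstream export into a cleaned table ready for analytics; the file name, column names and cleaning rules are assumptions made for the example, not a standard schema or the preparation workflow described above.

```python
# Minimal preprocessing sketch, assuming a hypothetical CSV export named
# "clickstream_raw.csv" with "user_id" and "timestamp" columns; the schema
# and cleaning rules are illustrative only.
import pandas as pd

raw = pd.read_csv("clickstream_raw.csv")

prepared = (
    raw.dropna(subset=["user_id", "timestamp"])   # drop rows missing key fields
       .drop_duplicates()                          # remove repeated events
       .assign(timestamp=lambda d: pd.to_datetime(d["timestamp"], errors="coerce"))
       .loc[lambda d: d["timestamp"].notna()]      # keep parseable timestamps only
)

# A columnar format such as Parquet (requires pyarrow) suits downstream analytics.
prepared.to_parquet("clickstream_prepared.parquet", index=False)
```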