Big Data
The term big data refers to collections of computer data so large, complex, and rapidly produced that they are practically impossible to process with traditional data management and analysis tools. Accessing and storing large amounts of data for analysis has long been possible, but big data poses a new challenge for data analysis precisely because of its enormous scale: it is collected and produced in a wide variety of formats and at remarkable speed.
Characteristics of big data
Big data was initially characterized in the literature by the "three Vs": volume, variety, and velocity. Two further Vs were later added alongside these: veracity and value. The following sections describe the characteristics that make up the "five Vs" model.
Volume
The term volume refers to the quantity of data generated and stored. The size of the data determines the value and potential hidden within it, and also determines whether a particular data set qualifies as big data in the first place. Such data sets are generally on the order of terabytes or petabytes.
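To make these orders of magnitude concrete, the following back-of-the-envelope sketch estimates how long a data stream takes to accumulate a petabyte. The ingest rate and event size are invented for illustration, not taken from any real system.

```python
# Hypothetical sizing exercise: a stream of 10,000 events per second,
# at an assumed 1 KB per event, accumulating toward one petabyte.
events_per_second = 10_000
bytes_per_event = 1_000          # assumed average event size (1 KB)
petabyte = 10 ** 15              # 1 PB in bytes (decimal units)

seconds = petabyte / (events_per_second * bytes_per_event)
days = seconds / 86_400          # seconds in a day
print(round(days))  # roughly 1157 days at this rate
```

Even at a steady 10 MB/s, reaching petabyte scale takes years, which is why volume alone reshapes storage and processing architecture.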
Variety
This refers to the type and nature of the data, which may be structured, semi-structured, or entirely unstructured. Data management technologies such as relational databases can manage structured data efficiently and effectively, but the changing nature of data has challenged the use of such technologies for big data and has led to the advent of new, purpose-built technologies. With these technologies, data arrives in all kinds of formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, and financial transactions.
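The contrast between structured and semi-structured data can be sketched in a few lines: the same kind of record arrives as fixed-column CSV from one source and as JSON with optional fields from another, and must be unified before analysis. All sample records and field names here are invented for illustration.

```python
import csv
import io
import json

# Hypothetical inputs: the same "order" concept arriving in two formats.
csv_data = "order_id,amount\n1001,19.99\n1002,5.00\n"
json_data = '[{"order_id": 1003, "amount": 7.50, "note": "gift"}]'

records = []

# Structured data: every row has the same fixed columns.
for row in csv.DictReader(io.StringIO(csv_data)):
    records.append({"order_id": int(row["order_id"]),
                    "amount": float(row["amount"])})

# Semi-structured data: fields can vary per record (e.g. an optional "note"),
# so only the fields common to both sources are kept.
for obj in json.loads(json_data):
    records.append({"order_id": obj["order_id"],
                    "amount": obj["amount"]})

print(len(records))  # 3 records unified from two formats
```

A relational schema would force the optional fields into rigid columns up front; big-data pipelines typically defer that decision and normalize at read time, as above.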
Velocity
This refers to the speed with which data is generated and processed to meet requirements, which can sometimes demand real-time availability. Compared to small data, big data is produced continuously. Two metrics related to big data are the generation rate and the update rate. The generation rate is the frequency with which new data is produced; the update rate is the frequency with which existing data is modified. The update rate matters for understanding how quickly data changes and therefore how quickly it must be processed and analyzed to keep the resulting information current.
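The two metrics just described can be computed from any timestamped event log. The sketch below uses an invented five-event log in which each event is tagged as a creation or a modification; the timestamps and labels are assumptions for illustration only.

```python
from datetime import datetime

# Hypothetical event log: (timestamp, kind), where kind is "create"
# for newly generated data and "update" for modified data.
events = [
    (datetime(2024, 1, 1, 0, 0, 0), "create"),
    (datetime(2024, 1, 1, 0, 0, 10), "create"),
    (datetime(2024, 1, 1, 0, 0, 20), "update"),
    (datetime(2024, 1, 1, 0, 0, 30), "create"),
    (datetime(2024, 1, 1, 0, 0, 40), "update"),
]

# Observation window: time between first and last event, in seconds.
window = (events[-1][0] - events[0][0]).total_seconds()
creates = sum(1 for _, kind in events if kind == "create")
updates = sum(1 for _, kind in events if kind == "update")

generation_rate = creates / window   # new records per second
update_rate = updates / window       # modifications per second
print(generation_rate, update_rate)  # 0.075 0.05
```

In a real system these rates are computed over sliding windows of a live stream rather than a fixed list, but the definitions are the same.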
Veracity
This term refers to the quality of data. Because data comes from so many different sources, it is difficult to connect, match, cleanse, and transform it across the different systems that collect or manipulate it. Companies need to connect and correlate relationships and hierarchies between data; otherwise, the data management process can quickly spiral out of control.
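The matching and cleansing problem can be illustrated with a minimal sketch: two source systems hold overlapping customer records with inconsistent formatting, and a normalization step is needed before duplicates can be recognized. The records, field names, and normalization rules here are all invented assumptions.

```python
# Hypothetical customer records from two source systems, with
# inconsistent whitespace and letter case.
source_a = [{"name": "  Ada Lovelace ", "email": "ADA@EXAMPLE.COM"}]
source_b = [{"name": "Ada Lovelace", "email": "ada@example.com"},
            {"name": "Grace Hopper", "email": "grace@example.com"}]

def normalize(record):
    # Trim whitespace and lower-case the email so that equivalent
    # records from different systems produce the same matching key.
    return (record["name"].strip(), record["email"].strip().lower())

merged = {}
for record in source_a + source_b:
    merged[normalize(record)] = record  # later sources overwrite duplicates

print(len(merged))  # 2 distinct customers after cleansing
```

Without the normalization step, the two Ada Lovelace records would be treated as different people, which is exactly the kind of quality problem veracity describes.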
Value
This refers to the value of the information that can be obtained from processing and analyzing large data sets. It can be measured by evaluating the other qualities of big data, or it can be represented by the profitability of the information retrieved from the analysis process.