What is Data
Data
Data is a collection of information. It takes a variety of forms, is collected from multiple sources, and is all around us. It is often deliberately transformed and processed into a form that is easier for people to read and easier to analyse in order to extract valuable insights.
Data Categorisation
There are two broad categories of data, structured and unstructured, which differ in how they are collected and scaled. The former is often referred to as quantitative data and the latter as qualitative data.
Structured data is clearly defined and organised, making it easy for users to manipulate and search through. This organisation makes the data quick to understand and straightforward to feed into machine learning algorithms.
Examples of structured data: phone numbers, dates, customer names, addresses, transaction information, credit card numbers, and so on.
Examples of applications of structured data: invoicing systems, sales transactions, contact lists, customer relationship management (CRM), product databases, online booking systems, and so on.
Unstructured data can be categorised as “anything else”. It is harder to process, analyse, and manipulate than structured data, but it is quick and easy to collect.
Examples of unstructured data: text files, emails, social media data (posts, for instance), audio, video, mobile activity, images, and many more.
Examples of applications of unstructured data: word processing, media editing systems, chatbots, predictive data analytics tools, email clients, presentation software, etc.
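The difference between the two categories can be illustrated with a minimal Python sketch. The customer records, the email text, and the phone number pattern are all hypothetical and chosen only for illustration; real data would need more careful handling.

```python
import re

# Structured: every record has the same named fields (hypothetical sample data).
customers = [
    {"name": "Ana", "phone": "555-0101", "joined": "2021-03-14"},
    {"name": "Ben", "phone": "555-0102", "joined": "2022-07-01"},
]

# Searching structured data is a direct lookup on a known field.
joined_2022 = [c["name"] for c in customers if c["joined"].startswith("2022")]

# Unstructured: similar information buried in free text (hypothetical email).
email_body = "Hi, this is Ben. Call me on 555-0102 when you get a chance."

# Extracting the phone number needs a pattern, which assumes a
# particular number layout (three digits, a hyphen, four digits).
phones_found = re.findall(r"\b\d{3}-\d{4}\b", email_body)

print(joined_2022)
print(phones_found)
```

The structured lookup relies only on field names, while the unstructured case depends on a guess about how the information is written, which is one reason unstructured data is harder to analyse.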
Data Types
There are many data types, with each data type containing different information and allowing different operations on it. The following are some of the most commonly used data types within the data science community:
Boolean | True or False |
Numeric | integer (int), float, complex |
Text | string (str) |
Union | float or long integer |
Binary | bit, byte |
Characters | char (‘A’ or ‘C’ or ‘4’) |
Dates | dd/mm/yyyy |
None | null or none value |
Other | Dictionaries, Lists, Sets, Tuples |
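As a rough illustration, the sketch below shows how most of the types in the table look in Python. This is one language's view, not a universal mapping: Python has no separate character type (a single character is a string of length one), and the sample values are hypothetical.

```python
from datetime import date

# One sample value per row of the table (values are illustrative only).
values = {
    "Boolean": True,
    "Integer": 42,
    "Float": 3.14,
    "Complex": 2 + 3j,
    "Text": "data",
    "Binary": b"\x01\x02",
    "Character": "A",          # Python represents a char as a 1-character str
    "Date": date(2024, 1, 31),
    "None": None,
    "List": [1, 2, 3],
    "Tuple": (1, 2),
    "Set": {1, 2},
    "Dictionary": {"key": "value"},
}

# Ask Python what type each value has.
type_names = {label: type(value).__name__ for label, value in values.items()}
for label, name in type_names.items():
    print(f"{label:<10} -> {name}")

# The same operator behaves differently depending on the data type.
numeric_sum = 2 + 3        # integer addition
text_sum = "2" + "3"       # string concatenation

# A date can be rendered in the dd/mm/yyyy layout shown in the table.
formatted_date = values["Date"].strftime("%d/%m/%Y")
```

The `+` example shows why the data type matters: the same symbols produce a number in one case and a longer string in the other.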
Data Formats
Data format refers to the way data is stored. Information can come in a wide variety of formats. Some formats are designed for specific data types, others encompass multiple data types, and most require particular software to open.
The following are some of the data formats most commonly used in data science:
HTML | Used for creation of web pages |
Easy access and exchange, can contain text, images, and other elements | |
XLSX | Extension file used for Microsoft Excel spreadsheets |
JSON | Text format for storing and transmitting data, easy to understand |
ZIP | Archive file format supporting data compression |
CSV | Text file, uses commas to separate values |
Databases | Systems used to store data |
XML | Text file, uses tags to define the structure of the information |
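To make the text-based formats concrete, here is a minimal sketch that stores the same two records as CSV, JSON, and XML using only the Python standard library, then reads each one back. The records and the element names (`customers`, `customer`) are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records used for all three formats.
records = [
    {"name": "Ana", "city": "Lisbon"},
    {"name": "Ben", "city": "Leeds"},
]

# CSV: one header row, then comma-separated values per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "city"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# JSON: nested text representation of lists and key/value pairs.
json_text = json.dumps(records, indent=2)

# XML: tags define the structure of the information.
root = ET.Element("customers")
for record in records:
    customer = ET.SubElement(root, "customer")
    for field, value in record.items():
        ET.SubElement(customer, field).text = value
xml_text = ET.tostring(root, encoding="unicode")

# Read each format back into a list of dictionaries.
from_csv = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
from_json = json.loads(json_text)
from_xml = [
    {child.tag: child.text for child in customer}
    for customer in ET.fromstring(xml_text)
]

print(csv_text)
print(json_text)
print(xml_text)
```

All three round trips recover the same records here because every value is a string. CSV and XML store values as text, so numbers or dates would need explicit conversion after reading, whereas JSON preserves numbers, booleans, and null.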