Data Science

What is Data

Data

Data is a collection of information. It takes many forms, is collected from many sources, and is all around us. It is often deliberately transformed and processed so that it is easier for people to understand and easier to analyse for valuable insights.

Data Categorisation

There are two categories of data, structured and unstructured, which differ in how they are collected and scaled. The former is often referred to as quantitative data and the latter as qualitative data.

Structured data is clearly defined and organised, making it easy for users to search and manipulate. This organisation makes the data quick to understand and straightforward to feed into machine learning algorithms.

Examples of structured data: phone numbers, dates, customer names, addresses, transaction information, credit card numbers, and so on.
Examples of applications of structured data: invoicing systems, sales transactions, contact lists, customer relationship management (CRM), product databases, online booking systems, and so on.

Unstructured data can be described as “anything else”. It is harder to process, analyse, and manipulate than structured data, but it is quick and easy to accumulate (collect).

Examples of unstructured data: text files, emails, social media data (posts, for instance), audio, video, mobile activity, images, and many more.
Examples of applications of unstructured data: word processing, media editing systems, chatbots, predictive data analytics tools, email clients, presentation software, and so on.
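
The difference between the two categories can be illustrated with a minimal Python sketch. The sample records, the email text, the function names, and the phone number pattern are all hypothetical and chosen only for illustration. Structured records can be searched directly by field name, while the same kind of information in free text has to be extracted first.

```python
import re

# Structured data: every record has the same named fields (hypothetical sample).
customers = [
    {"name": "Ada", "phone": "555-0101", "joined": "2021-03-14"},
    {"name": "Grace", "phone": "555-0102", "joined": "2022-07-01"},
]


def find_by_name(records, name):
    """Search structured records by a known field."""
    return [record for record in records if record["name"] == name]


# Unstructured data: free text with no predefined fields (hypothetical sample).
email_body = "Hi, this is Ada. You can call me on 555-0101 or 555-0199 after lunch."


def extract_phone_numbers(text):
    """Pull phone-like patterns out of free text.

    The NNN-NNNN pattern is an assumption made for this example only.
    """
    return re.findall(r"\b\d{3}-\d{4}\b", text)


matches = find_by_name(customers, "Ada")
phones = extract_phone_numbers(email_body)
```

The structured lookup relies only on the field name, whereas the text extraction depends on a pattern that may miss or misread values, which is one reason unstructured data is harder to analyse.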

Data Types

There are many data types, with each data type containing different information and allowing different operations on it. The following are some of the most commonly used data types within the data science community:

Boolean: True or False
Numeric: integer (int), float, complex
Text: string (str)
Union: float or long integer
Binary: bit, byte
Characters: char (‘A’ or ‘C’ or ‘4’)
Dates: dd/mm/yyyy
None: null or none value
Other: Dictionaries, Lists, Sets, Tuples
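
As a rough illustration, several of these types can be inspected in Python, along with an operation that each one allows. The sample values and variable names are assumptions made for this sketch. Note that Python has no separate character type, so a single character is simply a string of length one.

```python
from datetime import date

# One hypothetical sample value per data type.
samples = {
    "Boolean": True,
    "Integer": 42,
    "Float": 3.14,
    "Complex": 2 + 3j,
    "Text": "hello",
    "Binary": b"\x01\x02",
    "Date": date(2024, 1, 31),
    "None": None,
    "List": [1, 2, 3],
    "Tuple": (1, 2),
    "Set": {1, 2},
    "Dictionary": {"key": "value"},
}

# Map each label to the name of the Python type that holds it.
type_names = {label: type(value).__name__ for label, value in samples.items()}

# Different types allow different operations.
total = samples["Integer"] + samples["Float"]      # numeric addition
shout = samples["Text"].upper()                    # string method
formatted = samples["Date"].strftime("%d/%m/%Y")   # date as dd/mm/yyyy
has_key = "key" in samples["Dictionary"]           # dictionary membership test

for label, name in type_names.items():
    print(f"{label}: {name}")
```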

Data Formats

Data format refers to the way data is stored. Information comes in a wide variety of formats. Some formats are designed for specific data types, others encompass multiple data types, and most require particular software to open.

The following are some of the most commonly used data science data formats:

HTML: Used for the creation of web pages
PDF: Easy to access and exchange, can contain text, images, and other elements
XLSX: File extension used for Microsoft Excel spreadsheets
JSON: Text format for storing and transmitting data, easy to understand
ZIP: Archive file format supporting data compression
CSV: Text file, uses commas to separate values
Databases: Systems used to store data
XML: Text file, uses tags to define the structure of the information
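
To make the three text formats above more concrete, the following sketch writes the same small set of records as CSV, JSON, and XML, then reads each one back. It uses only the Python standard library. The records, field names, and XML tag names (customers, customer) are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records used for all three formats.
records = [
    {"name": "Ada", "city": "London"},
    {"name": "Grace", "city": "New York"},
]

# CSV: one header row, then comma-separated values.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "city"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# JSON: nested text representation of lists and dictionaries.
json_text = json.dumps(records)

# XML: tags define the structure of the information.
root = ET.Element("customers")
for record in records:
    customer = ET.SubElement(root, "customer")
    for field, value in record.items():
        ET.SubElement(customer, field).text = value
xml_text = ET.tostring(root, encoding="unicode")

# Read each format back into a list of dictionaries.
from_csv = list(csv.DictReader(io.StringIO(csv_text)))
from_json = json.loads(json_text)
from_xml = [
    {child.tag: child.text for child in customer}
    for customer in ET.fromstring(xml_text)
]

print(csv_text)
print(json_text)
print(xml_text)
```

All three round trips recover the same records here because every value is a string. CSV and XML carry no type information of their own, so numbers and dates would come back as text and need converting.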
