Making data easier to read, preprocessing it, and removing noisy data are a data scientist's day-to-day tasks. Pandas is an open-source library used by machine learning practitioners for data analysis and manipulation.
If you are starting your machine learning journey, you will come across the name Pandas. In this series I will explain Pandas from beginning to end.
Contents of Chapter 1:
- Why we need Pandas Library.
- Introduction to Data frames and Series
- Different ways to Import a Dataset in Pandas
1. Why we need Pandas Library:
The initial steps of machine learning are to gather the data and then prepare it. Pandas makes this data analysis and manipulation easier. Internally, pandas is built on top of NumPy, and its plotting features use Matplotlib.
When we import data from different sources, we may need to join it together in a single place, do some statistical analysis, and deal with missing or noisy data. Pandas can do all of this for you, which makes the library very helpful.
Importing the library:
import pandas as pd
2. Introduction to Data frames and Series
Before we dive into Pandas in practice, let's understand its two data structures, DataFrame and Series.
A Series is a one-dimensional labelled array that can hold any data type, e.g. int, string, float, Python objects, etc.
Syntax of Series:
series = pd.Series(data=YOUR_DATA, index=INDEX)
The index plays an important role, as it holds the axis labels of the data. The length of “data” should equal the length of the index. Note: It's okay if you don't specify the index; in that case pandas creates an automatic index for you with the values [0, 1, 2, 3, …, N-1], where N is the length of the data.
You can specify the Series data and index separately using lists, or you can pass a Python dict of key-value pairs, where each key becomes an index label and each value becomes a data value.
Way 1: (Series created with index and data)
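Here is a small sketch of this approach. The marks and the student names are made-up values, used only for illustration:

```python
import pandas as pd

# Hypothetical data: marks for three students, labelled by name
marks = pd.Series(data=[85, 92, 78], index=["alice", "bob", "carol"])

print(marks)
print(marks["bob"])  # look up a value by its index label
```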
Way 2: (Series created from a dict, with keys as index and values as data points)
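A minimal sketch of the dict approach. The fruit names and prices are assumed values for the example:

```python
import pandas as pd

# Hypothetical data: each dict key becomes an index label,
# each dict value becomes the data point stored under that label
prices = pd.Series({"apple": 1.5, "banana": 0.5, "cherry": 3.0})

print(prices)
print(prices.index.tolist())  # the dict keys, in insertion order
```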
Way 3: (Series without an index)
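A short sketch with illustrative temperature values. Since no index is passed, pandas generates the default one starting at 0:

```python
import pandas as pd

# Hypothetical data with no index given
temps = pd.Series([21.5, 23.0, 19.8])

print(temps)
print(temps.index.tolist())  # automatically generated labels 0, 1, 2
```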
Note: Try adding two Series together; elements that share the same index label get added.
Example that you can try out:
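One possible version of this experiment, using two small made-up Series that share the labels "x", "y" and "z":

```python
import pandas as pd

# Two hypothetical Series with identical index labels
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["x", "y", "z"])

# Addition is aligned on the index: a["x"] + b["x"], a["y"] + b["y"], ...
total = a + b
print(total)
```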
Have a question? What if we add two Series whose indexes differ? Let's try it out.
Since all the index labels are different, no elements line up, so every value in the result is null (NaN).
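A sketch of that experiment, again with made-up values. The two Series below have no index label in common:

```python
import pandas as pd

# Two hypothetical Series whose index labels do not overlap
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
c = pd.Series([10, 20, 30], index=["p", "q", "r"])

# The result has the union of both indexes, but no label has a value
# on both sides, so every entry is NaN
result = a + c
print(result)
```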
A DataFrame is a two-dimensional labelled data structure whose columns can have different data types. You can think of it as a spreadsheet with rows and columns. We can also say that a DataFrame is a collection of Series that share the same index.
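A minimal sketch of building a DataFrame from Series. The column names "name" and "age" and their values are assumptions made for this example:

```python
import pandas as pd

# Two hypothetical Series that will become the columns
names = pd.Series(["Alice", "Bob", "Carol"])
ages = pd.Series([25, 30, 35])

# The dict keys become the column names; each Series becomes one column
df = pd.DataFrame({"name": names, "age": ages})

print(df)
print(df.dtypes)  # each column keeps its own data type
```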
As you can see, we passed a collection of Series to the DataFrame, specifying the column names.
Now you are done with the basic data structures of Pandas.
Before we head towards importing a dataset in pandas, note that a DataFrame has a head function:
df.head(), which returns the top 5 rows of the DataFrame. You can change this “5” to another number; for 10 rows, say, you use df.head(10).
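To see the difference, here is a small sketch on a made-up 20-row DataFrame (the column names "n" and "square" are only for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with 20 rows
df = pd.DataFrame({"n": range(20), "square": [i * i for i in range(20)]})

first_five = df.head()    # default: the first 5 rows
first_ten = df.head(10)   # pass a number to get that many rows

print(first_five)
print(len(first_ten))
```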
3. Different ways to Import a Dataset in Pandas
Now that we have completed the basics, note that real-world data comes in various formats. So we will now learn how to import data in different formats into a pandas DataFrame.
Just a recap: a Series holds one-dimensional data, and a DataFrame holds two-dimensional labelled data with columns.
Importing a CSV File.
df = pd.read_csv("https://URLGoesHere")
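If you do not have a CSV URL at hand, read_csv also accepts a local path or a file-like object. The sketch below assumes a tiny made-up CSV held in memory, so it runs without any file or network access:

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL
csv_text = "name,age\nAlice,25\nBob,30\n"

# StringIO wraps the text in a file-like object that read_csv can read
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```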
Importing an Excel File.
df = pd.read_excel("https://remote_url")
Similarly, we can import data in various other formats. Pandas provides different functions for this, such as:
read_clipboard, read_feather, read_html, read_json, read_sas, read_sql, read_table etc.
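As a rough sketch of two of these, the example below reads a JSON string with read_json and tab-separated text with read_table. The data is made up and kept in memory so the example is self-contained:

```python
import io

import pandas as pd

# Hypothetical JSON records; each object becomes one row
json_text = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
df_json = pd.read_json(io.StringIO(json_text))

# Hypothetical tab-separated text; read_table uses tab as its default separator
tsv_text = "name\tage\nAlice\t25\nBob\t30\n"
df_table = pd.read_table(io.StringIO(tsv_text))

print(df_json)
print(df_table)
```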
Congrats! You have covered the first chapter of the pandas series.
If you have any comments or suggestions, please drop a comment below.