In this final chapter, we will cover the following contents.
- Aggregating data
- Grouping data
- Joining different files
- Writing data to Files
- Becoming Pandas Master from here.
This chapter is highly important, Let’s take a deep breath and start.
If you want to aggregate the data using one or more data over a specified axis, Pandas have an agg() method for that. Let’s understand using the example of the iris dataset.
Suppose we have a requirement to aggregate the data on the basis of sum, minimum value or maximum value of data columns. The below code of agg() will help to achieve that.
You must be wondering, whats use of this function, why can’t we use describe method for the count, min, max, etc. You are absolutely correct, you can use. But in aggregation, there are tons of default methods which you can use. You can even create your custom aggregation function just like count, sum. Let’s see an example.
Custom Aggregation Functions in Pandas
The group by operation involves splitting the data, applying a function and combining the results.
The above diagram shows the splitting of the input data in groups, then applying the function on each of the group, then combining the results. There are classes from 1 to 4, having different marks. The splitting step involves splitting of each class into four groups i.e. class 1, class 2, … class 4. Then applying function, in the example we applied mean function. The final step involves combining the records.
Hands on Example with iris dataset.
Bonus: Using Custom Function in group by
Joining different files
In real-world problems, the data is rarely a single file. Generally, it is present in different files. We merge all files into a single unit based on our need, after that we continue our analysis.
In this section we will learn how to join different files in pandas.
Before divings, lets see the different types of joins:
- Inner Join/Inner Merge: In the inner merge, we keep the rows which are common between the left dataframe and right data frame.
- Left Join/Left Merge: In this merge, we keep the rows of the left dataframe, where there are no matching records in the right dataframe they got replaced by NAN.
- Right Join/Right Merge: In this merge, we keep the rows of the right dataframe, where there are no matching records in the left dataframe they got replaced by NAN.
- Full Outer Join/Outer Merge: In this merge, we keep rows of both dataframes, where there are no matching records, they got replaced by NAN.
You need to mention:
- Left Dataframe
- Right Dataframe
- On -> The column which is common in both dataframe, sometimes the column names of left dataframe and right dataframe are different in that case use. left_on for the left dataframe column name and right_on for the right dataframe column name.
- how -> Specify which join.
Deep Diving to Joins:
See in the example below, if we have a left join, we will keep all the rows of the left dataframe. If no respective records are found in right dataframe replace the same with NAN. Just like we have Class_ID as 1, there are no matching records in Right dataframe hence it will be replaced with NAN. Also, we will discard the right non-matched records, like Class ID 10 and 11.
See in the example below, if we have a right join, we will keep all the rows of the right dataframe. If no respective records are found in left dataframe replace the same with NAN. Just like we have Class_ID as 10,11, there are no matching records in left dataframe hence it will be replaced with NAN. Also, we will discard the right non-matched records, like Class ID 1.
In the example below, if we have an outer join, we will keep all the rows of the right dataframe and left dataframe. If no respective records are found in any of the dataframe replace the same with NAN.
Inner Merge (Default)
If we don’t specify how parameter the panda’s default will use inner merge, which will only consider the common records of both dataframes based on “on” parameter. Please see the image below:
Writing data to Files
Whether you are doing any data science project or participating in a kaggle competition, the writing of the data to the files is extremely important. There are plenty of options available in pandas for writing files to different extensions csv, excel, json, etc. Just write dataframe.to_csv for CSV to_excel for excel, etc.
Example for writing data to a JSON Format
Becoming Pandas Master from here:
Congrats! You have covered pandas and now you have an in-depth understanding of it. But this is not the end, mastering pandas involves an immense hands on. Since you are ready for the hands-on session.
Please find the below link of hands-on practicing pandas for machine learning
If you have any queries or suggestions. Please drop a comment below. Will happy to help you :).