Pandas Unleashed: Embarking on a Data Journey - II
In my last article, Pandas Unleashed: Embarking on a Data Journey, you have read about Pandas and some basic methods for data analysis. Now let’s move forward and learn more about it. Let's move forward and learn.
Data Cleaning
Drop:
The drop()
method in Pandas is used to remove rows or columns from a DataFrame based on specified labels (indexes or column names). It allows you to eliminate unwanted data or missing data from your DataFrame, effectively reducing its size or eliminating specific observations or features. We can pass down labels, axis, index, columns, level, inplace, errors
arguments.
1import pandas as pd
2
3data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
4df = pd.DataFrame(data)
5
6# Drop the row with label 1
7df_dropped = df.drop(1)
8
9# Drop the column B
10df_dropped = df.drop('B', axis=1)
Dropna:
The dropna()
method in Pandas is used to remove rows or columns with missing (NaN) values from a DataFrame. Handling missing data is a critical part of data preprocessing, and dropna()
is one of the methods available for this purpose. This method can take axis, how, subset, thresh, inplace
arguments.
1import pandas as pd
2import numpy as np
3
4data = {'A': [1, 2, np.nan, 4],
5 'B': [5, np.nan, 7, 8]}
6df = pd.DataFrame(data)
7
8# Drop rows with any NaN values (default behavior)
9cleaned_df = df.dropna()
Fillna:
The fillna()
method in Pandas is used to fill missing (NaN) values in a DataFrame or Series with specified values. This method is handy for data cleaning and preparation because it allows you to handle missing data in various ways, such as filling with a constant value, filling with a calculated value, or forward/backward filling. We can pass down value, method, axis, inplace, limit, downcast
arguments to fillna
.
1import pandas as pd
2import numpy as np
3
4data = {'A': [1, 2, np.nan, 4, 5],
5 'B': [np.nan, 2, 3, np.nan, 5]}
6
7df = pd.DataFrame(data)
8
9# Fill missing values with 0
10df_filled = df.fillna(6)
Duplicated:
The duplicated()
method in Pandas is used to identify duplicate rows in a DataFrame. It helps you find rows that have identical values across all columns or a specified subset of columns. This method returns a Boolean Series where each entry indicates whether the corresponding row is a duplicate or not. This takes only two arguments subset ,keep.
1import pandas as pd
2
3data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
4 'Age': [25, 30, 25, 35, 30]}
5
6df = pd.DataFrame(data)
7
8# Check for duplicate rows across all columns
9df.duplicated()
Drop Duplicates:
The drop_duplicates()
method in Pandas is used to remove duplicate rows from a DataFrame. Duplicate rows are rows that have identical values in all columns or a subset of columns, depending on how you specify the subset parameter. This method is helpful for data cleaning and ensuring that your dataset contains only unique records. The four arguments it can take are subset, keep, inplace and ignore_index
.
1import pandas as pd
2
3# Creating a DataFrame with duplicate rows
4data = {'A': [1, 2, 2, 3, 4],
5 'B': ['x', 'y', 'y', 'z', 'x']}
6df = pd.DataFrame(data)
7
8# Removing duplicates based on both columns 'A' and 'B'
9df_cleaned = df.drop_duplicates()
10
11# Display the cleaned DataFrame
12print("\nDataFrame after removing duplicates:")
13print(df_cleaned)
Replace:
The replace()
method is used to replace values within a DataFrame or Series with other values. It is particularly useful when you need to replace specific values with new values, either based on a condition or by providing a mapping dictionary. It can take to_replace, value, limit, regex, method
and inplace
arguments.
1import pandas as pd
2
3data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
4 'Grade': ['A', 'B', 'C', 'B', 'A']}
5df = pd.DataFrame(data)
6
7# Replace 'A' with 'Excellent' and 'B' with 'Good' and 'C' with 'Fair'
8df['Grade'].replace({'A': 'Excellent', 'B': 'Good', 'C':'Fair'}, inplace=True)
9
10print(df)
Rename:
The rename()
method is used to rename columns or index labels in a DataFrame or Series. It allows you to change the names of one or more columns or index levels. It can take mapper, index, columns, axis, inplace, copy, level
and errors
arguments.
1import pandas as pd
2
3data = {'First Name': ['Alice', 'Bob', 'Charlie'],
4 'Last Name': ['Smith', 'Johnson', 'Brown']}
5df = pd.DataFrame(data)
6
7# Rename 'First Name' to 'First' and 'Last Name' to 'Last'
8df.rename(columns={'First Name': 'First', 'Last Name': 'Last'}, inplace=True)
9
10print(df)
Data Sorting
Sort Values:
The sort_values()
method is used to sort a DataFrame by one or more columns. You can specify the column(s) to sort by and the sorting order (ascending or descending). We can provide by, axis, ascending, inplace, kind
and na_position
as arguments.
1import pandas as pd
2
3# Create a DataFrame
4data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
5 'Age': [30, 25, 35, 28]}
6df = pd.DataFrame(data)
7
8# Sort by 'Age' column in ascending order
9df_sorted = df.sort_values(by='Age', ascending=True)
10
11# Sort by 'Age' column in descending order
12df_sorted_desc = df.sort_values(by='Age', ascending=False)
Sort Index:
The sort_index()
method is very much similart to `sort_values` method but it sorts a DataFrame or Series based on its index instead of values. We can provide axis, level, ascending, inplace, kind, na_position, sort_remaining, ignore_index
and key
as arguments.
1import pandas as pd
2
3# Create a DataFrame
4data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
5 'Age': [30, 25, 35, 28]}
6df = pd.DataFrame(data)
7
8# Sorting a DataFrame by index in ascending order
9df_sorted_index = df.sort_index()
10
11# Sorting a Series by index in descending order
12series_sorted_index_desc = df['Age'].sort_index(ascending=False)
Data Selecting & Filtering
A dataframe is made up of many columns and rows. We work with all the data in but sometimes it is necessary to work on subset of columns, rows or take out a small portion of data from our whole dataset. These all can be done using simple selection and filtering.
Selecting Columns:
Selecting columns out of dataframe is quite simple. The most important part for columns selection is the name of column/columns you are going to select.
1import pandas as pd
2
3data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
4df = pd.DataFrame(data)
5
6# Select one column
7df['A']
8
9# Select multiple columns
10df[['B', 'C']]
Selecting Rows:
Selecting rows from dataframe is similar to slicing in python. You just have to select/slice the number out of dataframe. We can also use loc
& iloc
.
1import pandas as pd
2
3data = {'A': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
4df = pd.DataFrame(data)
5
6# Selecting rows from 1 to 3
7df[1:4]
Loc & ILoc:
The pandas loc
method is able to access a group of rows and columns by label(s) or a boolean array in the given dataframe. It has no parameters but it will take row and column values to take the data out of it.
The iloc
method is almost same as the loc method but the main difference is loc method filters based on value and iloc
method filter based on index position.
1import pandas as pd
2
3employees = [('Stuti', 28),
4 ('Saumya', 32),
5 ('Aaditya', 25),
6 ('Saumya', 32),
7 ('Saumya', 32),
8 ('Saumya', 32),
9 ('Aaditya', 40),
10 ('Seema', 32)
11 ]
12
13df = pd.DataFrame(employees, columns =['Name', 'Age'])
14
15df.set_index("Name",inplace = True)
16
17# To select multiple rows using loc
18df.loc[["Stuti", "Seema"]]
19
20# To select single row using iloc
21df.iloc[4]
Filtering Data:
Data filtering is one of the important task in data analysis. It can sometimes lead to the better insights and can help to find out where the issues are or which area is to look out for. We can filter out data using simple data selection or using Query
method.
1import pandas as pd
2
3employees = [('Stuti', 28),
4 ('Saumya', 32),
5 ('Aaditya', 25),
6 ('Saumya', 32),
7 ('Saumya', 32),
8 ('Saumya', 32),
9 ('Aaditya', 40),
10 ('Seema', 32)
11 ]
12
13df = pd.DataFrame(employees, columns =['Name', 'Age'])
14
15df.set_index("Name", inplace = True)
16
17# Filter out employees whose age is less than 30.
18df[df['Age' < 30]]
Query:
Pandas query method let you query your data based on a condtion/conditions you provide. This method will only work if your column value has no blank space. The expr
argument is mandatory. Using the above data, let's use query method.
1# Filter employees whose age is greater than 30.
2df.query('Age > 30')
Data Manipulation
Adding New Column:
Bracket Notation - Bracket Notation is one of the most used way while working with pandas and data analysis. We can also use it in adding a new column in the exsisting dataframe.
1import pandas as pd
2
3data = {'Name': ['Alice', 'Bob', 'Aira', 'Charlie', 'Brian'],
4 'Age': [25, 30, 25, 35, 32]}
5
6df = pd.DataFrame(data)
7
8# Inserting City column
9df['City'] = ['Delhi', 'Mumbai', 'Gujrat', 'Newyork', 'Miami']
Insert - The insert method not just insert a new column in the dataframe. It also give us power to insert the columns at a particular You can insert a new column at a specific location in the DataFrame using the .insert()
method.
1import pandas as pd
2
3data = {'Name': ['Alice', 'Bob', 'Aira', 'Charlie', 'Brian'],
4 'Age': [25, 30, 25, 35, 32]}
5
6df = pd.DataFrame(data)
7
8# Inserting City Column at Index 1
9df.insert(loc=2, column='City', value=['Delhi', 'Mumbai', 'Gujrat', 'Newyork', 'Miami'])
Assign - The assign method is similar to bracket notation we used. It adds new column but it also creates a whole new dataframe rather than doing any kind of modification in the old dataframe.
1import pandas as pd
2
3data = {'Name': ['Alice', 'Bob', 'Aira', 'Charlie', 'Brian'],
4 'Age': [25, 30, 25, 35, 32]}
5
6df = pd.DataFrame(data)
7
8# Inserting Level Column
9df.assign('Level' = [5, 3, 4, 5, 3])
Data Aggregation
Sum & Mean: Using Sum or Mean on a column of a dataframe is the most basic way of aggregating data. Since this is most basic way, it does not ever lead to correct insights.
1import pandas as pd
2
3data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
4df = pd.DataFrame(data)
5
6# Using Sum
7df['A'].sum()
8
9# Using Mean
10df['B'].mean()
GroupBy: This is one of the most common and useful aggregation method used. GroupBy method let us use the power of aggregation on whole dataframe based on column/columns using different functionality such as sum, mean, first
etc.
1import pandas as pd
2
3data = {
4 'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
5 'Value': [10, 20, 15, 25, 30, 35]
6}
7
8df = pd.DataFrame(data)
9
10group = df.groupby('Category')
11
12# Aggregating based on Count
13group['Value'].count()
Pivot Table: Pivot Table in pandas is very similar to the pivot table we used in excel. It help us to present our data in complete different way. It also give us multiple aggregate functions for data aggregation.
1import pandas as pd
2
3data = {
4 'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
5 'Product': ['A', 'B', 'A', 'B'],
6 'Sales': [100, 150, 200, 180]
7}
8
9df = pd.DataFrame(data)
10
11pivot_table = pd.pivot_table(df, values='Sales', index='Date', columns='Product', aggfunc='sum')
12print(pivot_table)
Conclusion
In this part of our Pandas exploration, we've covered from Data Cleaning to Data Selection. But hold on to your seats, because in Part 3, we'll dive deeper into Pandas techniques and unveil some more.Are you ready to take your data analysis skills to the next level? Join us in nex part, where we'll uncover the full potential of Pandas!In the meantime, feel free to leave your thoughts, questions, or feedback in the comments below. Stay tuned for Part 3, coming soon!