
Pandas Unleashed: Embarking on a Data Journey

Introduction

We are all living in a world of data. We create more of it every second, but what is this data for? What do we do with it all? This data is used to make something better than before: we use traffic data to reduce congestion, accident data to lower accident rates, and sales data to increase sales. We use data in almost everything. But how do we achieve that? How do we turn data into results that improve the future?
This is where Data Analysis comes into the picture. Data Analysis is the process of extracting results from data and using them to improve future outcomes. All of the data goes through the analysis process to get the best out of it. Without data analysis, all this data would just be a huge file without any use.
(Please refer to Data Analysis: Turning Information into Insights to learn more about Data Analysis.)

How do we do Data Analysis?

In the realm of data analysis, the ability to manipulate, clean, and transform data efficiently is paramount. Pandas is a powerful Python library that has revolutionized the way data analysts and scientists handle data. Here, we will embark on a journey to explore the depths of Pandas, understanding its core functionalities, data structures, and techniques. Whether you’re a beginner or a seasoned analyst, Pandas offers a rich toolkit to tackle diverse data challenges.

About Pandas

Pandas is an open-source Python library that provides high-performance data manipulation and analysis. It was created by Wes McKinney in 2008 while he was a researcher at AQR Capital Management. Pandas allows us to analyze large datasets and draw conclusions based on statistical theory. But first, we have to install it on our system and import it.

First, install pandas with pip (or with conda, if you prefer), then import it.
bash
# Installing Pandas on the system

# Using pip
pip install pandas

# Using conda
conda install pandas

Now that you have pandas on your system, let's import it.

python
# Importing Pandas as pd
import pandas as pd

Now that we have imported pandas, we can use its power.
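
As a quick optional sanity check, you can print the version that was installed:

python
# Confirm pandas is available and show which version is installed
print(pd.__version__)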

Data Structures

Pandas has two primary data structures: Series and DataFrame. A Series is a one-dimensional, array- or list-like structure, whereas a DataFrame is a two-dimensional, table-like structure. Pandas has many built-in functions that make it very helpful in Data Analysis and almost a one-stop solution for basic analysis.

  • Creating Pandas Series

From a list or NumPy array

python
import pandas as pd
import numpy as np

# Creating a Series from a list
data_list = [10, 20, 30, 40, 50]
series_from_list = pd.Series(data_list)

# Creating a Series from a NumPy array
data_array = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
series_from_array = pd.Series(data_array)

From a dictionary

python
data_dict = {'A': 100, 'B': 200, 'C': 300}

series_from_dict = pd.Series(data_dict)

Using the `pd.Series()` constructor

python
series = pd.Series([1, 2, 3, 4, 5])
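
A Series can also carry explicit labels through the index parameter. The sketch below is only illustrative; the labels 'a' through 'e' and the name labeled_series are made up for the example.

python
# Creating a Series with an explicit index of labels
labeled_series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])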

  • Creating Pandas DataFrame

From a dictionary of lists or NumPy arrays

python
data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
             'Age': [25, 30, 35]}

df_from_dict = pd.DataFrame(data_dict)

From a list of dictionaries

python
data_list = [{'Name': 'Alice', 'Age': 25},
             {'Name': 'Bob', 'Age': 30},
             {'Name': 'Charlie', 'Age': 35}]

df_from_list_of_dicts = pd.DataFrame(data_list)

Using the `pd.DataFrame()` constructor

python
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

From a NumPy array

python
data_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

df_from_array = pd.DataFrame(data_array, columns=['A', 'B', 'C'])

From a list of Series

python
series1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])

series2 = pd.Series([4, 5, 6], index=['A', 'B', 'C'])

# Each Series becomes one row; the shared index supplies the column names
df_from_series = pd.DataFrame([series1, series2])

Reading the data

Pandas gives you a lot of options to read data from different file formats. You can read data from a CSV file, a TSV file, an Excel file, and many more. Let's look at the list of formats pandas supports.

  • CSV (Comma Separated Values)
  • FWF (Fixed Width Text File)
  • JSON (JavaScript Object Notation)
  • HTML (HyperText Markup Language)
  • LaTeX
  • XML (Extensible Markup Language)
  • Local Clipboard
  • XLSX/XLS (Microsoft Excel)
  • Open Document
  • HDF5 Format (Hierarchical Data Format)
  • Feather Format
  • Parquet Format
  • ORC (Optimized Row Columnar)
  • Stata
  • SAS (Statistical Analysis System)
  • SPSS (Statistical Package for the Social Sciences)
  • Python Pickle Format
  • SQL (Structured Query Language)
  • Google BigQuery

You can read all of the above-mentioned data formats with Pandas. Read More.

python
# Read Basic CSV Format
csv_data = pd.read_csv('Your File Path')

# Read XLSX Format
xlsx_data = pd.read_excel('Your File Path')
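
The other formats listed above follow the same read_* pattern. Here is a short sketch for two of them; the file paths are placeholders, and read_parquet assumes a Parquet engine such as pyarrow is installed.

python
# Read JSON Format
json_data = pd.read_json('Your File Path')

# Read Parquet Format (requires a Parquet engine such as pyarrow)
parquet_data = pd.read_parquet('Your File Path')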

Data Inspection

We have different methods to check out the data, its shape, size, and many other details.

Shape: The shape attribute of a DataFrame gives its total number of rows and columns as a tuple of the form (rows, columns).

python
csv_data.shape
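
As a quick illustration with made-up data, shape returns a plain Python tuple:

python
# A small DataFrame with 3 rows and 2 columns
demo = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                     'Age': [25, 30, 35]})
demo.shape  # (3, 2) -> 3 rows, 2 columns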

Head: The head method works on any DataFrame or Series. It gives you the first rows of your DataFrame or Series. It takes one argument, the number of rows you want to see; the default value is 5.

python
# Default Version
csv_data.head()

# With Argument
csv_data.head(15)

Tail: The tail method is the counterpart of head, with the same functionality applied to the end of the data. It also takes the number of rows as its only argument, and its default value is also 5.

python
# Default Version
csv_data.tail()

# With Argument
csv_data.tail(12)

Sample: The sample method also lets you view a fraction of your data, but it shows you random data points, whereas head shows the top n rows and tail shows the bottom n rows. This method is useful when you want to spot irregularities in your data at a glance. It has 7 parameters, none of which are mandatory, and it returns 1 random row by default.

python
# Default Version
csv_data.sample()

# With parameter "n". It takes an integer value and can't be passed together with "frac".
csv_data.sample(n=15)

# With parameter "frac". It takes a float value and can't be passed together with "n".
csv_data.sample(frac=0.5)
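
One more parameter worth knowing about: if you want the random selection to be reproducible, sample also accepts a random_state seed. A quick sketch:

python
# Passing random_state makes the random sample reproducible across runs
csv_data.sample(n=5, random_state=42)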

Describe: The describe() method generates summary statistics for the numerical columns in a DataFrame. It provides key statistics such as count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum. This method helps you quickly understand the distribution and central tendencies of your data.

python
# Creating a sample DataFrame
data = {'Age': [25, 30, 35, 40, 45],
        'Salary': [50000, 60000, 75000, 90000, 80000]}
df = pd.DataFrame(data)

# Using describe() to generate summary statistics
summary_stats = df.describe()
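
By default, describe() restricts itself to numeric columns when a DataFrame mixes types. Passing include='all' pulls every column into the summary; on mixed data you would also see count, unique, top, and freq rows (a quick sketch using the same df, where both columns happen to be numeric):

python
# Include non-numeric columns in the summary as well
summary_all = df.describe(include='all')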

Info: The info() method provides a concise summary of the DataFrame, including the data types of each column and the number of non-null values. This is particularly useful for identifying missing data and understanding the structure of your DataFrame.

python
# Creating a sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, None, 40, 45],
        'City': ['NY', 'LA', 'SF', 'CHI', 'MIA']}
df = pd.DataFrame(data)

# Using info() to print DataFrame information (info() prints its summary and returns None)
df.info()
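
If you specifically want a per-column count of missing values, a common companion check (just a sketch using the same df) is:

python
# Count missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)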

Conclusion

In this first part of our Pandas exploration, we've covered the fundamentals, from creating Pandas Series and DataFrames to performing basic inspection operations. You've just scratched the surface of what this powerful library can do. But hold on to your seats, because in Part 2, we'll dive deeper into Pandas techniques and unveil some of its hidden gems.

Are you ready to take your data analysis skills to the next level? Join us in Part 2, where we'll uncover the full potential of Pandas!

In the meantime, feel free to leave your thoughts, questions, or feedback in the comments below. 

Stay tuned for Part 2, coming soon!
