Python Pandas

What is Pandas?

Pandas is Python's fundamental data analysis library. It provides data structures for working with table-like data and is built on top of NumPy. Pandas is open source and easy to use for data analysis. It can handle the following kinds of data and tasks:

Table-like data, for example an SQL table or an Excel spreadsheet
Time series data in ordered or unordered format
Arbitrary matrix data
Slicing, Indexing and Subsetting using labels
Handling missing data (NaN)
Group by functions
Munging and cleaning data, analysing data, plotting and tabular representation.

Pandas vs NumPy

Pandas is built on top of NumPy
Pandas has high-level data structures (DataFrame). NumPy has low-level data structures (numpy.ndarray)
Pandas is great for handling tabular data and for data alignment, group by, merge and join operations. NumPy is well suited to mathematical operations on arrays.
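
As a minimal sketch of the difference, the snippet below contrasts NumPy's position-based arithmetic with the label-based alignment of Pandas. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy pairs elements by position
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
positional_sum = a + b

# Pandas pairs elements by index label, whatever their order
s1 = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s2 = pd.Series([10, 20, 30], index=['z', 'y', 'x'])
aligned_sum = s1 + s2

print(positional_sum)
print(aligned_sum)
```

Here the value labelled 'x' in s1 is added to the value labelled 'x' in s2, even though they sit at opposite ends of the two Series.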

Pandas Series and DataFrames

Series and DataFrames are the most important objects in Pandas.

Pandas Series

Series is a one-dimensional labelled array.

All values in a Series share one data type (dtype)
The dtype can be integer, float, string, Python object etc.; values of mixed types are stored with the generic object dtype
Data must always have an index
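
These properties can be checked directly. The following is a small sketch with made-up values; it only inspects the dtype and index attributes of a Series.

```python
import pandas as pd

# A Series of integers gets an integer dtype
ints = pd.Series([1, 2, 3])

# A Series with mixed value types falls back to the generic object dtype
mixed = pd.Series([1, 'two', 3.0])

print(ints.dtype)
print(mixed.dtype)   # object

# Every Series has an index, even when none is given explicitly
print(ints.index)
```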

Create Pandas Series

There are many ways of creating a Pandas Series. Below, we will go through some of them.

From a list
From a dictionary
From numpy.ndarray
From a file

Convert a Python list to Pandas Series

First, we import the pandas library into our programme with import pandas as pd
Then we convert the Python list to a Pandas Series by passing the list to the pd.Series constructor

# import pandas library as pd
import pandas as pd

# Create a list
days = [31, 28, 31, 30, 31, 30]

# convert the Python list to a Pandas Series using pd.Series
series_from_list = pd.Series(days)

# show the result
print(series_from_list)

The above code will produce this output

0 31
1 28
2 31
3 30
4 31
5 30
dtype: int64

As you can see, an index has been assigned to the data automatically. We can also specify our own index, as in the example below

# import pandas library as pd
import pandas as pd

# Create a list
days = [31, 28, 31, 30, 31, 30]
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

# convert the Python list to a Pandas Series using pd.Series, with months as the index
series_from_list = pd.Series(days, index=months)
# the positional form also works: pd.Series(days, months)

# show the result
print(series_from_list)

The above code will produce this output

Jan 31
Feb 28
Mar 31
Apr 30
May 31
Jun 30
dtype: int64
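
With a custom index in place, values can be selected by label as well as by position. The following is a minimal sketch using the same days and months lists; the names feb, first and spring are only illustrative.

```python
import pandas as pd

days = [31, 28, 31, 30, 31, 30]
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
series_from_list = pd.Series(days, index=months)

# Select a single value by its label
feb = series_from_list['Feb']

# Select by integer position with iloc
first = series_from_list.iloc[0]

# Label-based slices with loc include both endpoints
spring = series_from_list.loc['Mar':'May']

print(feb)
print(first)
print(spring)
```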

Convert a Python dictionary to Pandas Series

# import pandas library as pd
import pandas as pd

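# Create a dictionary (named days_by_month to avoid shadowing the built-in dict)
days_by_month = {'Jan' : 31, 'Feb' : 28, 'Mar' : 31, 'Apr' : 30, 'May' : 31, 'Jun' : 30}

# convert the Python dictionary to a Pandas Series using pd.Series
# the dictionary keys become the index labels
series_from_dict = pd.Series(days_by_month)

# show the result (a one-dimensional labelled array)
print(series_from_dict)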

The above code will produce this output

Jan 31
Feb 28
Mar 31
Apr 30
May 31
Jun 30
dtype: int64
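
When an index is passed together with a dictionary, only the listed keys are used, in the order given, and labels that are missing from the dictionary are filled with NaN. The following sketch illustrates this; the label 'Jul' is deliberately absent from the dictionary, and the variable names are illustrative.

```python
import pandas as pd

days_by_month = {'Jan': 31, 'Feb': 28, 'Mar': 31}

# Select and reorder keys through the index argument;
# 'Jul' is not a key of the dictionary, so its value becomes NaN
selected = pd.Series(days_by_month, index=['Feb', 'Jan', 'Jul'])

print(selected)
print(selected.isna())
```

Because NaN is a floating-point value, the resulting Series has a float dtype rather than an integer one.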

Convert a numpy array to Pandas Series

# import numpy library as np
import numpy as np

# import pandas library as pd
import pandas as pd

# np.arange creates an array filled with a sequence of numbers, excluding the stop value
arange_array = np.arange(start = 1, stop = 20, step = 2)
print(arange_array)
print()

# Create pandas series from numpy array
series_from_numpy_array = pd.Series(arange_array)
print(series_from_numpy_array)

The above code will produce this output

[ 1 3 5 7 9 11 13 15 17 19]

0 1
1 3
2 5
3 7
4 9
5 11
6 13
7 15
8 17
9 19
dtype: int64
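
The list of creation methods earlier also mentions files. As a minimal sketch, assuming a small CSV with the hypothetical columns month and days (held in memory with io.StringIO so that the example is self-contained), a Series can be obtained by reading the file into a DataFrame and selecting one column:

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical content)
csv_text = "month,days\nJan,31\nFeb,28\nMar,31\n"

# read_csv returns a DataFrame; the month column is used as the index
frame = pd.read_csv(io.StringIO(csv_text), index_col='month')

# Selecting a single column gives a Series
series_from_file = frame['days']

print(series_from_file)
```

With a real file, the io.StringIO object would be replaced by the file path.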

Vectorised operations on Pandas Series

Just like NumPy arrays, Pandas Series support vectorised operations. For example, we can add a number to every element of a Series, or multiply or divide every element by a number, as in the example below

# import numpy library as np
import numpy as np

# import pandas library as pd
import pandas as pd

# Create a list
days = [31, 28, 31, 30, 31, 30]

# convert the Python list to a Pandas Series using pd.Series
series_from_list = pd.Series(days)

# show the result
print(series_from_list)
print()
print(series_from_list * 2)
print()
print(1 + series_from_list)
print()
print(np.square(series_from_list))
print()
print(np.mean(series_from_list))

The above code will produce this output

0 31
1 28
2 31
3 30
4 31
5 30
dtype: int64

0 62
1 56
2 62
3 60
4 62
5 60
dtype: int64

0 32
1 29
2 32
3 31
4 32
5 31
dtype: int64

0 961
1 784
2 961
3 900
4 961
5 900
dtype: int64

30.166666666666668
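
Division and comparisons are vectorised in the same way. The following sketch reuses the days list; the names weeks, long_months and mean_days are illustrative.

```python
import pandas as pd

days = pd.Series([31, 28, 31, 30, 31, 30])

# Divide every element by 7; the result is a Series of floats
weeks = days / 7

# A comparison yields a boolean Series, which can be used as a filter
long_months = days[days > 30]

# Series also provide their own aggregation methods
mean_days = days.mean()

print(weeks)
print(long_months)
print(mean_days)
```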

Pandas DataFrames

DataFrame is a two-dimensional labelled array.

Columns can be heterogeneous (each column can have a different data type), like a spreadsheet or SQL table
Data has two sets of labels: a row index and column names

Create Pandas DataFrame

There are many ways of creating a Pandas DataFrame. Below, we will go through some of them. We can create a Pandas DataFrame:

From a list
From a dictionary
From a Series
From 2D numpy.ndarray
From a file such as a text, Excel or CSV file, or from a database
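
The in-memory options in this list can be sketched as follows. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

months = ['Jan', 'Feb', 'Mar']
days = [31, 28, 31]

# From a list of rows, with column names given explicitly
df_from_list = pd.DataFrame([['Jan', 31], ['Feb', 28], ['Mar', 31]],
                            columns=['month', 'days'])

# From a dictionary that maps column names to column values
df_from_dict = pd.DataFrame({'month': months, 'days': days})

# From a Series; the Series name becomes the column name
df_from_series = pd.Series(days, index=months, name='days').to_frame()

# From a 2D NumPy array
df_from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

print(df_from_list)
print(df_from_dict)
print(df_from_series)
print(df_from_array)
```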

Import data from file to Pandas DataFrame

Pandas has many functions for reading data from external files. In the example below, we use the read_excel function to read data from an external Excel file that contains employee attrition data.

The columns attribute will show us all the column names.

The index attribute will show us all the index labels (row labels).

The values attribute will show us all the values.

# import pandas library as pd
import pandas as pd

# location of the file
file_path = 'https://github.com/apischdo/Artificial-Intelligence-and-Data-Science/blob/master/WA_Fn-UseC_-HR-Employee-Attrition.xlsx?raw=true'

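# read from the file using the read_excel function
# (reading .xlsx files requires an Excel engine such as openpyxl to be installed)
# simplest form: file_data = pd.read_excel(file_path)
file_data = pd.read_excel(io=file_path, sheet_name=0, index_col='EmployeeNumber')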

# show column names
print(file_data.columns)

The above code will produce this output

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
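
The index and values attributes can be inspected in the same way. The following is a minimal sketch on a small hand-made DataFrame; the rows are illustrative and are not taken from the attrition file.

```python
import pandas as pd

# Hypothetical rows, with EmployeeNumber used as the index
df = pd.DataFrame({
    'EmployeeNumber': [1, 2, 4],
    'Age': [41, 49, 37],
    'Department': ['Sales', 'Research', 'Research'],
}).set_index('EmployeeNumber')

print(df.columns)   # column names
print(df.index)     # row labels
print(df.values)    # the data as a 2D NumPy array
```

Note that EmployeeNumber no longer appears among the columns once it has been set as the index, which is also why it is absent from the column list printed for the attrition file.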
