What is Pandas?
Pandas is Python’s fundamental data analysis library. It provides data structures for working with table-like data and is built on top of NumPy. Pandas is open source and easy to use for data analysis. Pandas can deal with
- Table-like data, for example an SQL table or an Excel spreadsheet
- Time series data in ordered or unordered format
- Arbitrary matrix data
- Slicing, Indexing and Subsetting using labels
- Handling missing data (NaN)
- Group by functions
- Munging and cleaning data, analysing data, plotting and tabular representation.
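Two of the capabilities listed above, handling missing data and group by, can be sketched in a few lines. The table, its column names and its values are made up for illustration and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one temperature reading is missing (NaN)
readings = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "temp_c": [4.0, np.nan, 18.0, 20.0],
})

# Count the missing values in the temp_c column
missing_count = readings["temp_c"].isna().sum()

# Group the rows by city and average each group; mean() skips NaN by default
mean_by_city = readings.groupby("city")["temp_c"].mean()

print(missing_count)
print(mean_by_city)
```

Because the missing reading is skipped rather than treated as zero, the Oslo average is based on the single valid value.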
Pandas vs NumPy
- Pandas is built on top of NumPy
- Pandas has high-level data structures (DataFrame). NumPy has low-level data structures (numpy.ndarray)
- Pandas is great for handling tabular data and for data alignment, group by, merge, join and similar operations. NumPy is great for mathematical array operations.
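The data alignment point can be illustrated with a small sketch. The arrays and labels below are made up for illustration: NumPy adds values by position, while Pandas first matches values by their index labels.

```python
import numpy as np
import pandas as pd

# NumPy adds element by element, purely by position
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
positional_sum = a + b

# Pandas matches values by index label before adding,
# so the order of the labels in each Series does not matter
s1 = pd.Series([1, 2, 3], index=["x", "y", "z"])
s2 = pd.Series([10, 20, 30], index=["z", "y", "x"])
aligned_sum = s1 + s2

print(positional_sum)
print(aligned_sum)
```

Here the value labelled x in s1 (1) is added to the value labelled x in s2 (30), even though they sit at different positions.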
Pandas Series and DataFrames
Series and DataFrames are the most important objects in Pandas.
Pandas Series
Series is a one-dimensional labelled array.
- Data is homogeneous: a Series has a single data type (dtype)
- That type can be integer, float, string, a generic Python object, etc.
- Data always has an index; if none is given, pandas assigns integer labels starting at 0
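A minimal sketch of these three properties, using made-up values: a Series built from integers gets an integer dtype, a Series built from mixed values falls back to the generic object dtype, and both receive a default index.

```python
import pandas as pd

# All integers: the Series gets a single integer dtype
ints = pd.Series([1, 2, 3])

# Mixed values: pandas stores them under the generic object dtype
mixed = pd.Series([1, "two", 3.0])

print(ints.dtype)
print(mixed.dtype)

# No index was given, so integer labels starting at 0 were assigned
print(list(ints.index))
```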
Create Pandas Series
There are many ways to create a Pandas Series. Below, we will go through some of them.
- From a list
- From a dictionary
- From numpy.ndarray
- From a file
Convert a Python list to Pandas Series
- First, we import the pandas library into our programme with import pandas as pd
- Then we convert the Python list to a Pandas Series using the pd.Series(name_of_list) constructor
    # import pandas library as pd
    import pandas as pd

    # Create a list
    days = [31, 28, 31, 30, 31, 30]

    # convert the Python list to a Pandas Series using pd.Series
    series_from_list = pd.Series(days)

    # show the result
    print(series_from_list)
The above code will produce this output
    0    31
    1    28
    2    31
    3    30
    4    31
    5    30
    dtype: int64
As you can see, an index has automatically been assigned to the data. We can also specify our own index, as in the example below.
    # import pandas library as pd
    import pandas as pd

    # Create a list of values and a list of index labels
    days = [31, 28, 31, 30, 31, 30]
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

    # convert the Python lists to a Pandas Series using pd.Series
    # (the positional form pd.Series(days, months) is equivalent)
    series_from_list = pd.Series(days, index=months)

    # show the result
    print(series_from_list)
The above code will produce this output
    Jan    31
    Feb    28
    Mar    31
    Apr    30
    May    31
    Jun    30
    dtype: int64
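Once a Series has its own labels, values can be selected by label, which is the "slicing, indexing and subsetting using labels" mentioned earlier. The following is a minimal sketch reusing the same month data; the variable names are chosen for illustration.

```python
import pandas as pd

days = pd.Series([31, 28, 31, 30, 31, 30],
                 index=["Jan", "Feb", "Mar", "Apr", "May", "Jun"])

# Select a single value by its label
feb = days.loc["Feb"]

# Slice by label; unlike positional slicing, the end label is included
spring = days.loc["Mar":"May"]

# Subset with a boolean condition
long_months = days[days == 31]

print(feb)
print(spring)
print(long_months)
```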
Convert a Python dictionary to Pandas Series
    # import pandas library as pd
    import pandas as pd

    # Create a dictionary (avoid naming it dict, which would shadow the built-in type)
    days_in_month = {'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30, 'May': 31, 'Jun': 30}

    # convert the Python dictionary to a Pandas Series using pd.Series
    series_from_dict = pd.Series(days_in_month)

    # The result is a one-dimensional labelled array; the dictionary keys become the index
    print(series_from_dict)
The above code will produce this output
    Jan    31
    Feb    28
    Mar    31
    Apr    30
    May    31
    Jun    30
    dtype: int64
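When a dictionary is combined with an explicit index, pandas looks up each index label among the dictionary keys. The sketch below uses a shortened, made-up dictionary to show that the index controls the order and that a label with no matching key becomes NaN.

```python
import pandas as pd

days_in_month = {"Jan": 31, "Feb": 28, "Mar": 31}

# The index selects and orders the keys; 'Dec' has no matching key
selected = pd.Series(days_in_month, index=["Mar", "Jan", "Dec"])

print(selected)
```

Because NaN is a floating-point value, the resulting Series has a float dtype even though the dictionary held integers.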
Convert a numpy array to Pandas Series
    # import numpy library as np
    import numpy as np

    # import pandas library as pd
    import pandas as pd

    # arange creates an array holding a sequence of numbers; the stop value is excluded
    arange_array = np.arange(start=1, stop=20, step=2)
    print(arange_array)
    print()

    # Create a Pandas Series from the NumPy array
    series_from_numpy_array = pd.Series(arange_array)
    print(series_from_numpy_array)
The above code will produce this output
    [ 1  3  5  7  9 11 13 15 17 19]

    0     1
    1     3
    2     5
    3     7
    4     9
    5    11
    6    13
    7    15
    8    17
    9    19
    dtype: int64
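The list of creation methods also mentions files. One possible approach, sketched here under assumptions, is to read the file into a table with read_csv and then pick one column, which is itself a Series. The CSV text and its column names are hypothetical, and io.StringIO stands in for a real file so the sketch needs nothing on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk
csv_text = "month,days\nJan,31\nFeb,28\nMar,31\n"

# Read the CSV into a table, using the month column as the index
table = pd.read_csv(io.StringIO(csv_text), index_col="month")

# Selecting a single column gives a Series
series_from_file = table["days"]

print(series_from_file)
```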
Vectorised operations on Pandas Series
Just like a NumPy array, a Pandas Series supports vectorised operations. For example, we can add a number to every element of a Series, or multiply or divide every element by a number, as in the example below.
    # import numpy library as np
    import numpy as np

    # import pandas library as pd
    import pandas as pd

    # Create a list
    days = [31, 28, 31, 30, 31, 30]

    # convert the Python list to a Pandas Series using pd.Series
    series_from_list = pd.Series(days)

    # show the original Series, then the results of vectorised operations
    print(series_from_list)
    print()
    print(series_from_list * 2)
    print()
    print(1 + series_from_list)
    print()
    print(np.square(series_from_list))
    print()
    print(np.mean(series_from_list))
The above code will produce this output
    0    31
    1    28
    2    31
    3    30
    4    31
    5    30
    dtype: int64

    0    62
    1    56
    2    62
    3    60
    4    62
    5    60
    dtype: int64

    0    32
    1    29
    2    32
    3    31
    4    32
    5    31
    dtype: int64

    0    961
    1    784
    2    961
    3    900
    4    961
    5    900
    dtype: int64

    30.166666666666668
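A Series also has its own methods for common reductions, so NumPy functions are not strictly needed for them. This is a small sketch using the same days data with the sum() and mean() methods; the variable names are chosen for illustration.

```python
import pandas as pd

days = pd.Series([31, 28, 31, 30, 31, 30])

# Element-wise arithmetic: convert days to hours
hours = days * 24

# Reductions are available as Series methods
total_days = days.sum()
mean_days = days.mean()

# A reduction result can be used in a further vectorised operation
difference_from_mean = days - mean_days

print(hours)
print(total_days)
print(mean_days)
print(difference_from_mean)
```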
Pandas DataFrames
DataFrame is a two-dimensional labelled array.
- Columns can be heterogeneous (different data types), like a spreadsheet or SQL table
- Data has two sets of labels: a row index and column names
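Both properties can be seen on a small hand-built DataFrame. The employee records, row labels and column names below are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical employee records with text, integer and float columns
employees = pd.DataFrame(
    {"name": ["Ann", "Bob"], "age": [34, 45], "salary": [5200.0, 6100.5]},
    index=["e1", "e2"],
)

# Each column keeps its own data type
print(employees.dtypes)

# Number of rows and columns
print(employees.shape)

# Look up one value by row label and column name
bob_age = employees.loc["e2", "age"]
print(bob_age)
```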
Create Pandas DataFrame
There are many ways to create a Pandas DataFrame. Below, we will go through some of them. We can create a Pandas DataFrame:
- From a list
- From a dictionary
- From a Series
- From 2D numpy.ndarray
- From a file (text, Excel, CSV) or a database
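The in-memory options from the list above can be sketched as follows. The data and column names are made up for illustration; to_frame() is one of several ways to turn a Series into a DataFrame.

```python
import numpy as np
import pandas as pd

# From a list of rows, with explicit column names
df_from_list = pd.DataFrame([["Jan", 31], ["Feb", 28]], columns=["month", "days"])

# From a dictionary that maps column names to column values
df_from_dict = pd.DataFrame({"month": ["Jan", "Feb"], "days": [31, 28]})

# From a Series: to_frame() produces a one-column DataFrame
days = pd.Series([31, 28], index=["Jan", "Feb"], name="days")
df_from_series = days.to_frame()

# From a 2D NumPy array, with explicit column names
df_from_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

print(df_from_list)
print(df_from_dict)
print(df_from_series)
print(df_from_array)
```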
Import data from file to Pandas DataFrame
Pandas has many functions for reading data from external files. In the example below we use the read_excel function to read an external Excel file containing employee attrition data.
The columns attribute shows all the column names.
The index attribute shows all the row labels.
The values attribute shows all the values as a NumPy array.
    # import pandas library as pd
    import pandas as pd

    # location of the file
    file_path = 'https://github.com/apischdo/Artificial-Intelligence-and-Data-Science/blob/master/WA_Fn-UseC_-HR-Employee-Attrition.xlsx?raw=true'

    # read from file using the read_excel function
    # (reading .xlsx files requires the openpyxl package to be installed)
    # file_data = pd.read_excel(file_path)
    file_data = pd.read_excel(io=file_path, sheet_name=0, index_col='EmployeeNumber')

    # show column names
    print(file_data.columns)
The above code will produce this output
    Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
           'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
           'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
           'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
           'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
           'OverTime', 'PercentSalaryHike', 'PerformanceRating',
           'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
           'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
           'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
           'YearsWithCurrManager'],
          dtype='object')
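The index and values attributes described earlier work the same way as columns. The sketch below uses a tiny hand-built DataFrame as a hypothetical stand-in for the Excel data, so it runs without downloading the file; the two rows are made up.

```python
import pandas as pd

# Hypothetical two-row stand-in for the employee attrition data
small = pd.DataFrame(
    {"Age": [41, 49], "Department": ["Sales", "Research"]},
    index=pd.Index([1, 2], name="EmployeeNumber"),
)

# Column names
print(small.columns)

# Row labels
print(small.index)

# All values as a two-dimensional NumPy array
print(small.values)
```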