Key features of Pandas include:
- Data Cleaning and Preprocessing: Pandas provides functions for handling missing data, cleaning data, and transforming data into a suitable format for analysis.
- Data Analysis: It supports various operations for analyzing and manipulating data, including filtering, grouping, aggregating, merging, and reshaping.
- Time Series Analysis: Pandas has built-in functionality for working with time series data, making it well-suited for applications in finance, economics, and other fields.
- Integration with Other Libraries: It integrates well with other popular data science libraries, such as NumPy, Matplotlib, and Scikit-Learn.
- IO Tools: Pandas supports reading and writing data from and to various file formats, including CSV, Excel, SQL databases, and more.
To use Pandas, you typically start by importing it into your Python script or Jupyter Notebook:
import pandas as pd
After importing, you can create and manipulate DataFrames and Series, perform data analysis, and visualize your results. Pandas is widely used in data science, machine learning, finance, and many other domains where data manipulation and analysis are essential.
The two primary data structures in Pandas are the Series and the DataFrame.
In Pandas, a Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a DataFrame or a single column of data in a spreadsheet. The Series object consists of two main components: the data and the index. The index labels the data, allowing for easy and efficient access to elements.
Here’s a basic example of creating a Pandas Series:
import pandas as pd
# Create a Series from a list
my_series = pd.Series([10, 20, 30, 40, 50])
# Display the Series
print(my_series)
Output:
0    10
1    20
2    30
3    40
4    50
dtype: int64
- The data is [10, 20, 30, 40, 50].
- The default index (0, 1, 2, 3, 4) is created automatically.
You can also provide your own custom index labels:
import pandas as pd
# Create a Series with a custom index
my_series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
# Display the Series
print(my_series)
Output:
a    10
b    20
c    30
d    40
e    50
dtype: int64
Now, the Series has custom index labels (‘a’, ‘b’, ‘c’, ‘d’, ‘e’).
You can access elements in a Series by integer position or by its index labels; with a custom index, positional access should go through .iloc:
# Accessing elements by integer position
print(my_series.iloc[2]) # Output: 30
# Accessing elements using the custom index
print(my_series['c']) # Output: 30
Pandas Series are commonly used for various data manipulation and analysis tasks, and they serve as the building blocks for constructing DataFrames, which are two-dimensional tabular data structures.
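For a first look at that relationship, here is a minimal sketch (names and values are made up for illustration) of assembling a DataFrame from two Series that share an index:
import pandas as pd
# Two Series with the same index become aligned columns of a DataFrame
ages = pd.Series([25, 30, 35], index=['alice', 'bob', 'charlie'])
scores = pd.Series([88.5, 92.0, 79.5], index=['alice', 'bob', 'charlie'])
df = pd.DataFrame({'age': ages, 'score': scores})
print(df)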
Most commonly used functions:
set_option
In Pandas, pd.set_option is a function that allows you to set various options controlling the behavior and display of Pandas. It is a way to customize Pandas’ default settings for better control over its functionality.
For example, you can use pd.set_option to control the maximum number of rows and columns displayed when you print a DataFrame, set the display precision for floating-point numbers, control the behavior of the info() method, and more.
Here is an example:
import pandas as pd
# Set the maximum number of rows and columns to display
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 5)
# Set the precision for floating-point numbers
pd.set_option('display.precision', 2)
# Now, when you display a DataFrame, it will follow the specified settings
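As a quick sanity check, here is a sketch with made-up data showing the effect of those settings:
import pandas as pd
import numpy as np
# A 100-row by 10-column DataFrame of random floats (illustrative data)
wide_df = pd.DataFrame(np.random.rand(100, 10))
# The printed output is truncated to 10 rows and 5 columns, with 2 decimal places
print(wide_df)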
loc vs iloc
In Pandas, loc and iloc are two different indexing methods used to access rows and columns in a DataFrame.
loc (Label-based Indexing):
- loc is primarily label-based indexing, meaning that you use row and column labels to access data.
- It is used to select data by label, or by a boolean array based on the labels.
- The syntax is df.loc[row_label, column_label].
Example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
# Using loc to access data by labels
result_loc = df.loc['row2', 'City']
print(result_loc)
iloc (Integer-based Indexing):
- iloc is primarily integer-based indexing, meaning that you use integer positions to access data.
- It is used to select data by position, based on integer location.
- The syntax is df.iloc[row_index, column_index].
Example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Using iloc to access data by integer indices
result_iloc = df.iloc[1, 2]
print(result_iloc)
In both examples, the selected data is retrieved from the DataFrame. The key difference is that loc uses labels for indexing, while iloc uses integer positions. Choose the appropriate method based on whether you want to access data by label or by position.
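Because loc also accepts boolean arrays, a common pattern is filtering rows by a condition. Here is a minimal sketch reusing the same example data:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
# Boolean mask: rows where Age is greater than 26, with two columns selected
result_bool = df.loc[df['Age'] > 26, ['Name', 'City']]
print(result_bool)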
drop
The drop method in Pandas is used to remove specified rows or columns from a DataFrame. It provides a convenient way to eliminate unwanted data and return a new DataFrame with the specified rows or columns removed. The original DataFrame remains unchanged unless you use the inplace parameter to modify it in place.
Here’s an overview of the drop method:
DataFrame.drop(labels, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
- labels: The index or column labels to be dropped.
- axis: Specifies whether to drop rows (axis=0) or columns (axis=1).
- index: Alternative to specifying labels for row labels.
- columns: Alternative to specifying labels for column labels.
- level: For DataFrames with a multi-level (hierarchical) index, specifies the level from which to drop labels.
- inplace: If True, the original DataFrame is modified in place (default is False).
- errors: If ‘raise’, an error is raised if any of the specified labels are not found. If ‘ignore’, any non-existent labels are silently ignored.
Here are some examples:
Drop Rows by Label:
import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
# Drop row with label 'row2'
df_dropped = df.drop(labels='row2', axis=0)
Drop Columns by Label:
# Drop column with label 'B'
df_dropped_columns = df.drop(labels='B', axis=1)
Drop Rows by Index:
# Drop the row labeled 'row1' via the index keyword (equivalent to labels='row1', axis=0)
df_dropped_index = df.drop(index='row1')
In-Place Modification:
# Drop column 'C' in-place
df.drop(labels='C', axis=1, inplace=True)
- Notice that we used axis=0 to drop a row from the DataFrame, while we used axis=1 to drop a column.
- Also, to make permanent changes to the DataFrame we have to pass the inplace=True parameter.
- After a row is removed, the remaining row labels are left as they were, so the index may no longer be a clean consecutive sequence. We can renumber it by resetting the index of the DataFrame with reset_index.
reset_index
The reset_index method in Pandas is used to reset the index of a DataFrame. It is particularly useful after operations that leave the DataFrame with a modified index. When you reset the index, a new default integer index is created, and the old index is added as a new column.
Here’s the general syntax of the reset_index method:
DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
- level: (optional) Specifies the level(s) of the index to reset. By default, all levels are reset.
- drop: If True, the old index is dropped; if False (default), the old index is kept as a new column.
- inplace: If True, the original DataFrame is modified in place; if False (default), a new DataFrame with the reset index is returned.
- col_level: If the columns are a MultiIndex, specifies the level(s) of the columns to be reset.
- col_fill: If specified, it is used to fill the new columns created after resetting the index.
Here are some examples:
Resetting Index and Keeping Old Index as a Column:
import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
# Resetting the index
df_reset = df.reset_index()
In this example, the new DataFrame df_reset will have a default integer index, and the old index (‘row1’, ‘row2’, ‘row3’) will be added as a new column.
Resetting Index and Dropping Old Index:
# Resetting the index and dropping the old index
df_reset_drop = df.reset_index(drop=True)
In this case, the old index is dropped, and the new DataFrame df_reset_drop will have a default integer index.
Resetting Index for MultiIndex DataFrame:
# Creating a MultiIndex DataFrame
data_multiindex = {'A': [1, 2, 3], 'B': [4, 5, 6]}
multiindex = pd.MultiIndex.from_tuples([('row1', 'col1'), ('row2', 'col2'), ('row3', 'col3')], names=('index', 'column'))
df_multiindex = pd.DataFrame(data_multiindex, index=multiindex)
# Resetting the index for a MultiIndex DataFrame
df_reset_multiindex = df_multiindex.reset_index()
In this example, the reset_index method is used with a MultiIndex DataFrame. It resets all levels of the index, and the new DataFrame will have a default integer index.
The reset_index method is flexible and can be adjusted based on the specific requirements of the data manipulation task.
concat, merge and join
concat, merge, and join are three different methods in Pandas used for combining or merging DataFrames, each serving a different purpose:
concat:
- The concat function is used for concatenating two or more DataFrames along a particular axis (either rows or columns).
- It is useful when you want to stack DataFrames vertically (along rows) or horizontally (along columns).
- The syntax is pd.concat([df1, df2, ...], axis=0) for vertical concatenation (the default), and pd.concat([df1, df2, ...], axis=1) for horizontal concatenation.
Example:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Vertical concatenation
result_vertical = pd.concat([df1, df2], axis=0)
# Horizontal concatenation
result_horizontal = pd.concat([df1, df2], axis=1)
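Note that vertical concatenation keeps each frame’s original row labels, so result_vertical above ends up with the duplicate index 0, 1, 0, 1. If you want a fresh integer index instead, concat accepts an ignore_index flag:
# Renumber the rows 0..3 instead of keeping 0, 1, 0, 1
result_vertical = pd.concat([df1, df2], axis=0, ignore_index=True)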
merge:
- The merge function is used for combining DataFrames based on common columns or indices.
- It performs SQL-style joins (e.g., inner join, outer join, left join, right join) between DataFrames.
- The syntax is pd.merge(df1, df2, on='common_column', how='join_type').
Example:
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value': [3, 4]})
# Inner join based on the 'key' column
result_inner = pd.merge(df1, df2, on='key', how='inner')
# Left join based on the 'key' column
result_left = pd.merge(df1, df2, on='key', how='left')
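Since both frames have a column named value, merge disambiguates them with its default _x and _y suffixes; printing result_inner from the example above would show something like:
  key  value_x  value_y
0   A        1        3
1   B        2        4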
join:
- The join method is used for combining DataFrames based on their indices.
- It is similar to merge but uses the indices by default.
- The syntax is df1.join(df2, how='join_type').
Example:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['row1', 'row2'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['row1', 'row2'])
# Left join based on indices
result_left = df1.join(df2, how='left')
Choose the appropriate method based on your specific merging or concatenation requirements. concat is typically used for simple stacking, while merge and join are more powerful for combining data based on common columns or indices.
cut
pd.cut() is a function provided by the Pandas library for binning data into discrete intervals. It is commonly used to divide a continuous variable into categorical bins.
Here’s how it works:
pd.cut(x, bins, labels=None, ...)
- x: The array-like object to be binned, such as a Series or an array.
- bins: The bin edges. This can be an integer specifying the number of equal-width bins to use, or a sequence of scalar values giving the bin edges.
- labels (optional): The labels for the resulting bins. If not provided, the bins are labeled with their interval ranges (e.g., (0, 4]); passing labels=False returns integer codes instead.
The pd.cut() function returns a new categorical Series where each element is assigned to the appropriate bin based on the bin edges provided.
For example:
import pandas as pd
data = pd.DataFrame({'values': [1, 3, 5, 7, 9]})
bins = [0, 4, 8, 10]
labels = ['low', 'medium', 'high']
data['categories'] = pd.cut(data['values'], bins=bins, labels=labels)
This creates a new column categories in the DataFrame data, where each value in the ‘values’ column is categorized as ‘low’, ‘medium’, or ‘high’ based on the specified bins and labels.
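Printing data after this assignment would show each value mapped to its bin; bins are right-inclusive by default, so 1 and 3 fall into (0, 4]:
   values categories
0       1        low
1       3        low
2       5     medium
3       7     medium
4       9       high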
Some other useful functions that are available to us:
dataset.head(), dataset.tail(), dataset['column'].min(), dataset['column'].max(), dataset['column'].unique(), dataset['column'].value_counts(), dataset['column'].mean(), dataset['column'].median(), dataset['column'].mode()
You can also use methods like .apply() to perform custom operations on your dataset:
def profit(s):
    return s + s * 0.10  # increase of 10%
data['price'].apply(profit)
groupby
The .groupby() method in Pandas is used to group rows of a DataFrame based on some criteria and perform operations on each group independently. It is a powerful and flexible tool for data analysis and manipulation, especially when you want to apply aggregate functions or transformations to subsets of your data.
The basic syntax of the .groupby() method is as follows:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False)
- by: The criteria for grouping, which can be a column name, list of column names, Series, or function. It defines the grouping keys.
- axis: The axis along which to group (default is 0, grouping along rows).
- level: For DataFrames with a multi-level index, specifies the level to group by.
- as_index: Whether to use the grouping keys as the index of the resulting DataFrame (default is True).
- sort: Whether to sort the groups by the group keys (default is True).
- group_keys: Whether to add group keys to the index to identify each group (default is True).
- squeeze: Whether to reduce the dimensionality of the return type if possible (default is False).
- observed: For categorical data, whether to use only the observed values (default is False).
Here’s a simple example of using .groupby():
import pandas as pd
# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)
# Grouping by 'Category' and calculating the mean for each group
grouped_df = df.groupby('Category').mean()
In this example, the DataFrame is grouped by the ‘Category’ column, and the mean of the ‘Value’ column is calculated for each group. The result (grouped_df) is a new DataFrame with the unique values of the ‘Category’ column as the index.
You can then apply various aggregate functions (such as sum, count, min, max, etc.) or use other methods like .apply() to perform custom operations on each group. The .groupby() method is a key tool for performing grouped operations in Pandas.
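For instance, you can compute several aggregates at once with .agg; a minimal sketch, reusing the df from the example above:
# Several aggregates per group in a single pass
summary = df.groupby('Category')['Value'].agg(['mean', 'min', 'max'])
print(summary)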
When as_index is set to True (the default), the grouping keys become part of the index, creating a hierarchical index if multiple keys are used. When as_index is set to False, the grouping keys are not included in the index, and the result is a flat DataFrame.
Here’s an example:
import pandas as pd
# Sample DataFrame
data = {'Category1': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Category2': ['X', 'X', 'Y', 'Y', 'Y', 'X'],
        'Value': [10, 15, 20, 25, 30, 35]}
df = pd.DataFrame(data)
# Grouping with as_index=True (default)
grouped_as_index_true = df.groupby(['Category1','Category2']).sum()
print("Grouped (default)\n")
print(grouped_as_index_true)
# Grouping with as_index=False
grouped_as_index_false = df.groupby(['Category1','Category2'], as_index=False).sum()
print("\nGrouped with as_index=False:\n")
print(grouped_as_index_false)
Grouped (default)

                     Value
Category1 Category2
A         X             10
          Y             50
B         X             50
          Y             25

Grouped with as_index=False:

  Category1 Category2  Value
0         A         X     10
1         A         Y     50
2         B         X     50
3         B         Y     25
sort_values
The .sort_values() method in Pandas is used to sort a DataFrame or Series by the values along either axis. It allows you to specify one or more columns by which to sort the data. Here’s the basic syntax:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
For Series:
Series.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Parameters:
- by: The column or columns by which the DataFrame should be sorted. It can be a single column name, a list of column names, or a Series.
- axis: The axis along which to sort (0 for sorting along rows, 1 for sorting along columns).
- ascending: Whether to sort in ascending order (default is True).
- inplace: If True, the DataFrame or Series is modified in place (default is False).
- kind: The sorting algorithm (‘quicksort’, ‘mergesort’, ‘heapsort’). Default is ‘quicksort’.
- na_position: Specifies the placement of NaN values (‘last’ or ‘first’).
Here’s an example using a DataFrame:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'Salary': [50000, 60000, 45000]}
df = pd.DataFrame(data)
# Sorting the DataFrame by the 'Age' column in ascending order
sorted_df = df.sort_values(by='Age', ascending=True)
In this example, the DataFrame df is sorted by the ‘Age’ column in ascending order. The result (sorted_df) is a new DataFrame with rows sorted based on the specified column.
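Since by also accepts a list, you can sort on several columns at once, with a separate direction for each; a short sketch reusing df from above:
# Sort by Age ascending, breaking ties by Salary descending
sorted_multi = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])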
to_datetime
The pd.to_datetime() function in Pandas is used to convert its argument to datetime. It can parse a wide variety of date formats and is particularly useful for converting DataFrame columns containing date-like strings into datetime objects.
Here’s the basic syntax:
pandas.to_datetime(arg, errors='raise', format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)
- arg: The object to convert to a datetime. This can be a string, a sequence of strings, or a DataFrame column containing date-like strings.
- errors: How to handle parsing errors (‘raise’, ‘coerce’, ‘ignore’). Default is ‘raise’.
- format: A string of format codes specifying the expected format of the date strings (optional).
- exact: Whether to require an exact format match. Default is True.
- unit: If arg is a Unix timestamp, the unit of the timestamp (optional).
- infer_datetime_format: If True and no format is provided, infer the datetime format based on the input.
- origin: Defines the reference date for date parsing (optional).
- cache: If True (default), use a cache of unique, converted dates to apply the datetime conversion. If False, disable caching.
Here’s an example:
import pandas as pd
# Creating a DataFrame with a column of date-like strings
data = {'Date': ['2022-01-01', '2022-02-01', '2022-03-01']}
df = pd.DataFrame(data)
# Converting the 'Date' column to datetime objects
df['Date'] = pd.to_datetime(df['Date'])
After this conversion, the ‘Date’ column in the DataFrame will contain datetime objects, which allows for easy manipulation and analysis of date-related data in Pandas.
If your dates start with the day, you can pass dayfirst=True:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
The ‘date’ column is now in datetime format, and we can render it in any other format:
df['date'].dt.strftime('%m/%d/%Y') # e.g. 02/14/2034
df['date'].dt.strftime('%m-%d-%y') # e.g. 02-14-34 (%y gives a two-digit year)
#you can extract year, month and day where needed
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
To create a new date:
import datetime
date = datetime.date(2022, 1, 29)
print(date)
2022-01-29
to_numeric
In the context of the pd.to_numeric() function in Pandas, the errors='coerce' parameter specifies how to handle errors during the conversion of non-numeric data to numeric data.
When errors='coerce' is used, any non-convertible values are set to NaN (Not a Number). So, if there are values in the data that cannot be converted to a numeric type, instead of raising an error, Pandas replaces those values with NaN.
Here’s how it works:
- If a string cannot be converted to a numeric value, it will be replaced with NaN.
- If a string can be converted to a numeric value, it will be converted accordingly.
- Numeric values will remain unchanged.
Using errors='coerce' is a way to handle data that may contain non-numeric values gracefully, allowing you to work with the converted numeric data while staying aware of any problematic entries.
For example:
import pandas as pd
data = {'numeric_column': [1, 2, 'bad_value', 4, 'another_bad_value', 6]}
df = pd.DataFrame(data)
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
print(df)
Output:
   numeric_column
0             1.0
1             2.0
2             NaN
3             4.0
4             NaN
5             6.0
In this example, the non-convertible values 'bad_value' and 'another_bad_value' are replaced with NaN due to the errors='coerce' parameter.
Saving and Loading Datasets
read_csv and read_excel are two functions in Pandas used to read data from external sources, specifically from CSV (Comma-Separated Values) files and Excel files, respectively.
read_csv
- The read_csv function is used to read data from CSV files.
- It reads the data into a Pandas DataFrame.
- CSV files are plain text files where values are separated by commas (or other delimiters).
- The syntax is pd.read_csv('file_path').
Example:
import pandas as pd
# Reading data from a CSV file
df_csv = pd.read_csv('data.csv')
In this example, data.csv is a CSV file containing the data, and the function reads it into a DataFrame.
read_excel
- The read_excel function is used to read data from Excel files.
- It reads the data into a Pandas DataFrame.
- Excel files may contain multiple sheets, and you can specify the sheet name or index to read.
- The syntax is pd.read_excel('file_path', sheet_name='sheet_name').
Example:
import pandas as pd
# Reading data from an Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
In this example, data.xlsx is an Excel file containing the data, and the function reads the specified sheet (‘Sheet1’) into a DataFrame.
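Worth knowing: passing sheet_name=None makes read_excel return a dictionary mapping sheet names to DataFrames, which is convenient for multi-sheet workbooks. A short sketch, assuming the same data.xlsx as above:
# Read every sheet into a dict of DataFrames keyed by sheet name
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)
df_sheet1 = all_sheets['Sheet1']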
Choose between read_csv and read_excel based on the format of your data file. If your data is in CSV format, use read_csv; if it’s in Excel format, use read_excel. Additionally, both functions support various parameters for customization, allowing you to handle different configurations of data files.
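As an illustration of that customization, here is a sketch of a few common read_csv options (the file name, delimiter, and column names are assumptions for the example):
import pandas as pd
# Hypothetical semicolon-delimited file; load two columns, the first 1000 rows,
# and parse the 'date' column as datetime
df = pd.read_csv('data.csv', sep=';', usecols=['date', 'price'],
                 nrows=1000, parse_dates=['date'])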
to_csv
The to_csv method in Pandas is used to write the contents of a DataFrame to a CSV (Comma-Separated Values) file. It allows you to save your DataFrame as a CSV file for further use or for sharing with others.
Here’s the basic syntax of the to_csv method:
DataFrame.to_csv('file_path', sep=',', index=True, header=True)
- file_path: The file path or file-like object where the CSV data will be written.
- sep: The delimiter to use between fields in the CSV file (default is ,).
- index: Whether to write row names (the index) as the first column (default is True).
- header: Whether to write the column names as the first line (default is True).
Example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Writing the DataFrame to a CSV file
df.to_csv('output.csv', index=False)
In this example, the to_csv method is used to write the DataFrame df to a CSV file named output.csv. The index=False parameter excludes the index column from the CSV file.
After running this code, you will find a file named output.csv in your working directory containing the data from the DataFrame.
to_excel
The to_excel method in Pandas is used to write the contents of a DataFrame to an Excel file. It allows you to save your DataFrame as an Excel spreadsheet for further analysis, reporting, or sharing with others.
Here’s the basic syntax of the to_excel method:
DataFrame.to_excel('file_path', sheet_name='Sheet1', index=True, header=True)
- file_path: The file path or file-like object where the Excel data will be written.
- sheet_name: The name of the sheet in the Excel file (default is ‘Sheet1’).
- index: Whether to write row names (the index) as the first column (default is True).
- header: Whether to write the column names as the first row (default is True).
Example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Writing the DataFrame to an Excel file
df.to_excel('output.xlsx', index=False, sheet_name='MySheet')
In this example, the to_excel method is used to write the DataFrame df to an Excel file named output.xlsx. The index=False parameter excludes the index column from the Excel file, and sheet_name='MySheet' names the sheet ‘MySheet’.
After running this code, you will find a file named output.xlsx in your working directory containing the data from the DataFrame.