Data cleaning is one of the most common and important tasks in data analysis.
In a typical data analysis workflow, we load a dataset from an Excel/CSV/TSV file and perform a series of operations to clean it up. For example, we start by cleaning variable names to make them consistent, renaming some columns, filtering out empty rows or columns, selecting one or more columns, and creating new columns. In Python, pandas has many functions, such as rename(), filter() and query(), that we can use to clean data before applying machine learning algorithms.
Data Cleaning with pyjanitor
pyjanitor, a new Python package inspired by R's janitor package, makes some data cleaning tasks much easier. You can think of pyjanitor as an extension package for pandas that lets you work with pandas data frames through new data-cleaning functions. These functions support chaining multiple calls together, and their names are verbs describing the action we perform.
pyjanitor's GitHub page explains the goals well.
pyjanitor's etymology has a two-fold relationship to "cleanliness". Firstly, it is about extending pandas with convenient data cleaning routines. Secondly, it is about providing a cleaner, method-chaining, verb-based API for common pandas routines.
In this article we will see how to use pyjanitor for the most common data cleaning steps. We will use a toy data set to explore pyjanitor's data-cleaning functions.
import pandas as pd
import numpy as np
First, let's make sure we have pyjanitor installed. pyjanitor can be installed with the conda package manager:
conda install pyjanitor -c conda-forge
Import pyjanitor and check its version. Note that the package is imported under the name janitor.
import janitor
janitor.__version__
'0.20.10'
Let’s create a toy data frame using a dictionary that contains column names as keys and column values as lists.
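# Dates are stored as strings so they can be parsed with an explicit format later.
stocks = {"CompanyName": ["Roku", "Google", pd.NA],
          "DATE": ["20202912", "20202912", pd.NA],
          "STOCK Price": [300, 1700, pd.NA],
          "DIvidend": [pd.NA, pd.NA, pd.NA]}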
We can convert the dictionary to a pandas data frame using from_dict() in pandas.
stocks_df = pd.DataFrame.from_dict(stocks)
stocks_df
Note that our toy data frame has a number of common problems that we generally need to address before analyzing the data. For example, one column name has two words in camel case, another has two words separated by a space, one is all in capital letters, and another has random mixed case. There is also a column that is entirely empty and a row that is entirely empty.
  CompanyName      DATE STOCK Price DIvidend
0        Roku  20202912         300     <NA>
1      Google  20202912        1700     <NA>
2        <NA>      <NA>        <NA>     <NA>
Let's see how we can clean this toy data using pyjanitor's functions.
Clean column names with clean_names() in pyjanitor
We can use pyjanitor's clean_names() function to clean the column names of a pandas data frame. In our example, clean_names() has converted all names to lowercase, and the column name with a space between two words is now joined with an underscore. The column name that was in camel case is now lowercase and a single word.
stocks_df.clean_names()
  companyname      date stock_price dividend
0        Roku  20202912         300     <NA>
1      Google  20202912        1700     <NA>
2        <NA>      <NA>        <NA>     <NA>
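To make the renaming rules concrete, here is a minimal pure-Python sketch of the behaviour described above: lowercase everything and join space-separated words with an underscore. The helper normalize_name is hypothetical and only approximates what we observed on the toy data; it is not pyjanitor's actual implementation, which handles more cases.

```python
import re


def normalize_name(name: str) -> str:
    """Hypothetical helper: lowercase a column name and join words with underscores."""
    # Trim the ends, collapse each run of whitespace into one underscore, then lowercase.
    return re.sub(r"\s+", "_", name.strip()).lower()


raw_names = ["CompanyName", "DATE", "STOCK Price", "DIvidend"]
cleaned = [normalize_name(n) for n in raw_names]
print(cleaned)  # ['companyname', 'date', 'stock_price', 'dividend']
```

In plain pandas, a function like this could be applied with df.rename(columns=normalize_name), since rename() accepts a callable for the columns argument.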
Remove empty columns and rows with remove_empty() in pyjanitor
A common problem when using data from Excel or manually created data is that you often find columns and rows that are completely empty. Our toy dataset contains a row and a column that are completely empty. We can use pyjanitor's remove_empty() function to easily remove the empty row and column. We can also chain it with another function so that cleaning the names and removing empty rows/columns happen in one expression.
In the following example we start by cleaning the names and then chain remove_empty() to delete the empty row and column. We wrap the whole expression in parentheses so that each function call can go on its own line.
(stocks_df
.clean_names()
.remove_empty())
  companyname      date stock_price
0        Roku  20202912         300
1      Google  20202912        1700
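It helps to be clear about what "completely empty" means here. The sketch below uses plain pandas dropna() with how="all" as a rough stand-in for remove_empty(); it is an approximation for illustration, not pyjanitor's implementation. The data is a hypothetical variant of our toy frame in which one date has been blanked out, to show that a partially filled row is kept.

```python
import pandas as pd

# Hypothetical variant of the toy data: row 1 is only partially missing.
toy = pd.DataFrame({
    "companyname": ["Roku", "Google", pd.NA],
    "date": ["20202912", pd.NA, pd.NA],
    "dividend": [pd.NA, pd.NA, pd.NA],
})

# Drop rows in which every value is missing, then columns in which every value is missing.
trimmed = toy.dropna(axis=0, how="all").dropna(axis=1, how="all")
print(trimmed.shape)  # (2, 2): row 2 and the dividend column are gone

# The default how="any" is far stricter: every row has a missing dividend, so nothing survives.
strict = toy.dropna()
print(strict.shape)  # (0, 3)
```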
Rename a column with rename_column() in pyjanitor
We can rename columns in the data frame with pyjanitor's rename_column() function. Here we rename the column companyname to company.
(stocks_df
.clean_names()
.remove_empty()
.rename_column("companyname", "company"))
  company      date stock_price
0    Roku  20202912         300
1  Google  20202912        1700
Add a new column with add_column() in pyjanitor
We can also add new columns to the data frame with pyjanitor's add_column() function. Here we add a new column named size by specifying the column values as a list.
(stocks_df
.clean_names()
.remove_empty()
.rename_column("companyname", "company")
.add_column("size", [1000, 40000]))
  company      date stock_price   size
0    Roku  20202912         300   1000
1  Google  20202912        1700  40000
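The order of the steps matters: the list of new values has two entries, so the empty third row has to be removed before the column is added. The sketch below shows this with plain pandas assign(), used here as a stand-in; we assume, without having checked it, that add_column() applies a comparable length check.

```python
import pandas as pd

toy = pd.DataFrame({"company": ["Roku", "Google", pd.NA]})

# Two values for three rows: pandas refuses the assignment.
try:
    toy.assign(size=[1000, 40000])
    length_error = None
except ValueError as err:
    length_error = str(err)
print(length_error)

# After dropping the empty row, the two values line up with the two remaining rows.
with_size = toy.dropna(how="all").assign(size=[1000, 40000])
print(with_size)
```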
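Chaining pandas and pyjanitor functions
So far we have seen a number of pyjanitor functions and shown how to chain different functions together. Since pyjanitor is an extension of pandas, we can also combine pyjanitor functions with pandas functions.
In the example below we use to_datetime() to convert the date column from string format to datetime format. Called as a data frame method like this, to_datetime() is supplied by pyjanitor as a chainable wrapper around pandas' pd.to_datetime(). The format string %Y%d%m reads each value as year, day, month, so 20202912 becomes 29 December 2020.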
(stocks_df
.clean_names()
.remove_empty()
.rename_column("companyname", "company")
.add_column("size", [1000, 40000])
.to_datetime("date", format="%Y%d%m"))
  company       date stock_price   size
0    Roku 2020-12-29         300   1000
1  Google 2020-12-29        1700  40000
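The date strings in our toy data put the day before the month, which is why the format is %Y%d%m rather than the more familiar %Y%m%d. A quick check with the standard library's datetime.strptime(), which uses the same format codes, shows the difference:

```python
from datetime import datetime

raw = "20202912"

# Year, then day, then month: parses as 29 December 2020.
parsed = datetime.strptime(raw, "%Y%d%m")
print(parsed.date())  # 2020-12-29

# Year, then month, then day: there is no month 29, so parsing fails.
try:
    datetime.strptime(raw, "%Y%m%d")
    misparse_error = None
except ValueError as err:
    misparse_error = err
print(misparse_error)
```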
We can save the cleaned data
stocks_clean = (stocks_df
.clean_names()
.remove_empty()
.rename_column("companyname", "company")
.add_column("size", [1000, 40000])
.to_datetime("date", format="%Y%d%m"))
and check the data types
stocks_clean.dtypes
company                object
date           datetime64[ns]
stock_price            object
size                    int64
dtype: object
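For comparison, here is a rough plain-pandas version of the same pipeline, which may help show what the verb-style methods save us. It is a sketch under the assumption that the toy data looks as described in this article (dates stored as strings), and it approximates the pyjanitor steps rather than reproducing them exactly.

```python
import pandas as pd

stocks_df = pd.DataFrame({
    "CompanyName": ["Roku", "Google", pd.NA],
    "DATE": ["20202912", "20202912", pd.NA],
    "STOCK Price": [300, 1700, pd.NA],
    "DIvidend": [pd.NA, pd.NA, pd.NA],
})

plain = (
    stocks_df
    # Approximates clean_names(): lowercase, spaces to underscores.
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Approximates remove_empty(): drop all-missing rows, then all-missing columns.
    .dropna(axis=0, how="all")
    .dropna(axis=1, how="all")
    # Approximates rename_column().
    .rename(columns={"companyname": "company"})
    # Approximates add_column() and to_datetime().
    .assign(
        size=[1000, 40000],
        date=lambda d: pd.to_datetime(d["date"], format="%Y%d%m"),
    )
)
print(plain)
```

One small caveat when working with the result: plain.size returns the data frame's element count, not our new column, so the column should be accessed as plain["size"].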
pyjanitor began as a Python port of R's janitor package and has gradually gained new functionality, including data reshaping features such as tidyr's pivot_longer(). Look out for an upcoming article on using pyjanitor's pivot_longer() function to reshape wide data into tidy data.