Cleaning Data¶
Data cleaning means fixing bad data in your data set. Bad data could be:
- Empty cells
- Data in wrong format
- Wrong data
- Duplicates
In [66]:
Copied!
import pandas as pd
data = [
[1,60,'2020/12/01',110,130,409.1],
[2,60,'2020/12/02',117,145,479.0],
[3,60,'2020/12/03',103,135,340.0],
[4,45,'2020/12/04',109,175,282.4],
[5,450,'2020/12/08',104,134,253.3],
[6,45,'2020/12/05',117,148,406.0],
[7,60,'2020/12/06',102,127,300.0],
[8,60,'2020/12/07',110,136,374.0],
[9,30,'2020/12/09',109,133,195.1],
[10,45,None,100,119,282.0],
[11,60,'2020/12/10',98,124,269.0],
[12,60,'2020/12/23',130,101,300.0],
[13,45,'2020/12/24',105,132,246.0],
[14,60,'2020/12/25',102,126,334.5],
[15,60,20201226,100,120,250.0],
[16,60,'2020/12/27',92,118,241.0],
[17,60,'2020/12/28',103,132,],
[18,60,'2020/12/29',100,132,280.0],
[19,60,'2020/12/30',102,129,380.3],
[19,60,'2020/12/30',102,129,380.3]
]
df = pd.DataFrame(data, columns =['SN','Duration','Date','Pulse','Maxpulse','Calories'])
print(df)
import pandas as pd
data = [
[1,60,'2020/12/01',110,130,409.1],
[2,60,'2020/12/02',117,145,479.0],
[3,60,'2020/12/03',103,135,340.0],
[4,45,'2020/12/04',109,175,282.4],
[5,450,'2020/12/08',104,134,253.3],
[6,45,'2020/12/05',117,148,406.0],
[7,60,'2020/12/06',102,127,300.0],
[8,60,'2020/12/07',110,136,374.0],
[9,30,'2020/12/09',109,133,195.1],
[10,45,None,100,119,282.0],
[11,60,'2020/12/10',98,124,269.0],
[12,60,'2020/12/23',130,101,300.0],
[13,45,'2020/12/24',105,132,246.0],
[14,60,'2020/12/25',102,126,334.5],
[15,60,20201226,100,120,250.0],
[16,60,'2020/12/27',92,118,241.0],
[17,60,'2020/12/28',103,132,],
[18,60,'2020/12/29',100,132,280.0],
[19,60,'2020/12/30',102,129,380.3],
[19,60,'2020/12/30',102,129,380.3]
]
df = pd.DataFrame(data, columns =['SN','Duration','Date','Pulse','Maxpulse','Calories'])
print(df)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 450 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 None 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 NaN 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 19 60 2020/12/30 102 129 380.3
- The data set contains some empty cells ("Date" in row 10, and "Calories" in row 17)
- The data set contains wrong format ("Date" in row 15)
- The data set contains wrong data ("Duration" in row 5)
- The data set contains duplicates (row 19 and 20)
Cleaning Empty Cells¶
Remove Rows¶
In [56]:
Copied!
df1 = df.dropna()
print(df1)
print(df1.shape)
df1 = df.dropna()
print(df1)
print(df1.shape)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 450 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 20 60 2020/12/30 102 129 380.3 (18, 6)
Replace Empty Values¶
In [57]:
Copied!
df1 = df.fillna(130)
print(df1)
df1 = df.fillna(130)
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 450 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 130 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 130.0 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 20 60 2020/12/30 102 129 380.3
Replace only for Specified Columns¶
In [41]:
Copied!
df1 = df
df1["Calories"].fillna(130, inplace = True)
print(df1)
df1 = df
df1["Calories"].fillna(130, inplace = True)
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 450 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 None 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 130.0 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 20 60 2020/12/30 102 129 380.3
Replace Using Mean, Median, or Mode¶
In [37]:
Copied!
df1 = df.copy()
x = df1["Calories"].mean()
df1["Calories"].fillna(x, inplace = True)
print(df1)
df1 = df.copy()
x = df1["Calories"].mean()
df1["Calories"].fillna(x, inplace = True)
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.100000 1 2 60 2020/12/02 117 145 479.000000 2 3 60 2020/12/03 103 135 340.000000 3 4 45 2020/12/04 109 175 282.400000 4 5 450 2020/12/08 104 134 253.300000 5 6 45 2020/12/05 117 148 406.000000 6 7 60 2020/12/06 102 127 300.000000 7 8 60 2020/12/07 110 136 374.000000 8 9 30 2020/12/09 109 133 195.100000 9 10 45 None 100 119 282.000000 10 11 60 2020/12/10 98 124 269.000000 11 12 60 2020/12/23 130 101 300.000000 12 13 45 2020/12/24 105 132 246.000000 13 14 60 2020/12/25 102 126 334.500000 14 15 60 20201226 100 120 250.000000 15 16 60 2020/12/27 92 118 241.000000 16 17 60 2020/12/28 103 132 315.894737 17 18 60 2020/12/29 100 132 280.000000 18 19 60 2020/12/30 102 129 380.300000 19 20 60 2020/12/30 102 129 380.300000
In [42]:
Copied!
df1 = df.copy()
x = df1["Calories"].median()
df1["Calories"].fillna(x, inplace = True)
print(df1)
df1 = df.copy()
x = df1["Calories"].median()
df1["Calories"].fillna(x, inplace = True)
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 450 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 None 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 130.0 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 20 60 2020/12/30 102 129 380.3
In [48]:
Copied!
df1 = df.copy()
x = df1["Calories"].mode()[0]
df1["Calories"].fillna(x, inplace = True)
print(df1)
df1 = df.copy()
x = df1["Calories"].mode()[0]
df1["Calories"].fillna(x, inplace = True)
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 450 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 None 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 130.0 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 20 60 2020/12/30 102 129 380.3
Cleaning Wrong Format¶
Convert Into a Correct Format¶
In [50]:
Copied!
df1 = df.copy()
df1['Date'] = pd.to_datetime(df1['Date'])
print(df1)
df1 = df.copy()
df1['Date'] = pd.to_datetime(df1['Date'])
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020-12-01 00:00:00.000000000 110 130 409.1 1 2 60 2020-12-02 00:00:00.000000000 117 145 479.0 2 3 60 2020-12-03 00:00:00.000000000 103 135 340.0 3 4 45 2020-12-04 00:00:00.000000000 109 175 282.4 4 5 450 2020-12-08 00:00:00.000000000 104 134 253.3 5 6 45 2020-12-05 00:00:00.000000000 117 148 406.0 6 7 60 2020-12-06 00:00:00.000000000 102 127 300.0 7 8 60 2020-12-07 00:00:00.000000000 110 136 374.0 8 9 30 2020-12-09 00:00:00.000000000 109 133 195.1 9 10 45 NaT 100 119 282.0 10 11 60 2020-12-10 00:00:00.000000000 98 124 269.0 11 12 60 2020-12-23 00:00:00.000000000 130 101 300.0 12 13 45 2020-12-24 00:00:00.000000000 105 132 246.0 13 14 60 2020-12-25 00:00:00.000000000 102 126 334.5 14 15 60 1970-01-01 00:00:00.020201226 100 120 250.0 15 16 60 2020-12-27 00:00:00.000000000 92 118 241.0 16 17 60 2020-12-28 00:00:00.000000000 103 132 130.0 17 18 60 2020-12-29 00:00:00.000000000 100 132 280.0 18 19 60 2020-12-30 00:00:00.000000000 102 129 380.3 19 20 60 2020-12-30 00:00:00.000000000 102 129 380.3
Removing Rows¶
In [53]:
Copied!
df1.dropna(subset=['Date'], inplace = True)
print(df1)
df1.dropna(subset=['Date'], inplace = True)
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020-12-01 00:00:00.000000000 110 130 409.1 1 2 60 2020-12-02 00:00:00.000000000 117 145 479.0 2 3 60 2020-12-03 00:00:00.000000000 103 135 340.0 3 4 45 2020-12-04 00:00:00.000000000 109 175 282.4 4 5 450 2020-12-08 00:00:00.000000000 104 134 253.3 5 6 45 2020-12-05 00:00:00.000000000 117 148 406.0 6 7 60 2020-12-06 00:00:00.000000000 102 127 300.0 7 8 60 2020-12-07 00:00:00.000000000 110 136 374.0 8 9 30 2020-12-09 00:00:00.000000000 109 133 195.1 10 11 60 2020-12-10 00:00:00.000000000 98 124 269.0 11 12 60 2020-12-23 00:00:00.000000000 130 101 300.0 12 13 45 2020-12-24 00:00:00.000000000 105 132 246.0 13 14 60 2020-12-25 00:00:00.000000000 102 126 334.5 14 15 60 1970-01-01 00:00:00.020201226 100 120 250.0 15 16 60 2020-12-27 00:00:00.000000000 92 118 241.0 16 17 60 2020-12-28 00:00:00.000000000 103 132 130.0 17 18 60 2020-12-29 00:00:00.000000000 100 132 280.0 18 19 60 2020-12-30 00:00:00.000000000 102 129 380.3 19 20 60 2020-12-30 00:00:00.000000000 102 129 380.3
Fixing Wrong Data¶
Replacing Values¶
In [60]:
Copied!
df1 = df.copy()
df1.loc[4, 'Duration'] = 45
print(df1)
df1 = df.copy()
df1.loc[4, 'Duration'] = 45
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 45 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 None 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 NaN 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 20 60 2020/12/30 102 129 380.3
In [62]:
Copied!
df1 = df.copy()
for x in df1.index:
if df1.loc[x, "Duration"] > 120:
df1.loc[x, "Duration"] = 120
print(df1)
df1 = df.copy()
for x in df1.index:
if df1.loc[x, "Duration"] > 120:
df1.loc[x, "Duration"] = 120
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 120 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 None 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 NaN 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 20 60 2020/12/30 102 129 380.3
Removing Rows¶
In [63]:
Copied!
df1 = df.copy()
for x in df.index:
if df1.loc[x, "Duration"] > 120:
df1.drop(x, inplace = True)
print(df1)
df1 = df.copy()
for x in df.index:
if df1.loc[x, "Duration"] > 120:
df1.drop(x, inplace = True)
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 None 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 NaN 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3 19 20 60 2020/12/30 102 129 380.3
Removing Duplicates¶
Discovering Duplicates¶
In [67]:
Copied!
print(df.duplicated())
print(df.duplicated())
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 True dtype: bool
Removing Duplicates¶
In [68]:
Copied!
df1 = df.copy()
df1.drop_duplicates(inplace = True)
print(df1)
df1 = df.copy()
df1.drop_duplicates(inplace = True)
print(df1)
SN Duration Date Pulse Maxpulse Calories 0 1 60 2020/12/01 110 130 409.1 1 2 60 2020/12/02 117 145 479.0 2 3 60 2020/12/03 103 135 340.0 3 4 45 2020/12/04 109 175 282.4 4 5 450 2020/12/08 104 134 253.3 5 6 45 2020/12/05 117 148 406.0 6 7 60 2020/12/06 102 127 300.0 7 8 60 2020/12/07 110 136 374.0 8 9 30 2020/12/09 109 133 195.1 9 10 45 None 100 119 282.0 10 11 60 2020/12/10 98 124 269.0 11 12 60 2020/12/23 130 101 300.0 12 13 45 2020/12/24 105 132 246.0 13 14 60 2020/12/25 102 126 334.5 14 15 60 20201226 100 120 250.0 15 16 60 2020/12/27 92 118 241.0 16 17 60 2020/12/28 103 132 NaN 17 18 60 2020/12/29 100 132 280.0 18 19 60 2020/12/30 102 129 380.3
Data Correlations¶
Finding Relationships¶
A great aspect of the Pandas module is the corr() method.
The corr() method calculates the relationship between each column in your data set.
In [69]:
Copied!
df.corr()
df.corr()
C:\Users\prakr\AppData\Local\Temp\ipykernel_15916\1134722465.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. df.corr()
Out[69]:
SN | Duration | Pulse | Maxpulse | Calories | |
---|---|---|---|---|---|
SN | 1.000000 | -0.198346 | -0.417154 | -0.451612 | -0.293618 |
Duration | -0.198346 | 1.000000 | -0.064046 | 0.011792 | -0.168140 |
Pulse | -0.417154 | -0.064046 | 1.000000 | 0.077532 | 0.390848 |
Maxpulse | -0.451612 | 0.011792 | 0.077532 | 1.000000 | 0.244306 |
Calories | -0.293618 | -0.168140 | 0.390848 | 0.244306 | 1.000000 |
In [ ]:
Copied!