Remove duplicate columns by name in Pandas
import pandas as pd
import numpy as np
Create a dataframe
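# Create a dataframe with a repeated 'grade' column.
# A dict literal cannot hold two 'grade' keys (the second silently overwrites
# the first), so pass the rows and the column names separately.
raw_data = [[20, 'blue', 88, 88, 'Willard Morris'],
            [19, 'red', 92, 92, 'Al Jennings']]
df = pd.DataFrame(raw_data,
                  columns=['age', 'favorite_color', 'grade', 'grade', 'name'],
                  index=['Willard Morris', 'Al Jennings'])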
df
| | age | favorite_color | grade | grade | name |
|---|---|---|---|---|---|
| Willard Morris | 20 | blue | 88 | 88 | Willard Morris |
| Al Jennings | 19 | red | 92 | 92 | Al Jennings |
Remove duplicate columns (based on column name)
# keep only the first occurrence of each column name
df = df.loc[:,~df.columns.duplicated()]
df
| | age | favorite_color | grade | name |
|---|---|---|---|---|
| Willard Morris | 20 | blue | 88 | Willard Morris |
| Al Jennings | 19 | red | 92 | Al Jennings |
df.columns.duplicated() returns a boolean array with one entry per column: False means the column name has not appeared earlier (it is the first occurrence), and True means it repeats an earlier name. Only the names are compared, not the column contents.
Pandas supports boolean indexing, which selects only the positions where the array is True.
Since we want to keep the non-duplicated columns, we flip the array with the ~ operator. For the five columns above, [False, False, False, True, False] becomes [True, True, True, False, True], so the second grade column is dropped.
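
To make the intermediate steps visible, here is a minimal sketch that builds the mask, flips it, and applies it one step at a time. The small frame and the variable names (mask, keep, deduped, mask_last) are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical frame for illustration: 'grade' appears twice.
df = pd.DataFrame([[20, 88, 88], [19, 92, 92]],
                  columns=['age', 'grade', 'grade'])

# Step 1: True marks a column whose name already appeared to its left.
mask = df.columns.duplicated()
print(mask.tolist())            # [False, False, True]

# Step 2: flip the mask so True marks the columns we want to keep.
keep = ~mask
print(keep.tolist())            # [True, True, False]

# Step 3: boolean indexing on the column axis keeps only the True positions.
deduped = df.loc[:, keep]
print(list(deduped.columns))    # ['age', 'grade']

# Variant: keep='last' marks every occurrence except the last one.
mask_last = df.columns.duplicated(keep='last')
print(mask_last.tolist())       # [False, True, False]
```

By default duplicated() uses keep='first', which is why the leftmost grade column survives. Passing keep='last' keeps the rightmost one instead, and keep=False marks every repeated name so that all copies are dropped.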