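The problem gives us a pandas DataFrame as input and asks us to remove duplicate rows, using the email column of the original table as the comparison key and keeping only the first (earliest) occurrence of each email.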
Example 1:
Input:
+-------------+---------+---------------------+
| customer_id | name | email |
+-------------+---------+---------------------+
| 1 | Ella | emily@example.com |
| 2 | David | michael@example.com |
| 3 | Zachary | sarah@example.com |
| 4 | Alice | john@example.com |
| 5 | Finn | john@example.com |
| 6 | Violet | alice@example.com |
+-------------+---------+---------------------+
Output:
+-------------+---------+---------------------+
| customer_id | name | email |
+-------------+---------+---------------------+
| 1 | Ella | emily@example.com |
| 2 | David | michael@example.com |
| 3 | Zachary | sarah@example.com |
| 4 | Alice | john@example.com |
| 6 | Violet | alice@example.com |
+-------------+---------+---------------------+
Explanation:
Alice (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained:
| 4 | Alice | john@example.com |
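This problem looks simple at first glance, but what it really tests is attention to detail.
Anyone who has worked with pandas will instinctively think "remove duplicates" and reach for df.drop_duplicates() to produce the answer.
That is indeed the right built-in function, but remember to pass the correct arguments: the comparison must be made on the email column specified by the problem for the answer to be correct.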
import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate rows based on the "email" column, keeping the first occurrence.
    # Note: inplace=True modifies the DataFrame that was passed in.
    customers.drop_duplicates(subset='email', keep='first', inplace=True)
    return customers
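One more reminder: calling customers.drop_duplicates() with its default arguments is wrong for this problem, because by default it compares all columns rather than only email.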
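To make the difference concrete, here is a minimal sketch that rebuilds the Example 1 table and compares the two calls. The variable names by_all_columns and by_email are chosen only for illustration.

```python
import pandas as pd

# Rebuild the input table from Example 1.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "name": ["Ella", "David", "Zachary", "Alice", "Finn", "Violet"],
    "email": [
        "emily@example.com", "michael@example.com", "sarah@example.com",
        "john@example.com", "john@example.com", "alice@example.com",
    ],
})

# Default call: two rows count as duplicates only if ALL columns match,
# so Alice and Finn (different customer_id and name) are both kept.
by_all_columns = customers.drop_duplicates()

# Comparing on "email" only removes Finn (customer_id = 5).
by_email = customers.drop_duplicates(subset="email", keep="first")

print(len(by_all_columns))               # 6
print(by_email["customer_id"].tolist())  # [1, 2, 3, 4, 6]
```

Neither call uses inplace=True here, so customers itself is left unchanged and both results can be compared side by side.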
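Time complexity:
We scan the rows from top to bottom and use the email column as the key to drop duplicate rows. Assuming duplicate detection relies on hash-based lookups, this takes O(n) time on average.
Space complexity:
In the worst case no row is a duplicate, so the resulting table is as large as the original one and requires O(n) space.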
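The single-pass idea behind these bounds can be sketched by hand with a set of emails seen so far. This is only an illustration of the reasoning, not how pandas implements drop_duplicates internally, and the function name drop_duplicate_emails_manual is hypothetical.

```python
import pandas as pd

def drop_duplicate_emails_manual(customers: pd.DataFrame) -> pd.DataFrame:
    """Illustrative one-pass deduplication on the "email" column."""
    seen = set()      # emails encountered so far: up to n entries, O(n) space
    keep_mask = []    # True for rows whose email appears for the first time
    for email in customers["email"]:   # one pass over n rows
        keep_mask.append(email not in seen)  # average O(1) set lookup
        seen.add(email)
    # Select the rows flagged True; the original index labels are preserved.
    return customers.loc[keep_mask]

sample = pd.DataFrame({
    "customer_id": [4, 5, 6],
    "name": ["Alice", "Finn", "Violet"],
    "email": ["john@example.com", "john@example.com", "alice@example.com"],
})
manual_result = drop_duplicate_emails_manual(sample)
print(manual_result["customer_id"].tolist())  # [4, 6]
```

On this sample the result matches sample.drop_duplicates(subset="email"), which is the call to prefer in practice.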
If this is your first time learning df.drop_duplicates(), take a close look at how the result changes with different arguments.
Sample code:
>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
# By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
# To remove duplicates on specific column(s), use subset.
>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
# To remove duplicates and keep last occurrences, use keep.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
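Two further arguments are worth trying on the same data: keep=False, which drops every row that has a duplicate, and ignore_index=True, which relabels the result 0..n-1 (available in pandas 1.0 and later). The sketch below is an extra illustration beyond the examples above; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep=False drops ALL rows whose (brand, style) pair occurs more than once,
# so only the Indomie/cup row (index 2) survives.
only_unique = df.drop_duplicates(subset=['brand', 'style'], keep=False)
print(only_unique.index.tolist())  # [2]

# ignore_index=True keeps the first row per brand (original rows 0 and 2)
# and relabels them 0 and 1.
reindexed = df.drop_duplicates(subset=['brand'], ignore_index=True)
print(reindexed.index.tolist())     # [0, 1]
print(reindexed['brand'].tolist())  # ['Yum Yum', 'Indomie']
```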
Reference: