[Python] Simple Alternatives to Re Regular Expressions for Handling Strings

螃蟹_crab

Published in Python [Basics][Applications][Related]

Published 2024/08/17, updated 2024/08/17. Reading time: about 8 minutes.

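Regular expression syntax can be hard to read, and many string problems can be solved with simpler tools.

This article introduces those alternatives for processing strings: searching for keywords in a string, and other common operations.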

Related article: [Python] Key Points on Commonly Used Symbols in Re Regular Expressions

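String Methods

Python provides many built-in string methods that can replace some simple regular expression tasks.

str.find() and str.index(): locate the position of a substring within a string.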

text = "Hello, world!"
position = text.find("world")  # returns the starting index of "world"
print(f'Found at index: {position}')  # Found at index: 7
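The example above only exercises find(). The two methods differ when the substring is missing: find() returns -1, while index() raises ValueError. A minimal sketch of that difference (the variable names `missing`, `found`, and `hit` are illustrative only):

```python
text = "Hello, world!"

# find() returns -1 when the substring is absent
missing = text.find("Python")
print(missing)  # -1

# index() raises ValueError when the substring is absent
try:
    text.index("Python")
    found = True
except ValueError:
    found = False
print(found)  # False

# On a hit, index() returns the same position as find()
hit = text.index("world")
print(hit)  # 7
```

find() suits cases where a miss is an ordinary outcome to test for; index() suits cases where a miss should be treated as an error.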

str.startswith() and str.endswith(): check whether a string starts or ends with a given substring.

text = "Hello, world!"
result = text.startswith("Hello")  # returns True
if result:
    print('Starts with Hello')  # Starts with Hello
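For completeness, here is a short sketch of endswith(). Both methods also accept a tuple of candidates and return True if any of them matches, which covers a common use of regex alternation. The file name below is a made-up example.

```python
text = "Hello, world!"

ends_with_bang = text.endswith("!")
print(ends_with_bang)  # True

# A tuple of suffixes: True if the string ends with any of them
filename = "report.csv"  # hypothetical file name for illustration
is_table = filename.endswith((".csv", ".tsv"))
print(is_table)  # True
```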

str.replace(): replace occurrences of a substring within a string. It returns a new string; the original is unchanged.

text = "Hello, world!"
new_text = text.replace("world", "Python")  # replace "world" with "Python"
print(new_text)  # Hello, Python!
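By default replace() substitutes every occurrence. An optional third argument limits how many are replaced, counting from the left. A small sketch with an illustrative input string:

```python
text = "a-b-c-d"

# Replace every occurrence (the default)
all_replaced = text.replace("-", "+")
print(all_replaced)  # a+b+c+d

# Replace only the first occurrence
first_only = text.replace("-", "+", 1)
print(first_only)  # a+b-c-d

# The original string is never modified
print(text)  # a-b-c-d
```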

str.split() and str.join(): split a string into a list, and join a list back into a string.

text = "apple,banana,orange"
fruits = text.split(",")  # split into the list ["apple", "banana", "orange"]
print(fruits)  # ['apple', 'banana', 'orange']
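The example above covers split() only. join() is the reverse operation: it is called on the separator and takes the list of strings to combine. A minimal sketch (the " | " separator is an arbitrary choice for illustration):

```python
fruits = ["apple", "banana", "orange"]

# join() is called on the separator string
joined = " | ".join(fruits)
print(joined)  # apple | banana | orange

# Splitting and re-joining with the same separator restores the original
text = "apple,banana,orange"
round_trip = ",".join(text.split(","))
print(round_trip == text)  # True
```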

str.count(): count how many times a substring occurs in a string.

text = "banana"
count = text.count("a")  # returns 3
print(count)  # 3
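count() also works with substrings longer than one character, and it counts non-overlapping occurrences only. The sketch below shows where that matters:

```python
text = "banana"

# "ana" appears at index 1 and index 3, but the two overlap,
# so only one non-overlapping occurrence is counted
ana_count = text.count("ana")
print(ana_count)  # 1

# "aaaa" holds two non-overlapping copies of "aa"
aa_count = "aaaa".count("aa")
print(aa_count)  # 2
```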

Pandas

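For large datasets or more complex string operations, Pandas offers a lot of functionality. In particular, the methods under the .str accessor apply a string operation to every element of a Series (for example, one column of a DataFrame).

str.contains(): check whether each element of a Series contains a given string. The pattern is treated as a regular expression by default.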

import pandas as pd
df = pd.DataFrame({"text": ["apple", "banana", "cherry"]})
# Add a new column named contains_a to hold the result
df["contains_a"] = df["text"].str.contains("a")
# Convert back to a list of dicts, one per row
df_dict = df.to_dict(orient="records")
print(df_dict)
# Output: [{'text': 'apple', 'contains_a': True}, {'text': 'banana', 'contains_a': True}, {'text': 'cherry', 'contains_a': False}]

Resulting df

The data now has a new column, contains_a, recording whether each value contains 'a'.

     text  contains_a
0   apple        True
1  banana        True
2  cherry       False
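Because str.contains() interprets its pattern as a regular expression unless told otherwise, characters such as "." have special meaning. Passing regex=False gives a plain substring test, and case=False ignores case. A hedged sketch with made-up sample values:

```python
import pandas as pd

s = pd.Series(["a.b", "axb", "A.B"])  # illustrative sample data

# Default: "." is a regex wildcard, so "axb" also matches
as_regex = s.str.contains("a.b").tolist()
print(as_regex)  # [True, True, False]

# regex=False: "." is matched literally
as_literal = s.str.contains("a.b", regex=False).tolist()
print(as_literal)  # [True, False, False]

# case=False: literal match that ignores upper/lower case
ignore_case = s.str.contains("a.b", regex=False, case=False).tolist()
print(ignore_case)  # [True, False, True]
```

When the goal is to avoid regular expressions altogether, regex=False keeps the behavior in line with the built-in `in` operator.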

str.extract(): extract the part of each element that matches the capture group of a given regular expression.

import pandas as pd
df = pd.DataFrame({"text": ["apple", "banana", "cherry"]})
df["extracted"] = df["text"].str.extract(r"(a\w+)")
# Print the DataFrame with the new extracted column
print(df)

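str.extract(): a pandas method that extracts, from each element of a Series, the text matched by the capture groups of the given regular expression. With the default expand=True it returns a DataFrame, even when there is only one capture group.
r"(a\w+)":
- r"" is the raw string prefix, so Python does not treat backslashes in the string as escape sequences.
- (a\w+) is a regular expression with one capture group, in which:
- - a matches the letter "a".
- - \w+ matches one or more word characters immediately after the "a" (\w matches letters, digits, and the underscore; + means one or more times).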

Resulting df

     text extracted
0   apple     apple
1  banana     anana
2  cherry       NaN

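Analysis of the results:

"apple": the match starts at "a" and continues through "pple", so the result is "apple".
"banana": the result is "anana", because the match starts at the first "a" and \w+ greedily consumes the rest of the word.
"cherry": contains no letter "a", so the result is NaN (no match).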
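Since str.extract() returns a DataFrame by default, it can be convenient to ask for a Series when the pattern has a single capture group. The sketch below passes expand=False for that purpose and reuses the same data and pattern; the variable name `extracted` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"text": ["apple", "banana", "cherry"]})

# expand=False with one capture group returns a Series instead of a DataFrame
extracted = df["text"].str.extract(r"(a\w+)", expand=False)
print(type(extracted).__name__)  # Series

values = extracted.tolist()
print(values[:2])  # ['apple', 'anana']

# The row without a match holds a missing value
no_match = pd.isna(extracted.iloc[2])
print(no_match)  # True
```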

List Comprehensions

For simple matching and replacement, list comprehensions can be very intuitive.

Filtering the elements of a list:

words = ["apple", "banana", "cherry", "date"]
filtered = [word for word in words if "a" in word]  # words containing "a"

print(filtered)  # Output: ['apple', 'banana', 'date']

Modifying the elements of a list:

words = ["apple", "banana", "cherry", "date"]
modified = [word.upper() for word in words]  # convert every word to uppercase

print(modified)  # Output: ['APPLE', 'BANANA', 'CHERRY', 'DATE']
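Filtering and modifying can also be combined in a single comprehension, together with the string methods covered earlier. A minimal sketch, assuming the goal is to uppercase only the words that start with "a" or "b" (an arbitrary condition chosen for illustration):

```python
words = ["apple", "banana", "cherry", "date"]

# Keep words starting with "a" or "b", then uppercase them
selected = [word.upper() for word in words if word.startswith(("a", "b"))]

print(selected)  # ['APPLE', 'BANANA']
```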