不間斷 Python 挑戰 Day 28 - 處理CSV文件

2022/03/07閱讀時間約 46 分鐘
CSV全名為Comma-Separated Values,中文稱為逗號分隔值,也可稱為字元分隔值,因為分隔字元可以不是逗號。它以純文字的形式儲存表格資料,同一列的資料以逗號或其它符號分隔成不同欄位,每一列的資料間以換行符號分隔。網路上很多資料的格式都是以CSV檔案呈現,例如交通部中央氣象局觀測資料查詢系統、或是台灣證券交易所的上市股票交易資訊等。這篇文章會說明如何以Python內建的csv模組來處理CSV檔案。

開啟CSV檔案

台灣證券交易所的上市個股日成交資訊為例,進入台灣證券交易所網站後,點選"交易資訊"中的"個股日成交資訊",進入查詢畫面。
台灣證券交易所
假設我們想要查詢"元大台灣50"在二月份的日成交資訊,可以在股票代碼欄位輸入"元大台灣50"的股票代碼"0050",按下查詢後便可在網頁上顯示出股票的開、高、低、收等資訊,並可點下"CSV下載"的連結,將該表格的CSV檔案下載下來。
個股日成交資訊
111年2月元大台灣50日成交資訊
以記事本開啟下載的檔案,可以看到這個CSV檔案的原始格式。
使用記事本開啟CSV檔
以下是使用Excel開啟檔案的結果。
使用Excel開啟CSV檔

讀取CSV檔案

如同在上一節開啟純文字文件的方式,CSV檔案也適用同樣的方式開啟,搭配with關鍵字,使用open()函數開啟CSV檔案後,可用read()函數將檔案內容讀出來,或是使用readlines()函數將檔案內容存成串列。
with open("STOCK_DAY_0050_202202.csv") as csv_read_file:
  csv_read_data = csv_read_file.readlines()
  print(csv_read_data)
執行結果如下:
['"111年02月 0050 元大台灣50       各日成交資訊"\n', '"日期","成交股數","成交金額","開盤價","最高價","最低價","收盤價","漲跌價差","成交筆數",\n', '"111/02/07","25,312,461","3,593,629,562","142.60","142.80","141.20","142.30","+0.75","27,821",\n', '"111/02/08","11,521,967","1,648,442,209","143.30","143.50","142.80","142.90","+0.60","12,271",\n', '"111/02/09","8,470,497","1,211,943,126","143.50","143.50","142.65","143.40","+0.50","10,362",\n', '"111/02/10","6,325,130","912,680,055","143.90","145.05","143.50","145.05","+1.65","9,065",\n', '"111/02/11","4,556,282","659,321,520","144.70","145.05","144.05","144.95","-0.10","6,376",\n', '"111/02/14","12,907,152","1,841,179,721","143.95","143.95","142.20","142.70","-2.25","23,175",\n', '"111/02/15","8,139,525","1,160,338,475","142.70","143.20","141.95","142.05","-0.65","16,664",\n', '"111/02/16","6,906,308","995,478,507","143.90","144.45","143.80","144.40","+2.35","9,252",\n', '"111/02/17","5,161,916","746,195,906","144.40","145.00","144.00","144.50","+0.10","7,328",\n', '"111/02/18","4,068,622","584,581,991","143.50","144.15","143.05","144.10","-0.40","5,997",\n', '"111/02/21","4,807,904","689,347,421","143.45","143.85","142.80","143.80","-0.30","8,772",\n', '"111/02/22","20,814,988","2,943,619,804","142.60","142.65","140.75","141.85","-1.95","42,631",\n', '"111/02/23","4,803,500","681,745,270","141.85","142.40","141.65","142.25","+0.40","8,142",\n', '"111/02/24","32,725,876","4,565,434,642","141.10","141.10","138.30","138.80","-3.45","67,346",\n', '"111/02/25","24,731,645","3,424,810,229","139.05","139.20","137.85","138.50","-0.30","43,146",\n', '"說明:"\n', '"符號說明:+/-/X表示漲/跌/不比價"\n', '"當日統計資訊含一般、零股、盤後定價、鉅額交易,不含拍賣、標購。"\n', '"ETF證券代號第六碼為K、M、S、C者,表示該ETF以外幣交易。"\n', '\n']
可以注意到若以這種方式讀取檔案,串列的元素是以字串的形式存在,且將每個欄位以及逗號合在一起,難以做進一步的資料處理。在此,我們可以導入csv模組的reader()方法建立reader物件,再使用for迴圈逐行讀出reader物件資料,或是將其轉換為串列,每個欄位就可以被分開,成為串列的一個元素。
  • 使用for迴圈逐行讀出Reader物件資料:
with open("STOCK_DAY_0050_202202.csv") as csv_read_file:
  csv_read_data = csv.reader(csv_read_file)
  for row in csv_read_read_data:
    print(row)
執行結果:
['111年02月 0050 元大台灣50       各日成交資訊']
['日期', '成交股數', '成交金額', '開盤價', '最高價', '最低價', '收盤價', '漲跌價差', '成交筆數', '']
['111/02/07', '25,312,461', '3,593,629,562', '142.60', '142.80', '141.20', '142.30', '+0.75', '27,821', '']
['111/02/08', '11,521,967', '1,648,442,209', '143.30', '143.50', '142.80', '142.90', '+0.60', '12,271', '']
['111/02/09', '8,470,497', '1,211,943,126', '143.50', '143.50', '142.65', '143.40', '+0.50', '10,362', '']
['111/02/10', '6,325,130', '912,680,055', '143.90', '145.05', '143.50', '145.05', '+1.65', '9,065', '']
['111/02/11', '4,556,282', '659,321,520', '144.70', '145.05', '144.05', '144.95', '-0.10', '6,376', '']
['111/02/14', '12,907,152', '1,841,179,721', '143.95', '143.95', '142.20', '142.70', '-2.25', '23,175', '']
['111/02/15', '8,139,525', '1,160,338,475', '142.70', '143.20', '141.95', '142.05', '-0.65', '16,664', '']
['111/02/16', '6,906,308', '995,478,507', '143.90', '144.45', '143.80', '144.40', '+2.35', '9,252', '']
['111/02/17', '5,161,916', '746,195,906', '144.40', '145.00', '144.00', '144.50', '+0.10', '7,328', '']
['111/02/18', '4,068,622', '584,581,991', '143.50', '144.15', '143.05', '144.10', '-0.40', '5,997', '']
['111/02/21', '4,807,904', '689,347,421', '143.45', '143.85', '142.80', '143.80', '-0.30', '8,772', '']
['111/02/22', '20,814,988', '2,943,619,804', '142.60', '142.65', '140.75', '141.85', '-1.95', '42,631', '']
['111/02/23', '4,803,500', '681,745,270', '141.85', '142.40', '141.65', '142.25', '+0.40', '8,142', '']
['111/02/24', '32,725,876', '4,565,434,642', '141.10', '141.10', '138.30', '138.80', '-3.45', '67,346', '']
['111/02/25', '24,731,645', '3,424,810,229', '139.05', '139.20', '137.85', '138.50', '-0.30', '43,146', '']
['說明:']
['符號說明:+/-/X表示漲/跌/不比價']
['當日統計資訊含一般、零股、盤後定價、鉅額交易,不含拍賣、標購。']
['ETF證券代號第六碼為K、M、S、C者,表示該ETF以外幣交易。']
[]
  • 將Reader物件轉換為串列,再使用for迴圈逐行讀出串列內容:
with open("STOCK_DAY_0050_202202.csv") as csv_read_file:
  csv_read_data = csv.reader(csv_read_file)
  csv_read_list = list(csv_read_data)
  for row in csv_read_list:
    print(row)
執行得到相同的結果:
['111年02月 0050 元大台灣50       各日成交資訊']
['日期', '成交股數', '成交金額', '開盤價', '最高價', '最低價', '收盤價', '漲跌價差', '成交筆數', '']
['111/02/07', '25,312,461', '3,593,629,562', '142.60', '142.80', '141.20', '142.30', '+0.75', '27,821', '']
['111/02/08', '11,521,967', '1,648,442,209', '143.30', '143.50', '142.80', '142.90', '+0.60', '12,271', '']
['111/02/09', '8,470,497', '1,211,943,126', '143.50', '143.50', '142.65', '143.40', '+0.50', '10,362', '']
['111/02/10', '6,325,130', '912,680,055', '143.90', '145.05', '143.50', '145.05', '+1.65', '9,065', '']
['111/02/11', '4,556,282', '659,321,520', '144.70', '145.05', '144.05', '144.95', '-0.10', '6,376', '']
['111/02/14', '12,907,152', '1,841,179,721', '143.95', '143.95', '142.20', '142.70', '-2.25', '23,175', '']
['111/02/15', '8,139,525', '1,160,338,475', '142.70', '143.20', '141.95', '142.05', '-0.65', '16,664', '']
['111/02/16', '6,906,308', '995,478,507', '143.90', '144.45', '143.80', '144.40', '+2.35', '9,252', '']
['111/02/17', '5,161,916', '746,195,906', '144.40', '145.00', '144.00', '144.50', '+0.10', '7,328', '']
['111/02/18', '4,068,622', '584,581,991', '143.50', '144.15', '143.05', '144.10', '-0.40', '5,997', '']
['111/02/21', '4,807,904', '689,347,421', '143.45', '143.85', '142.80', '143.80', '-0.30', '8,772', '']
['111/02/22', '20,814,988', '2,943,619,804', '142.60', '142.65', '140.75', '141.85', '-1.95', '42,631', '']
['111/02/23', '4,803,500', '681,745,270', '141.85', '142.40', '141.65', '142.25', '+0.40', '8,142', '']
['111/02/24', '32,725,876', '4,565,434,642', '141.10', '141.10', '138.30', '138.80', '-3.45', '67,346', '']
['111/02/25', '24,731,645', '3,424,810,229', '139.05', '139.20', '137.85', '138.50', '-0.30', '43,146', '']
['說明:']
['符號說明:+/-/X表示漲/跌/不比價']
['當日統計資訊含一般、零股、盤後定價、鉅額交易,不含拍賣、標購。']
['ETF證券代號第六碼為K、M、S、C者,表示該ETF以外幣交易。']
[]

使用索引讀取串列內容

假設我們想讀取"元大台灣50"在二月份每天的收盤價,觀察一下上方資料的型態,可以看到資料的最前和最後有一些不屬於成交資訊的內容,因此對資料做一些簡易的判斷,捨去長度為1的列,即可用"日期"及"收盤價"所在的索引值讀出資料內容。
with open("STOCK_DAY_0050_202202.csv") as csv_read_file:
  csv_read_data = csv.reader(csv_read_file)
  csv_read_list = list(csv_read_data)
  for row in range(len(csv_read_list)-1):
    if len(csv_read_list[row]) > 1:
      print(csv_read_list[row][0], csv_read_list[row][6])
執行結果:
日期 收盤價
111/02/07 142.30
111/02/08 142.90
111/02/09 143.40
111/02/10 145.05
111/02/11 144.95
111/02/14 142.70
111/02/15 142.05
111/02/16 144.40
111/02/17 144.50
111/02/18 144.10
111/02/21 143.80
111/02/22 141.85
111/02/23 142.25
111/02/24 138.80
111/02/25 138.50

寫入CSV檔案

相對於讀取CSV檔案,寫入CSV檔案時,我們需要使用csv模組的writer()方法建立一個writer物件,再使用writerow()或writerows()方法逐行或一次寫入多行資料到CSV檔案。
以下範例將建立一個CSV檔案,將我們剛剛下載的"元大台灣50"在二月份的日成交資訊複製過來,其中newline=""的目的是避免每一列之間多出一個空白列。
with open("STOCK_DAY_0050_202202_copy.csv", mode="w", newline="") as csv_write_file:
  csv_write_data = csv.writer(csv_write_file)
  csv_write_data.writerows(csv_read_list)
執行過後,在主目錄底下會多出一個STOCK_DAY_0050_202202_copy.csv檔案,內容和STOCK_DAY_0050_202202.csv相同。
複製的CSV

設定分隔符號:delimiter關鍵字

前面提到,CSV檔案的分隔字元可以不是逗號,在寫入CSV檔案時,需要在writer()方法內用到delimiter關鍵字,如以下範例使用空格" "做為分隔符號。
with open("STOCK_DAY_0050_202202_delimiter.csv", mode="w", newline="") as csv_write_file:
  csv_write_data = csv.writer(csv_write_file, delimiter=" ")
  csv_write_data.writerows(csv_read_list)
可以看到欄位之間的分隔不再是預設的逗點,而是變成空格。
使用空格做為欄位分隔符號
在讀取分隔字元不是逗點的CSV檔案時,reader()方法也要做對應的設定,否則就會出現非預期的結果,如以下的錯誤範例,它在讀取前面使用空格分隔的CSV檔案"STOCK_DAY_0050_202202_delimiter.csv"時,把數字內的逗點當成了分隔點。
with open("STOCK_DAY_0050_202202_delimiter.csv") as csv_read_file:
  csv_read_data = csv.reader(csv_read_file)
  csv_read_list = list(csv_read_data)
  for row in csv_read_list:
    print(row)
執行結果:
['111年02月 0050 元大台灣50       各日成交資訊']
['日期 成交股數 成交金額 開盤價 最高價 最低價 收盤價 漲跌價差 成交筆數 ']
['111/02/07 25', '312', '461 3', '593', '629', '562 142.60 142.80 141.20 142.30 +0.75 27', '821 ']
['111/02/08 11', '521', '967 1', '648', '442', '209 143.30 143.50 142.80 142.90 +0.60 12', '271 ']
['111/02/09 8', '470', '497 1', '211', '943', '126 143.50 143.50 142.65 143.40 +0.50 10', '362 ']
['111/02/10 6', '325', '130 912', '680', '055 143.90 145.05 143.50 145.05 +1.65 9', '065 ']
['111/02/11 4', '556', '282 659', '321', '520 144.70 145.05 144.05 144.95 -0.10 6', '376 ']
['111/02/14 12', '907', '152 1', '841', '179', '721 143.95 143.95 142.20 142.70 -2.25 23', '175 ']
['111/02/15 8', '139', '525 1', '160', '338', '475 142.70 143.20 141.95 142.05 -0.65 16', '664 ']
['111/02/16 6', '906', '308 995', '478', '507 143.90 144.45 143.80 144.40 +2.35 9', '252 ']
['111/02/17 5', '161', '916 746', '195', '906 144.40 145.00 144.00 144.50 +0.10 7', '328 ']
['111/02/18 4', '068', '622 584', '581', '991 143.50 144.15 143.05 144.10 -0.40 5', '997 ']
['111/02/21 4', '807', '904 689', '347', '421 143.45 143.85 142.80 143.80 -0.30 8', '772 ']
['111/02/22 20', '814', '988 2', '943', '619', '804 142.60 142.65 140.75 141.85 -1.95 42', '631 ']
['111/02/23 4', '803', '500 681', '745', '270 141.85 142.40 141.65 142.25 +0.40 8', '142 ']
['111/02/24 32', '725', '876 4', '565', '434', '642 141.10 141.10 138.30 138.80 -3.45 67', '346 ']
['111/02/25 24', '731', '645 3', '424', '810', '229 139.05 139.20 137.85 138.50 -0.30 43', '146 ']
['說明:']
['符號說明:+/-/X表示漲/跌/不比價']
['當日統計資訊含一般、零股、盤後定價、鉅額交易,不含拍賣、標購。']
['ETF證券代號第六碼為K、M、S、C者,表示該ETF以外幣交易。']
[]
加上正確的delimiter關鍵字即可正確讀出CSV檔案內容。
with open("STOCK_DAY_0050_202202_delimiter.csv") as csv_read_file:
  csv_read_data = csv.reader(csv_read_file, delimiter=" ")
  csv_read_list = list(csv_read_data)
  for row in csv_read_list:
    print(row)
執行結果:
['111年02月 0050 元大台灣50       各日成交資訊']
['日期', '成交股數', '成交金額', '開盤價', '最高價', '最低價', '收盤價', '漲跌價差', '成交筆數', '']
['111/02/07', '25,312,461', '3,593,629,562', '142.60', '142.80', '141.20', '142.30', '+0.75', '27,821', '']
['111/02/08', '11,521,967', '1,648,442,209', '143.30', '143.50', '142.80', '142.90', '+0.60', '12,271', '']
['111/02/09', '8,470,497', '1,211,943,126', '143.50', '143.50', '142.65', '143.40', '+0.50', '10,362', '']
['111/02/10', '6,325,130', '912,680,055', '143.90', '145.05', '143.50', '145.05', '+1.65', '9,065', '']
['111/02/11', '4,556,282', '659,321,520', '144.70', '145.05', '144.05', '144.95', '-0.10', '6,376', '']
['111/02/14', '12,907,152', '1,841,179,721', '143.95', '143.95', '142.20', '142.70', '-2.25', '23,175', '']
['111/02/15', '8,139,525', '1,160,338,475', '142.70', '143.20', '141.95', '142.05', '-0.65', '16,664', '']
['111/02/16', '6,906,308', '995,478,507', '143.90', '144.45', '143.80', '144.40', '+2.35', '9,252', '']
['111/02/17', '5,161,916', '746,195,906', '144.40', '145.00', '144.00', '144.50', '+0.10', '7,328', '']
['111/02/18', '4,068,622', '584,581,991', '143.50', '144.15', '143.05', '144.10', '-0.40', '5,997', '']
['111/02/21', '4,807,904', '689,347,421', '143.45', '143.85', '142.80', '143.80', '-0.30', '8,772', '']
['111/02/22', '20,814,988', '2,943,619,804', '142.60', '142.65', '140.75', '141.85', '-1.95', '42,631', '']
['111/02/23', '4,803,500', '681,745,270', '141.85', '142.40', '141.65', '142.25', '+0.40', '8,142', '']
['111/02/24', '32,725,876', '4,565,434,642', '141.10', '141.10', '138.30', '138.80', '-3.45', '67,346', '']
['111/02/25', '24,731,645', '3,424,810,229', '139.05', '139.20', '137.85', '138.50', '-0.30', '43,146', '']
['說明:']
['符號說明:+/-/X表示漲/跌/不比價']
['當日統計資訊含一般、零股、盤後定價、鉅額交易,不含拍賣、標購。']
['ETF證券代號第六碼為K、M、S、C者,表示該ETF以外幣交易。']
[]

程式範例

為什麼會看到廣告
Wei-Jie Weng
Wei-Jie Weng
留言0
查看全部
發表第一個留言支持創作者!