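After the previous two posts on environment setup and data processing, it's time to move on to building the chart.
Previously:
Visualizing Walmart's Financial Statements - 01 (Environment Setup, File Overview)
Walmart Financial Statement Visualization - 02 Converting to a Pandas DataFrame
First, we group the figures we cleaned up earlier (revenue, net profit, and so on) into categories. The blue part is Total_Revenue.
The blue block is what we are building here.
These are the categories of financial data; each dictionary represents one financial metric, for example: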
Segments = {  # these entries are grouped together and coloured blue
    'Membership': round(df_Revenue_b1, 2),  # round to two decimal places
    'U.S.': round(df_Revenue_b2, 2),
    'International': round(df_Revenue_b3, 2),
    'Sam Club': round(df_Revenue_b4, 2)
}
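A dictionary stores keys and values in the form {"key": "value"}. For example, print(Segments['Membership']) prints the value of df_Revenue_b1, rounded to two decimal places.
Compared with a list [ ], a dictionary lets you look a value up directly by its key, which is faster than scanning through every element. Keys are also unique, so the same key can never appear twice.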
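To make the key lookup concrete, here is a minimal sketch. The name `segments` and the amounts are illustrative placeholders, not Walmart's actual figures.

```python
# Illustrative amounts only (assumed for this sketch, not real report data).
segments = {
    'Membership': 10.5,
    'U.S.': 50.2,
    'International': 20.1,
    'Sam Club': 15.3,
}

# Look a value up directly by its key.
print(segments['Membership'])  # 10.5

# Assigning to an existing key overwrites the value; no duplicate key is created.
segments['Membership'] = 11.0
print(len(segments))  # 4
```

Note that uniqueness applies to keys only; two different keys can still hold the same value.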
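df_Revenue_b1 = membership revenue
df_Revenue_b2 = U.S. market revenue
df_Revenue_b3 = international market revenue
df_Revenue_b4 = Sam's Club revenue
The values of these variables come from the Excel financial data we read in earlier.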
Other financial metrics:
Total_Revenue = {'Total Revenue': round(df_Total_Revenue, 2)}
COGS = {'COGS': round(df_Total_COGS, 2)}
GP = {'Gross Profit': round(df_GP, 2)}
Operating_Income = {'Operating Income': round(0, 2)}
df_Total_Revenue = total revenue
df_Total_COGS = total cost of goods sold (COGS)
df_GP = gross profit
df_OP = operating profit
📌 The purpose of these dictionaries is to organise the data so it is ready for plotting later.
if df_OP >= 0:  # operating result is zero or positive
    OP = {'Operating Profit': round(df_OP, 2)}  # store it as profit
    OL = {'Operating Loss': round(0, 2)}
else:
    OP = {'Operating Profit': round(0, 2)}
    OL = {'Operating Loss': round(-df_OP, 2)}  # negative, so store its magnitude as loss
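The same profit/loss split can be wrapped in a small function so it is easy to try with different numbers. This is only a sketch: `split_operating_result` is a hypothetical helper name, and the inputs are made up.

```python
def split_operating_result(operating_result):
    """Return (profit_dict, loss_dict); at most one of them holds a non-zero amount."""
    if operating_result >= 0:
        return ({'Operating Profit': round(operating_result, 2)},
                {'Operating Loss': 0})
    # A negative result is stored as a positive loss amount.
    return ({'Operating Profit': 0},
            {'Operating Loss': round(-operating_result, 2)})


profit, loss = split_operating_result(-3.456)  # made-up input
print(profit)  # {'Operating Profit': 0}
print(loss)    # {'Operating Loss': 3.46}
```

Storing the loss as a positive number matters because a Sankey link value represents a flow size, so it should not be negative.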
What is label? Explaining list(Segments.keys())
label = ( list(Segments.keys())  # .keys() returns the dictionary's keys
+ list(Total_Revenue.keys())
+ list(COGS.keys())
+ list(GP.keys())
+ list(OP.keys())
+ list(Total_Operating_Expenses.keys())
+ list(Operating_Expenses.keys())
+ list(Net_Interest_Income.keys())
+ list(Net_Interest_Expense.keys())
+ list(Pretax_Profit.keys())
+ list(Tax_Expense.keys())
+ list(AfterTax_Revenue.keys())
+ list(Net_Profit.keys())
)
What this statement does:
📌 label is a list containing the names of all the financial metrics, for example:
['Membership', 'U.S.', 'International', 'Sam Club',
'Total Revenue', 'COGS', 'Gross Profit', 'Operating Profit', 'Pretax Profit', 'Tax Expense', 'Net Profit']
These become the node names in the Sankey diagram, such as the U.S. node.
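To extract the values instead, use list(Segments.values()). label_value is a list containing the amounts from the financial statements, for example:
[10.5, 50.2, 20.1, 15.3, 110.4, 60.8, 49.6, 30.2, 25.1, 5.0, 20.1]
These values line up one-to-one with the Sankey nodes and give the amount for each financial metric.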
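The post does not show how label_value is assembled, so the following is only a guess at one way to do it: collect the values in the same order as the keys. The lowercase names and the amounts are assumptions made for this sketch.

```python
# Two small example dictionaries (illustrative amounts, not real report data).
segments = {'Membership': 10.5, 'U.S.': 50.2}
total_revenue = {'Total Revenue': 60.7}

# Walk the dictionaries in one fixed order for both names and amounts,
# so position i in label always matches position i in label_value.
groups = [segments, total_revenue]
label = [name for group in groups for name in group.keys()]
label_value = [amount for group in groups for amount in group.values()]

print(label)        # ['Membership', 'U.S.', 'Total Revenue']
print(label_value)  # [10.5, 50.2, 60.7]
```

This relies on dictionaries preserving insertion order, which Python guarantees from version 3.7 onward.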
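Question: why not store everything as a list [ ] from the start?
Stored as a flat list like the one below, lookups get slower (you have to search from the first element onward), and you can no longer look a value up by its key:
Segments = ['Membership', 10.5, 'U.S.', 50.2, 'International', 20.1, 'Sam Club', 15.3]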
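label_combined: combining everything into the "name - amount" format
label_value_strings = [str(x) for x in label_value]
label_combined = [i + ' - ' + '$' + j + 'B' for i, j in zip(label, label_value_strings)]
This code:
converts each number in label_value to a string
joins each name with its amount, e.g. Total Revenue - $110.4B
📌 The goal is to make the chart labels clearer by showing the amounts alongside the names.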
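As an alternative way of writing the same step, an f-string can format the number and build the label in one go. This is a sketch with two made-up entries, not a replacement the original author used.

```python
label = ['Membership', 'Total Revenue']
label_value = [10.5, 110.4]  # illustrative amounts, in billions

# zip() pairs each name with the amount at the same position.
label_combined = [f'{name} - ${amount}B' for name, amount in zip(label, label_value)]

print(label_combined)  # ['Membership - $10.5B', 'Total Revenue - $110.4B']
```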
What are source and target?
Each flow in the diagram runs from a source node to a target node:
Membership → Total Revenue
U.S. → Total Revenue
International → Total Revenue
Sam Club → Total Revenue
Total Revenue → COGS
Total Revenue → Gross Profit
Gross Profit → Operating Expenses
Gross Profit → Pretax Profit
source = ( [label.index(sm) for sm in Segments.keys()]
+ [label.index(tr) for tr in Total_Revenue.keys()] * 2
+ [label.index(gp) for gp in GP.keys()] * (len(Total_Operating_Expenses))
+ [label.index(oe) for oe in Total_Operating_Expenses.keys()] * len(Operating_Expenses)
+ [label.index(gp) for gp in GP.keys()]
+ [label.index(nii) for nii in Net_Interest_Income.keys()]
+ [label.index(op) for op in OP.keys()]
+ [label.index(pp) for pp in Pretax_Profit.keys()]
+ [label.index(pp) for pp in Pretax_Profit.keys()]
+ [label.index(at) for at in AfterTax_Revenue.keys()]
)
target = ( [label.index(tr) for tr in Total_Revenue.keys()] * len(Segments)
+ [label.index(cogs) for cogs in COGS.keys()]
+ [label.index(gp) for gp in GP.keys()]
+ [label.index(oe) for oe in Total_Operating_Expenses.keys()]
+ [label.index(oe) for oe in Operating_Expenses.keys()]
+ [label.index(op) for op in OP.keys()] * (len(GP) + len(Net_Interest_Income))
+ [label.index(pp) for pp in Pretax_Profit.keys()] * (len(OP) )
+ [label.index(te) for te in Tax_Expense.keys()]
+ [label.index(np) for np in Net_Profit.keys()]
+ [label.index(np) for np in Net_Profit.keys()]
)
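📌 What this code does: label.index() finds the index of each node.
[label.index(sm) for sm in Segments.keys()]  # loop over each key and store its index in label
For example, the first seven names in label map to:
[0, 1, 2, 3, 4, 5, 6]  # Membership ... Gross Profit
Printing source and target gives:
[0, 1, 2, 3, 4, 4, 6, 8, 8, 6, 11, 7, 13, 13, 15]  # source, paired position by position with target
[4, 4, 4, 4, 5, 6, 8, 9, 10, 7, 7, 13, 14, 16, 16]  # target
Membership (0) flows to Total Revenue (4)
U.S. (1) flows to Total Revenue (4)
International (2) flows to Total Revenue (4)
These indexes are used to draw the arrows (flows) in the Sankey diagram.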
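To see how names turn into indexes, here is a small sketch that writes the flows as (source name, target name) pairs and converts them with label.index(). The `flows` list is a hypothetical intermediate introduced for illustration; the original code builds source and target directly.

```python
label = ['Membership', 'U.S.', 'International', 'Sam Club', 'Total Revenue']

# Hypothetical helper structure: each flow written out by name.
flows = [
    ('Membership', 'Total Revenue'),
    ('U.S.', 'Total Revenue'),
    ('International', 'Total Revenue'),
    ('Sam Club', 'Total Revenue'),
]

# label.index(name) returns the position of name inside label.
source = [label.index(start) for start, _ in flows]
target = [label.index(end) for _, end in flows]

print(source)  # [0, 1, 2, 3]
print(target)  # [4, 4, 4, 4]
```

Writing flows by name first makes it easier to spot a wrong arrow than reading two lists of bare numbers.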
What is value?
value = (
list(Segments.values())
+ list(COGS.values())
+ list(GP.values())
+ list(Total_Operating_Expenses.values())
+ list(Operating_Expenses.values())
+ list(OP.values())
+ list(Net_Interest_Income.values())
+ list(Pretax_Profit.values())
+ list(Tax_Expense.values())
+ list(Net_Profit.values())
+ list(AfterTax_Revenue.values()) )
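📌 What this code does: it supplies the amount for each source → target pair, i.e. the value carried by each arrow. These values determine the width of the arrows in the Sankey diagram.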
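Because source, target, and value are matched purely by position, it can help to print them side by side. The sketch below uses made-up amounts and also sums everything flowing into one node as a quick sanity check.

```python
label = ['Membership', 'U.S.', 'International', 'Sam Club', 'Total Revenue']
source = [0, 1, 2, 3]
target = [4, 4, 4, 4]
value = [10.5, 50.2, 20.1, 15.3]  # illustrative amounts only

# Entry i of each list describes the same arrow.
for s, t, v in zip(source, target, value):
    print(f'{label[s]} -> {label[t]}: {v}')

# Total flowing into the 'Total Revenue' node.
revenue_index = label.index('Total Revenue')
inflow = sum(v for t, v in zip(target, value) if t == revenue_index)
print(round(inflow, 2))
```

If the three lists are not in the same order, the arrows still get drawn but with the wrong widths, so this kind of check is cheap insurance.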
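label → node names
label_value → node amounts
source → where each arrow starts
target → where each arrow ends
value → arrow width (the amount flowing)
import plotly.graph_objects as go  # import the plotly.graph_objects module
✅ go.Sankey()
The import gives the module the short alias "go"; go.Sankey() is what creates the Sankey chart.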
You don't have to use the author's colours; feel free to choose your own from an online colour reference.
color = (
['rgba(173, 216, 230, 0.8)'] * len(Segments) # Light Blue for Revenue Sources
+ ['rgba(250, 128, 114, 0.8)'] * len(COGS) # Salmon for COGS
+ ['rgba(144, 238, 144, 0.8)'] * len(GP) # Light Green for Gross Profit
+ ['rgba(250, 128, 114, 0.8)'] * len(Total_Operating_Expenses) # Salmon for Total Operating Expenses
+ ['rgba(250, 128, 114, 0.8)'] * len(Operating_Expenses) # Salmon for Sub Operating Expenses
+ ['rgba(144, 238, 144, 0.8)'] * (len(OP) + len(Net_Interest_Income)) # Light Green for Operating Profit & Net Interest Income
+ ['rgba(144, 238, 144, 0.8)'] * (len(Pretax_Profit)) # Light Green for Pretax Profit
+ ['rgba(230, 60, 88, 1)'] * len(Tax_Expense) # Red for Tax Expense
+ ['rgba(144, 238, 144, 0.8)'] * len(Net_Profit) # Light Green for Net Profit
+ ['rgba(144, 238, 144, 0.8)'] * len(AfterTax_Revenue) # Light Green for After-Tax Revenue
)
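Why do the colours use len()?
len() returns the number of keys in a dictionary such as Segments, so each colour is repeated once per entry. For example:
Segments = {"Membership": 10.5, "U.S.": 50.2, "International": 20.1, "Sam Club": 15.3}
print(len(Segments))  # prints 4, because there are 4 keys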
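Multiplying a one-element list by len() is what repeats the colour. A minimal sketch with two groups (illustrative names and amounts):

```python
segments = {'Membership': 10.5, 'U.S.': 50.2, 'International': 20.1, 'Sam Club': 15.3}
cogs = {'COGS': 60.8}

light_blue = 'rgba(173, 216, 230, 0.8)'
salmon = 'rgba(250, 128, 114, 0.8)'

# [x] * n builds a list containing x repeated n times.
color = [light_blue] * len(segments) + [salmon] * len(cogs)

print(len(color))  # 5
print(color[0] == color[3], color[4] == salmon)  # True True
```

Using len() instead of a hard-coded number means the colour list keeps the right length if a segment is added or removed later.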
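The link (flow) and node dictionaries
link
link = dict(source = source, target = target, value = value, color = color)
This creates a dictionary named link. dict() here does the same job as the {} used at the start; can you spot the difference? With dict() the keys are written as keyword arguments without quotes, while {} takes quoted string keys. So it can also be written as follows. The link dictionary is used in the final part, when the chart is generated.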
link = {
"source": source,  # store the source list under the key "source"
"target": target,
"value": value,
"color": color
}
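A quick way to convince yourself that the two spellings produce the same dictionary is to build both and compare them. The short lists here are placeholders for illustration.

```python
source = [0, 1]
target = [2, 2]
value = [10.5, 50.2]

# Keyword-argument form: keys are written without quotes.
link_a = dict(source=source, target=target, value=value)

# Literal form: keys are quoted strings.
link_b = {'source': source, 'target': target, 'value': value}

print(link_a == link_b)  # True
```

The dict() form only works when every key is a valid Python identifier, which is the case for source, target, value, and color.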
node
node = dict(label = label_combined, pad=50, thickness=10, color = 'grey')
✅ label_combined
the display names of all the nodes
✅ pad=50
the spacing between nodes
✅ thickness=10
the thickness (width) of each node
✅ color='grey'
the node colour
data = go.Sankey(link = link, node=node)
fig = go.Figure(data)  # go.Figure builds the figure
fig.update_layout(font_size=20)  # adjust the font size
fig.show()  # display the chart
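✅ This uses the Sankey trace (go.Sankey()).
The syntax requires you to describe both link and node. Since the dict() calls were already stored in the link and node variables above, we can simply pass those variables in.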
import plotly.graph_objects as go
data = go.Sankey(
link=dict(
source=[...], # List of source node indexes
target=[...], # List of target node indexes
value=[...], # List of flow values (thickness of connections)
color=[...] # (Optional) List of colors for each flow
),
node=dict(
label=[...], # List of node names
color=[...] # (Optional) List of node colors
)
)
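Since the chart depends on several lists lining up, a small check before calling go.Sankey() can catch mistakes early. The sketch below is plain Python and does not call plotly; `check_sankey_inputs` is a hypothetical helper, and it assumes colours are passed as a list with one entry per link.

```python
def check_sankey_inputs(labels, source, target, value, color=None):
    """Raise ValueError if the lists cannot describe a consistent set of links."""
    if not (len(source) == len(target) == len(value)):
        raise ValueError('source, target and value must have the same length')
    # Only check colours given as a list; a single colour string is left alone.
    if isinstance(color, (list, tuple)) and len(color) != len(source):
        raise ValueError('color must have one entry per link')
    for idx in list(source) + list(target):
        if not 0 <= idx < len(labels):
            raise ValueError(f'node index {idx} has no matching label')
    return True


labels = ['Membership', 'U.S.', 'Total Revenue']  # illustrative data
print(check_sankey_inputs(labels, source=[0, 1], target=[2, 2], value=[10.5, 50.2]))  # True
```

Running this on the real label, source, target, value, and color lists before building the figure would flag a missing entry before it turns into a confusing chart.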
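The chart is finally generated! If you want to fine-tune the positions, you can drag the nodes by hand in Colab 🥰
That wraps up these study notes on breaking the code down step by step! If I had to learn it all over again, I think I would start from the end, work out which data and variables the Sankey syntax needs, and then reason backwards step by step.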