Published 2025-03-26 · Updated 2025-04-01 · Technical Learning / Deep Learning

Precision Recall and F1-score

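An introduction to common evaluation metrics in machine learning, especially for classification tasks, with Python (scikit-learn) code.

Precision, Recall, and F1-score.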

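Basics

Confusion Matrix

For classification tasks (binary or multi-class), a confusion matrix is used to evaluate the performance of a classification model.

Suppose we have N classes and build the confusion matrix from the point of view of a sample a. If the correct class for a is A, then class A is the positive class for a, and every other class is a negative class for a.

If a is correctly assigned to A, the result is True. Otherwise it is False.

For that class we then have:

TP (True Positive): the number of samples the model correctly predicts as positive.
FP (False Positive): the number of negative samples the model wrongly predicts as positive.
TN (True Negative): the number of samples the model correctly predicts as negative.
FN (False Negative): the number of positive samples the model wrongly predicts as negative.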

	Actual Positive	Actual Negative
Predicted Positive	True Positive (TP)	False Positive (FP)
Predicted Negative	False Negative (FN)	True Negative (TN)
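
The four counts can be tallied directly from paired labels. The following is a minimal sketch in plain Python, assuming binary labels with 1 as the positive class. The helper name `count_outcomes` is made up for illustration, and the labels are the same toy values the post uses in its scikit-learn example.

```python
def count_outcomes(y_true, y_pred, positive=1):
    """Tally TP, FP, TN, FN for one positive class (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Toy labels, assumed to use 1 for the positive class
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = count_outcomes(y_true, y_pred)
print(tp, fp, tn, fn)  # 4 1 3 2
```

The four counts always add up to the number of samples, which is a quick sanity check when writing this kind of code by hand.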

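In medical tasks, a false positive places an unnecessary burden on the patient, and for serious diseases this kind of error should be kept to a minimum. A false negative is worse than a false positive: it can delay treatment and let the condition deteriorate, so medical tasks usually focus more on lowering the false negative rate.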

An implementation with Python's scikit-learn, with the result visualized using matplotlib and seaborn:

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

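# Assume we have some predictions and ground-truth labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # predicted labels

# Compute the confusion matrix (rows: true labels, columns: predicted labels)
conf_matrix = confusion_matrix(y_true, y_pred)
# Visualize the confusion matrix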
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()
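
One thing to watch for: scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, which is the transpose of the table above. The sketch below rebuilds that layout in plain Python to make the cell positions explicit. `build_confusion_matrix` is a hypothetical helper written for this illustration, not a library function.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Rows index the true label, columns the predicted label (hypothetical helper)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

matrix = build_confusion_matrix(y_true, y_pred, labels=[0, 1])
print(matrix)  # [[3, 1], [2, 4]]

# With labels ordered [0, 1], the cells read row by row as TN, FP, FN, TP.
(tn, fp), (fn, tp) = matrix
```

This is why the heatmap labels the x-axis "Predicted labels" and the y-axis "True labels".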

Precision

$$
Precision = \frac{TP}{TP+FP}
$$

Of all the samples predicted as positive, the proportion that are actually positive.

from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred)

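If your task puts a premium on precise predictions, for example a medical diagnosis where false positives (such as flagging a healthy person as sick) are not acceptable, then high precision is what you want.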
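
As a worked example, the formula can be applied by hand to the binary toy labels from the confusion matrix section, which give 4 true positives and 1 false positive. This is a sketch only: returning 0.0 when nothing is predicted positive is an assumption made here to avoid a division by zero, and other conventions exist.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # assumption: treat the undefined 0/0 case as 0.0
    return tp / predicted_positive


# Counts derived from the binary toy labels: 4 true positives, 1 false positive
print(precision(tp=4, fp=1))  # 0.8
print(precision(tp=0, fp=0))  # 0.0
```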

Recall

$$
Recall = \frac{TP}{TP+FN}
$$

from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred)

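Of all the actual positive samples in the test set, the proportion that the model correctly predicts as positive.

If your task cares more about catching everything, such as security screening where you want to detect as many dangerous items as possible even at the cost of extra false alarms, then low recall is a problem: a recall of 0.5, for instance, means half of the dangerous items were missed.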
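
To put numbers on the screening scenario, here is a small sketch with two made-up scanners. All counts are hypothetical and chosen only to illustrate the formula.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # assumption: treat the undefined 0/0 case as 0.0
    return tp / actual_positive


# Hypothetical screening run in which 10 dangerous items pass through.
cautious = {"tp": 5, "fp": 0, "fn": 5}      # flags only clear-cut cases, misses half
aggressive = {"tp": 10, "fp": 40, "fn": 0}  # flags anything that looks suspicious

print(recall(cautious["tp"], cautious["fn"]))      # 0.5
print(recall(aggressive["tp"], aggressive["fn"]))  # 1.0
```

The aggressive scanner reaches a recall of 1.0 at the price of 40 false alarms, so its precision is only 10 / 50 = 0.2. In a setting like this that trade is usually considered acceptable.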

F1-score

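F1-score is the harmonic mean of precision and recall. It combines the two and gives a better picture of overall model performance, especially when classes are imbalanced or when both precision and recall matter.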

$$
\text{F1-score} = 2\times \frac{Precision\times Recall}{Precision+Recall}
$$

from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)

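F1-score lies between 0 and 1. The closer it is to 1, the better the model does on both precision and recall.

If the task needs predictions to be accurate and also cannot afford to miss too many positive samples, F1-score is a useful combined metric.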
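
The reason for using a harmonic mean rather than a plain average is that it is dragged down by whichever of the two values is lower. The sketch below compares the two on a hypothetical model that is very precise but misses most positives, then applies the formula to the binary toy labels from the confusion matrix section (precision 4/5, recall 4/6).

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model: every positive prediction is right, but 90% of positives are missed
p, r = 1.0, 0.1
arithmetic_mean = (p + r) / 2
print(round(arithmetic_mean, 4))  # 0.55
print(round(f1(p, r), 4))         # 0.1818

# Binary toy labels: precision 4/5, recall 4/6
print(round(f1(4 / 5, 4 / 6), 4))  # 0.7273
```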

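In Multi-class Tasks

Precision, Recall, and F1-score can be computed in the following ways, which differ in how the classes are weighted and aggregated.

Micro Averaging

Sum the TP, FP, and FN over all classes, then compute Precision, Recall, and F1 from the totals.

When to use: when classes are imbalanced, micro averaging gives every sample the same weight, so it reflects overall accuracy rather than the performance on any individual class.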
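
A plain-Python sketch of the pooling step, using the same labels as the post's multi-class code summary. `micro_scores` is a hypothetical helper. It assumes single-label classification, where each sample has exactly one true class and one predicted class.

```python
def micro_scores(y_true, y_pred):
    """Pool TP/FP/FN over all classes, then compute the metrics once."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    for label in labels:
        for actual, predicted in zip(y_true, y_pred):
            if predicted == label and actual == label:
                tp += 1
            elif predicted == label:
                fp += 1  # predicted as this class, but belongs to another
            elif actual == label:
                fn += 1  # belongs to this class, but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 0, 1, 1]
y_pred = [0, 2, 1, 2, 1, 0, 0, 2, 0, 0, 2, 1]

precision, recall, f1 = micro_scores(y_true, y_pred)
accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)
print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# 0.6667 0.6667 0.6667 0.6667
```

Under the single-label assumption every misclassified sample is one false positive for the predicted class and one false negative for the true class, so micro Precision, Recall, and F1 all coincide with accuracy.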

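Macro Averaging

Compute Precision, Recall, and F1-score for each class first, then take the unweighted mean over all classes.

When to use: when you want to see how evenly the model performs across all classes. Macro averaging ignores class sizes, so a rare class counts as much as a frequent one and the score is easily swayed by how the model does on minority classes.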
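
The sensitivity to minority classes is easiest to see on deliberately lopsided data. The sketch below uses a hypothetical dataset and a model that always predicts the majority class. Both are assumptions for illustration and do not come from the post.

```python
def per_class_recall(y_true, y_pred, label):
    """Recall for one class treated as the positive class."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == label and p == label)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == label and p != label)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical imbalanced data: 95 samples of class 0, 5 of class 1,
# and a model that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

recalls = [per_class_recall(y_true, y_pred, label) for label in (0, 1)]
macro_recall = sum(recalls) / len(recalls)
accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)

print(recalls)       # [1.0, 0.0]
print(macro_recall)  # 0.5
print(accuracy)      # 0.95
```

Accuracy looks fine at 0.95, while macro recall drops to 0.5 because the completely missed minority class carries half of the weight.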

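Weighted Averaging

Take a weighted mean of the per-class Precision, Recall, and F1-score, where each class is weighted by its number of samples. Larger classes therefore have more influence on the result.

When to use: when the class distribution is imbalanced and you want class sizes to be reflected in the score, weighted averaging gives a better picture of overall performance.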
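
A sketch of the weighting step in plain Python, again on the labels from the post's multi-class code summary. `weighted_average` is a hypothetical helper, and only recall is shown to keep the example short.

```python
from collections import Counter


def weighted_average(scores, supports):
    """Average per-class scores, weighting each class by its sample count."""
    total = sum(supports.values())
    return sum(scores[label] * supports[label] for label in scores) / total


y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 0, 1, 1]
y_pred = [0, 2, 1, 2, 1, 0, 0, 2, 0, 0, 2, 1]

supports = Counter(y_true)  # number of true samples per class
correct = Counter(a for a, p in zip(y_true, y_pred) if a == p)
recalls = {label: correct[label] / supports[label] for label in supports}

weighted_recall = weighted_average(recalls, supports)
macro_recall = sum(recalls.values()) / len(recalls)
print(round(weighted_recall, 4))  # 0.6667
print(round(macro_recall, 4))     # 0.6889
```

Class 1 has the most samples (5 of 12) and the lowest recall (0.4), so the weighted figure comes out below the macro one.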

Code summary:

# Import libraries
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Example data (multi-class task)
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 0, 1, 1]  # true labels
y_pred = [0, 2, 1, 2, 1, 0, 0, 2, 0, 0, 2, 1]  # predicted labels

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Compute micro-, macro-, and weighted-averaged Precision, Recall, F1-score
precision_micro = precision_score(y_true, y_pred, average='micro')
recall_micro = recall_score(y_true, y_pred, average='micro')
f1_micro = f1_score(y_true, y_pred, average='micro')

precision_macro = precision_score(y_true, y_pred, average='macro')
recall_macro = recall_score(y_true, y_pred, average='macro')
f1_macro = f1_score(y_true, y_pred, average='macro')

precision_weighted = precision_score(y_true, y_pred, average='weighted')
recall_weighted = recall_score(y_true, y_pred, average='weighted')
f1_weighted = f1_score(y_true, y_pred, average='weighted')

# Print the results
print(f'Precision (Micro): {precision_micro}')
print(f'Recall (Micro): {recall_micro}')
print(f'F1-Score (Micro): {f1_micro}')

print(f'Precision (Macro): {precision_macro}')
print(f'Recall (Macro): {recall_macro}')
print(f'F1-Score (Macro): {f1_macro}')

print(f'Precision (Weighted): {precision_weighted}')
print(f'Recall (Weighted): {recall_weighted}')
print(f'F1-Score (Weighted): {f1_weighted}')

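The average parameter of precision_score(), recall_score(), and f1_score() specifies how the average is computed:

micro: compute the metric globally from counts pooled over all classes.
macro: compute the metric for each class, then take the unweighted mean.
weighted: weighted mean, where each class is weighted by its number of samples.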
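
Written out for Precision, the three options look roughly as follows. The notation is an assumption introduced here: $N$ is the number of classes, $TP_k$ and $FP_k$ are the counts for class $k$, $n_k$ is the number of true samples of class $k$, and $n = \sum_k n_k$. Recall follows the same pattern with $FN_k$ in place of $FP_k$.

$$
Precision_{micro} = \frac{\sum_{k=1}^{N} TP_k}{\sum_{k=1}^{N} (TP_k + FP_k)}
$$

$$
Precision_{macro} = \frac{1}{N}\sum_{k=1}^{N} Precision_k
$$

$$
Precision_{weighted} = \sum_{k=1}^{N} \frac{n_k}{n}\, Precision_k
$$

For F1-score, the macro and weighted variants in scikit-learn average the per-class F1 values. They are not computed from the averaged Precision and Recall, so the two routes can give different numbers.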

https://zhouwentong7.github.io/2025/03/26/Precision-Recall-and-F1-score/

Author: Zhou
Published: 2025-03-26
Updated: 2025-04-01
