I will consider the following supervised-learning classification algorithms: logistic regression, support vector machines (SVM), K-nearest neighbors (KNN), naive Bayes, decision trees, random forests, and extremely randomized trees (also known as extra trees). I will implement them in both Python and R.
Using Python
I will use a package called scikit-learn, which is easy to learn and friendly to developers who did not major in ML.
First, I import the dataset I want to use, the Iris dataset.
from sklearn.datasets import load_iris
Note that if you want to use a dataset that is not built in, you would write
import pandas as pd
data = pd.read_csv('filename.csv', sep='symbol')
sep is usually a comma, ',', but it can also be a different symbol. You may want to explore the data first.
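As a quick illustration of the sep parameter (the column names and values here are made up, and an in-memory string stands in for filename.csv), reading a semicolon-separated file looks like:

```python
import io

import pandas as pd

# A small semicolon-separated dataset (hypothetical; stands in for filename.csv)
csv_text = "sepal_length;sepal_width;species\n5.1;3.5;setosa\n6.2;2.9;versicolor\n"

# sep tells pandas which symbol separates the columns
data = pd.read_csv(io.StringIO(csv_text), sep=';')
print(data.shape)          # (2, 3): two rows, three columns
print(list(data.columns))
```

With the wrong sep, pandas would parse each row as a single column, so checking data.shape and data.columns right after loading is a cheap sanity check.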
Next, I import the classes that contain the models I want to use.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
Finally, I import the function train_test_split, used to split the dataset into training and test sets, and the function cross_val_score, used to find the cross-validated accuracy of a model.
from sklearn.model_selection import train_test_split, cross_val_score
Now, I need to get the data.
data = load_iris()
# You can check the structure of the data
print(data)
Then, I separate the data into features (x) and target (y).
x = data.data
y = data.target
Next, I split both x and y into training and test sets. I take 80% of the original dataset as training data and the remaining 20% as test data.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
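A quick sanity check on the split (a sketch; random_state and stratify are additions not in the original code). Passing random_state makes the split reproducible, and stratify=y keeps the class proportions equal in both sets:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
x, y = data.data, data.target

# Reproducible, stratified 80/20 split of the 150 iris samples
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, random_state=0, stratify=y)

print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
```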
Now I can fit the models. Note that predicted_y is a one-dimensional NumPy array of predicted values for x_test. Also note that test_accuracy is the accuracy of the model's predictions on the test data, while cross_validated_accuracy is the accuracy of the model's predictions on subsets of the training data (I use 5 subsets of the training data by setting cv=5 in cross_val_score).
1. Logistic Regression
# logistic
lr_model = LogisticRegression(max_iter=500)
lr_model.fit(x_train, y_train)
predicted_y = lr_model.predict(x_test)
print(predicted_y)
cross_validated_accuracy = cross_val_score(lr_model, X=x_train, y=y_train, cv=5)
print(cross_validated_accuracy)
test_accuracy = lr_model.score(x_test, y_test)
print(round(test_accuracy, 4))
2. SVM
svm_model = SVC()
svm_model.fit(x_train, y_train)
predicted_y = svm_model.predict(x_test)
print(predicted_y)
test_accuracy = svm_model.score(x_test, y_test)
print(round(test_accuracy, 4))
3. KNN
knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)
predicted_y = knn_model.predict(x_test)
print(predicted_y)
test_accuracy = knn_model.score(x_test, y_test)
print(round(test_accuracy, 4))
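KNeighborsClassifier defaults to n_neighbors=5. Since cross_val_score is already imported, a small sketch of picking k with cross-validation (the candidate values here are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.8, random_state=0)

# Mean cross-validated accuracy for a few candidate k values
scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, x_train, y_train, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 4))
```

This is the same grid idea that caret's tuneLength automates in the R section below.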
4. Naive Bayes
nb_model = MultinomialNB()
nb_model.fit(x_train, y_train)
predicted_y = nb_model.predict(x_test)
print(predicted_y)
test_accuracy = nb_model.score(x_test, y_test)
print(round(test_accuracy, 4))
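One caveat: MultinomialNB is designed for non-negative count features (it happens to run on the iris measurements only because they are non-negative). For continuous features, GaussianNB is the more natural variant; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.8, random_state=0)

# GaussianNB models each feature as normally distributed within each class
nb_model = GaussianNB()
nb_model.fit(x_train, y_train)
test_accuracy = nb_model.score(x_test, y_test)
print(round(test_accuracy, 4))
```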
5. Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(x_train, y_train)
predicted_y = dt_model.predict(x_test)
print(predicted_y)
test_accuracy = dt_model.score(x_test, y_test)
print(round(test_accuracy, 4))
6. Random Forest
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)
predicted_y = rf_model.predict(x_test)
print(predicted_y)
test_accuracy = rf_model.score(x_test, y_test)
print(round(test_accuracy, 4))
7. Extremely Randomized Trees (Extra Trees)
et_model = ExtraTreesClassifier()
et_model.fit(x_train, y_train)
predicted_y = et_model.predict(x_test)
print(predicted_y)
test_accuracy = et_model.score(x_test, y_test)
print(round(test_accuracy, 4))
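Since every scikit-learn model exposes the same fit/score interface, the seven fits above can be collapsed into one loop (a sketch; random_state is an addition for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.8, random_state=0)

models = {
    'logistic': LogisticRegression(max_iter=500),
    'svm': SVC(),
    'knn': KNeighborsClassifier(),
    'naive bayes': MultinomialNB(),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(random_state=0),
    'extra trees': ExtraTreesClassifier(random_state=0),
}

# Fit each model and record its test accuracy
results = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    results[name] = model.score(x_test, y_test)
    print(name, round(results[name], 4))
```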
Using R
I will use the caret package. Note that the cross-validated accuracy of each model is displayed when the model is printed.
First, I import the datasets package, which contains the Iris dataset.
library(datasets)
Metrics is a package containing functions that return various metrics; among them is a function called accuracy, which returns the accuracy of a model's predictions.
library(Metrics)
Note that if you want to use a dataset that is not built in, then you would write
data = read.csv('filename.csv', sep ='symbol')
sep is usually a comma, ',', but it can also be a different symbol. You may want to explore the data first.
Next, I import the caret package, which performs machine-learning operations by drawing on many other packages (which you may not necessarily know about).
library(caret)
Now, I need to get our data.
data = iris
# You can check the structure of the data
print(data)
Using the function createDataPartition, I take 80% of the original dataset as training data and the remaining 20% as test data.
train_index = createDataPartition(y=data$Species, p=0.8, list=FALSE)
Then, I split the data into training and test sets.
train_data = data[train_index,]
test_data = data[-train_index,]
Then, I convert the target variable into a factor, since it was originally a character string; once converted to a factor, it is discretized.
train_data$Species = factor(train_data$Species)
test_data$Species = factor(test_data$Species)
Next, I control the computational nuances of the train function.
# The trainControl function uses 5 k_folds for cross validation, hence number = 5
control = trainControl(method = "cv", number=5)
Next, I fit the models. Note that in the train function, preProcess = c("center", "scale") preprocesses the data so that, for each variable, the mean is subtracted from every data point ("center") and every data point is divided by that variable's standard deviation ("scale"); in other words, preProcess = c("center", "scale") standardizes the data. Also note that tuneLength is an integer denoting the amount of granularity in the tuning-parameter grid.
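This center/scale standardization is the same transformation that scikit-learn's StandardScaler performs; a quick Python sketch of the arithmetic (the toy matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A tiny toy feature matrix: two columns on very different scales
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Subtract each column's mean ("center") and divide by its std ("scale")
scaler = StandardScaler()
x_std = scaler.fit_transform(x)

print(np.allclose(x_std.mean(axis=0), 0.0))  # True: columns now have mean 0
print(np.allclose(x_std.std(axis=0), 1.0))   # True: and standard deviation 1
```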
Note that predicted_y is a vector of predicted values for the test data. Also note that test_accuracy is the accuracy of the model's predictions on the test data.
1. Logistic Regression
logistic_model = train(Species ~.,
data = train_data,
method = "multinom",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(logistic_model)
predicted_y = predict(logistic_model, test_data)
test_accuracy = accuracy(predicted_y, test_data[, ncol(test_data)])
print(sprintf('Test accuracy = %f', test_accuracy))
2. Support Vector Machine
svm_model = train(Species ~.,
data = train_data,
method = "svmLinear",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(svm_model)
predicted_y = predict(svm_model, test_data)
test_accuracy = accuracy(predicted_y, test_data[, ncol(test_data)])
print(sprintf('Test accuracy = %f', test_accuracy))
3. KNN
knn_model = train(Species ~.,
data = train_data,
method = "knn",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(knn_model)
predicted_y = predict(knn_model, test_data)
test_accuracy = accuracy(predicted_y, test_data[, ncol(test_data)])
print(sprintf('Test accuracy = %f', test_accuracy))
4. Naive Bayes
nb_model = train(Species ~.,
data = train_data,
method = "nb",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(nb_model)
predicted_y = predict(nb_model, test_data)
test_accuracy = accuracy(predicted_y, test_data[, ncol(test_data)])
print(sprintf('Test accuracy = %f', test_accuracy))
5. Decision Tree
decision_tree_model = train(Species ~.,
data=train_data,
method = "rpart",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(decision_tree_model)
predicted_y = predict(decision_tree_model, test_data)
test_accuracy = accuracy(predicted_y, test_data[, ncol(test_data)])
print(sprintf('Test accuracy = %f', test_accuracy))
6. Random Forest
random_forest_model = train(Species ~.,
data=train_data,
method = "rf",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(random_forest_model)
predicted_y = predict(random_forest_model, test_data)
print(predicted_y)
test_accuracy = accuracy(predicted_y, test_data[, ncol(test_data)])
print(sprintf('Test accuracy = %f', test_accuracy))
7. Extremely Randomized Trees (Extra Trees)
et_model = train(Species ~.,
data=train_data,
method = "ranger",
trControl=control,
preProcess = c("center", "scale"),
tuneLength = 10)
print(et_model)
predicted_y = predict(et_model, test_data)
print(predicted_y)
test_accuracy = accuracy(predicted_y, test_data[, ncol(test_data)])
print(sprintf('Test accuracy = %f', test_accuracy))
That's it. I think writing this kind of code is quite easy once you spot the repeating pattern in both the Python and the R code; indeed, each language has its own distinctive pattern. Thanks for reading.
New to R? See:
New to machine learning? See:
Introduction to Machine Learning
Not sure which language to use for ML? See: Best Programming Language for ML