ml pipeline1 (Machine Learning Pipeline 1)
First written: 2021-12-23
categories: Python Machine Learning
In machine learning (ML) you can build and use something called a 'pipeline'.
Today I looked into what that is and practiced by writing the code to build one myself.
ML Pipeline basics
Loading the data
import pandas as pd
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
df = pd.read_csv(data_url, names=column_name)
df.info()
<output>
df.head()
<output>
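Before encoding the target it can help to confirm the size of the data and the class balance. A minimal check, assuming the df loaded above:
# The WDBC data should come out as 569 rows and 32 columns (id, diagnosis + 30 features)
print(df.shape)
# count of benign ('B') and malignant ('M') samples
print(df["diagnosis"].value_counts())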
from sklearn.preprocessing import LabelEncoder
x = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values
le = LabelEncoder()
# Encode the diagnosis labels (B/M) as integers with LabelEncoder
y = le.fit_transform(y)
le.classes_
<output>
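LabelEncoder assigns integers to the classes in alphabetical order, so 'B' maps to 0 and 'M' to 1. A quick check of the mapping, continuing from the le fitted above:
# 'B' -> 0, 'M' -> 1 (alphabetical order)
print(le.transform(["B", "M"]))
# and back again
print(le.inverse_transform([0, 1]))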
Splitting off a training set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.20, stratify = y, random_state = 1)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
<output>
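Because stratify=y was passed, the class ratio of y should be (roughly) preserved in both splits. A small check, assuming the arrays from the split above:
import numpy as np

# stratify=y keeps the B/M proportions nearly identical in train and test
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))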
Connecting transformers and an estimator with a pipeline (Pipeline design)
● Scale, PCA (dimensionality reduction)
○ Reduce the 30 features to 2 dimensions
● Logistic regression model
● Wrapper: the pipeline wraps all the steps behind a single estimator interface
● Fit on the training set only (never on the test set)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components = 2),
                        LogisticRegression(solver = "liblinear", random_state = 1))
pipe_lr.fit(x_train, y_train)
<output>
y_pred = pipe_lr.predict(x_test)
print("Test accuracy:", pipe_lr.score(x_test, y_test))
<output>
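make_pipeline names each step after its class in lowercase, so the fitted pieces can be inspected afterwards through named_steps. A minimal sketch, continuing from the pipe_lr fitted above:
# step names are auto-generated: 'standardscaler', 'pca', 'logisticregression'
print(pipe_lr.named_steps.keys())
# e.g. how much of the variance the two principal components retain
print(pipe_lr.named_steps["pca"].explained_variance_ratio_)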
Cross validation
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
import numpy as np
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
df = pd.read_csv(data_url, names=column_name)
df.info()
X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values
le = LabelEncoder()
y = le.fit_transform(y)
print("종속변수 클래스:", le.classes_)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, stratify = y, random_state=1)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(solver="liblinear", random_state=1))
kfold = StratifiedKFold(n_splits = 10, random_state=1, shuffle=True).split(X_train, y_train)
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print("Fold: %2d, class distribution: %s, accuracy: %.3f" % (k+1, np.bincount(y_train[train]), score))
print("\nCV accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_lr,
                         X = X_train,
                         y = y_train,
                         cv = 10,
                         n_jobs = 1)
print("CV accuracy scores: %s" % scores)
print("CV accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
<output>
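cross_val_score reports a single metric (accuracy for a classifier by default). If more than one metric is wanted, cross_validate can be used the same way; a sketch under that assumption, reusing pipe_lr, X_train and y_train from above:
from sklearn.model_selection import cross_validate

# same 10-fold CV, but collecting two metrics at once
results = cross_validate(estimator=pipe_lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         scoring=["accuracy", "roc_auc"],
                         n_jobs=1)
print("CV accuracy: %.3f, CV ROC AUC: %.3f" % (np.mean(results["test_accuracy"]),
                                               np.mean(results["test_roc_auc"])))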