딥러닝의 기본 학습 - Logistic (Regression) Classification -이론,실습-4

Richrad Chung 2020. 8. 27. 13:35

로지스틱 회귀란 무엇인가

로지스틱 회귀(Logistic Regression)는 회귀를 사용하여 데이터가 어떤 범주에 속할 확률을 0에서 1 사이의 값으로 예측하고 그 확률에 따라 가능성이 더 높은 범주에 속하는 것으로 분류해주는 지도 학습 알고리즘이다.

스팸 메일 분류기 같은 예시를 생각하면 쉽다. 어떤 메일을 받았을 때 그것이 스팸일 확률이 0.5 이상이면 spam으로 분류하고, 확률이 0.5보다 작은 경우 ham으로 분류하는 거다. 이렇게 데이터가 2개의 범주 중 하나에 속하도록 결정하는 것을 2진 분류(binary classification)라고 한다.

Classification 알고리즘들 중에서 굉장히 정확도가 높은 알고리즘으로 알려져 있다. 따라서 실제 문제에도 바로 적용해볼 수 있을 정도로 좋은 알고리즘이다. 또한 머신러닝의 Neural Network과 Deep Learning의 중요한 요소로 작용하기 때문에 자세히 알아놓아야 한다.

예를 들어보자

이진 분류 그래프에서 값의 변화의 기준을 선형회귀로 표현할수 있으나, 문제점이 있다.

단순 처리는 문제가 안되지만 값의 변화량이 크게 되면 부정확해 진다.

이를 만족하는 선형이 log형테의 그래프 이고, sigmoid이다, 이 함수를 이용하게 되면 어떤 한 값을 대입시키더라도 0과 1사이의 값을 가지게 된다

$H(X) = \frac{1}{1 + e^{-W^{T}X}} \\ cost(W)=-\frac{1}{m}\sum {\color{blue}y} log(H(x)) + {\color{blue}(1-y)}(log(1-H(x)) \\ W = W - \alpha\frac{\partial}{\partial W}cost(W)$

실습

# Lab 5 Logistic Regression Classifier
import tensorflow as tf
import numpy as np
tf.set_random_seed(777)  # for reproducibility

xy = np.loadtxt('data-03-diabetes.csv', delimiter=',', dtype=np.float32)
x_data = xy[:, 0:-1]
y_data = xy[:, [-1]]

print(x_data.shape, y_data.shape)

# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 8])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([8, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis using sigmoid: tf.div(1., 1. + tf.exp(-tf.matmul(X, W)))
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

# cost/loss function
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) *
                       tf.log(1 - hypothesis))

train = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

# Accuracy computation
# True if hypothesis>0.5 else False
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

# Launch graph
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())

    for step in range(10001):
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})
        if step % 200 == 0:
            print(step, cost_val)

    # Accuracy report
    h, c, a = sess.run([hypothesis, predicted, accuracy],
                       feed_dict={X: x_data, Y: y_data})
    print("\nHypothesis: ", h, "\nCorrect (Y): ", c, "\nAccuracy: ", a)

'''
0 0.82794
200 0.755181
400 0.726355
600 0.705179
800 0.686631
...
9600 0.492056
9800 0.491396
10000 0.490767

...

 [ 1.]
 [ 1.]
 [ 1.]]
Accuracy:  0.762846
'''

저작자표시 비영리 변경금지