/ MODELING

Machine Learning 실습예제 - k-means clustering

Modeling 구성

K-Means Clustering

We will use the following imports for this problem

import os
import numpy as np
import pandas as pd

Part B

Implement a Python function called cost that takes in a cluster and a centroid, and returns the cost of the cluster associated with that centroid. Assume that each point is given as a list of coordinates, and each cluster is given as a list of points. The five data points you worked with in part (a) are given in this format below.

Hint: Recall that $cost(\mu_j)$ is the total squared distance of each $\overrightarrow{x_i}$ in $C_j$ to $\mu_j$. Start by defining the helper function squared_distance.

def squared_distance(x, u):
    """Function to calculate the square of the Euclidian distance between a data point x and its centroid u"""
    return (x[0] - u[0]) ** 2 + (x[1] - u[1]) ** 2
def cost(C, u):
    """Function to calulate the cost of a cluster and its centroid"""
    cost = 0
    for i in C:
        distance = squared_distance(i, u)
        cost = cost + distance
    return cost
x1=[3,10]
x2=[5,76]
x3=[1,8]
x4=[2,9]
x5=[3,78]

C1= [x1, x3, x4]
C2= [x2, x5]

Now, calculate the total cost before the first iteration and after the first iteration of the algorithm, with these five data points. Recall that the total cost $cost(\mu_1, \dots, \mu_k)$ is defined as \(cost(\mu_1, \dots, \mu_k) = cost(\mu_1)+\dots+ cost(\mu_k).\) You should find that the cost has decreased after the first iteration.

before_cost = cost(C1, x1) + cost(C2, x2)
after_cost = cost(C1, [2, 9]) + cost(C2, [4, 77])

print("The total cost before the first iteration is", before_cost)
print("The total cost after the first iteration is", after_cost)
The total cost before the first iteration is 18
The total cost after the first iteration is 8

Part C

Using the data points in k_means_data.csv, follow the prompts below to implement the k-means clustering algorithm with $k=3$ clusters.

We’ll start by reading in the data.

fp = os.path.join("k_means_data.csv")
table = pd.read_csv(fp) #Import the data
data = table.values.tolist() #Put the data in a form that is easier to use

First, write a function to initialize the centroids by choosing $k$ random data points. We’ll use initialize_centroids with $k=3$ since we want to find three clusters.

def initialize_centroids(data, k):
    """Return k randomly selected centroids from among the data points"""
    random_numbers = np.random.choice(len(data), 3, replace = False)
    return [data[random_numbers[0]], data[random_numbers[1]], data[random_numbers[2]]]
centroids = initialize_centroids(data, 3)
centroids
[[2.4257491105903943, 7.9548128907273234],
 [6.705188928008579, 78.00202552760933],
 [3.9966062985857884, 77.1128640020023]]

Next, implement a function that determines the closest centroid for a given data point. The function find_closest_centroid should take as input a single data point as well as a list of centroids. It should return the centroid that’s closest to the given data point.

Hint: You can use the squared_distance function you’ve implemented above.

def find_closest_centroid(point, centroids):
    """Function to determine the closest centroid to a given data point"""
    lst = []
    for i in centroids:
        distance = squared_distance(point, i)
        lst.append(distance)
    minimum = min(lst)
    index = lst.index(minimum)
    return centroids[index]

The function below puts each data point into a group (cluster), based on which of the centroids it is closest to. The groups are labeled 1 through $k$, where group 1 contains the points that are closest to the first centroid in the centroids list.

def define_groups(data, centroids):   
    groups = []
    for point in data:
        closest = find_closest_centroid(point, centroids)       
        groups.append(centroids.index(closest)+1)
    return groups
group_list = define_groups(data, centroids)

Now, we need to relocate the centroids to the average of all points in that group. The function to do that is provided below. You should read through it and make sure you understand how it works.

def update_centroids(data, k, group_list):
    """Function that takes in data, k, and a list of groups and returns a list of new centroids"""
    new_centroids=[]
    for i in np.arange(k)+1:               #for each cluster
        total = np.array([0.0,0.0])
        count = 0
        for j in range(len(data)):          #go through each data point
            if group_list[j]==i:                   #if the data point is in this cluster
                total+=np.array(data[j])                          #add the data point to this cluster's total
                count+=1                                          #and increment the count of points in this cluster
        new_centroid = total/count
        new_centroids.append(list(new_centroid))
    return new_centroids
new_centroids = update_centroids(data, 3, group_list)
new_centroids
[[-1.1198909887743935, 14.547028738295456],
 [6.097314819873166, 78.12828163677702],
 [4.4561307948150395, 77.47263629489235]]

Finally, put together the functions above to run the k-means algorithm to find the centroids that yield the least cost. To make it a little easier, just update the centroids a fixed number of times, $1000$ times. You can think about how you would implement a better stopping condition.

Complete the k_means function below. The inputs are the data, the value of $k$, and the initial centroids. The function should return a list of the centroids after $1000$ updates.

def k_means(data, k, initial_centroids):
    """Implement the k-means algorithm given data points, number of clusters, and initial centroids as input.
    Return a list of centroids with least cost."""
    group_list = define_groups(data, initial_centroids)
    for i in range(1000):
        updated_centriods = update_centroids(data, k, group_list)
        group_list = define_groups(data, updated_centriods)
    return updated_centriods
k_means(data, 3, initialize_centroids(data, 3))
[[1.9920820677628333, 9.118216804679005],
 [5.028058561123173, 77.7011187625188],
 [-4.231864045311622, 19.9758406719119]]

For this part, turn in screenshots of the three functions you wrote:

  • initialize_centroids
  • find_closest_centroid
  • k_means

as well as the output of your k_means function.

x_values = []
y_values = []
for i in data:
    x_values.append(i[0])
    y_values.append(i[1])
data_df = {
    'X': x_values,
    'Y': y_values
}
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data_df, columns = ['X', 'Y'])
df
X Y
0 0.671723 11.737969
1 -3.757326 18.536079
2 5.275413 78.514845
3 4.006203 77.254795
4 -5.764720 20.356741
... ... ...
193 4.050378 77.779269
194 -3.449263 18.837490
195 3.740859 76.031105
196 -5.839885 17.661508
197 1.513687 10.920654

198 rows × 2 columns

centroids_1000 = k_means(data, 3, initialize_centroids(data, 3))
centroids_1000
[[-4.231864045311622, 19.9758406719119],
 [5.028058561123173, 77.7011187625188],
 [1.9920820677628333, 9.118216804679005]]
plt.plot(df['X'], df['Y'], '.')
plt.plot(centroids_1000[0][0], centroids_1000[0][1], 'bo')
plt.plot(centroids_1000[1][0], centroids_1000[1][1], 'ro')
plt.plot(centroids_1000[2][0], centroids_1000[2][1], 'ko')
[<matplotlib.lines.Line2D at 0x190a2d63e80>]

png