(Easy to follow) Introduction to AI with K-means, Python and Google Colaboratory

In this post we will build our first AI project together, using the k-means algorithm in Python to look for clusters in our Spotify data!

What does K-Means mean?

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups).

The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. For example, if K has the value 3, the algorithm tries to find 3 clusters/groups in our data.

The algorithm iteratively assigns each data point to one of K (e.g. 3) groups based on the features that are provided.

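Data points are clustered based on feature similarity: each point ends up in the group whose center (the mean of the points in that group) is closest to it.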
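To make the iterative idea more concrete, here is a minimal sketch of the two alternating steps (assign points, then move centers) in plain Python. It is only an illustration and not how scikit-learn implements k-means: the function name `kmeans_sketch` and the sample points are made up, and the initial centers are simply the first K points, whereas real implementations usually pick them randomly or with k-means++.

```python
import math


def kmeans_sketch(points, k, iterations=10):
    """Toy k-means for a list of (x, y) tuples. Returns (centers, labels)."""
    # Simplification: use the first k points as the initial centers.
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centers, labels


# Hypothetical (danceability, energy) pairs forming two obvious groups.
points = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
          (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
centers, labels = kmeans_sketch(points, k=2)
print(labels)
print(centers)
```

Each entry in `labels` is the index of the group a point was assigned to, and `centers` holds the final mean of each group.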

Steps to use k-means in Google Colaboratory

  1. Go to https://colab.research.google.com/

  2. Create a new Python 3 Notebook.

  3. Download a dataset of your choice from: https://www.kaggle.com/datasets?sortBy=hotness&group=public&page=1&pageSize=20&size=all&filetype=all&license=all

    In this example I will use: https://www.kaggle.com/nadintamer/top-spotify-tracks-of-2018

  1. Save your file (top2018.csv in our case) in Google Drive, next to your Google Colaboratory notebook.

  2. Add a new code section by clicking the plus sign on the Code tab

    Copy & paste the following to set up your notebook, then run it by clicking the play button:

    # Upload files from shared GDrive folder to Google Colab Workbook
    
    !pip install -U -q PyDrive
    
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive
    from google.colab import auth
    from oauth2client.client import GoogleCredentials
    from sklearn.preprocessing import StandardScaler
    
    
    # 1. Authenticate and create the PyDrive client.
    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    
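    # 2. Fetch the CSV from Drive by its file id (use the id of your own file)
    #    and save it in the notebook environment as top2018.csv
    json_import = drive.CreateFile({'id':'1Casrjx_QyieSrAVuJ1-gScduwvZFGXcr'})
    json_import.GetContentFile('top2018.csv')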

    Note: Colab will probably ask you to open another site. On that site you will find the key that you have to enter into the input field that appears for verification.

    Also keep in mind that you have to find out and use the id of your own top2018.csv file. If you use the id shown in json_import = drive.CreateFile({'id':'1Casrjx_QyieSrAVuJ1-gScduwvZFGXcr'}), it probably won't work for you.

    To find the id, locate the file in your Google Drive, right-click it and choose "Get shareable link". Paste the link into the browser and take the id from its URL.

    Have a look at the following image:

    [Image: the file id inside the Google Drive URL]

  3. Add a new code section and import the libraries that we will need later

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()  # for plot styling
    import numpy as np
    import pandas as pd

  4. Add a new code section, load your dataset into a variable and get a first overview of your data with .describe()

    df = pd.read_csv('top2018.csv')
    df.describe()

    You should see something like this:

    [Image: output of df.describe()]

  5. Now you can choose two columns you want to compare

    In this example we will take danceability and energy.

    Therefore we create a new variable that contains only the columns we need.

    df2 = df[['name','danceability', 'energy']]
    df2.head()

    head() shows you something like:

    [Image: output of df2.head()]

  6. It's time to scatter plot our data to see what it looks like

    Add a new code section with the following content:

    scatter1 = plt.scatter(df2['danceability'], df2['energy'])
    # General form: plt.scatter(x, y, s=area, c=colors, alpha=0.5)
    plt.xlabel('Danceability', fontsize=10)
    plt.ylabel('Energy', fontsize=10)

    It will show us the following:

    [Image: scatter plot of danceability vs. energy]

    As we can already see by eye, the data cannot easily be clustered into groups. Nevertheless, let's see what kind of clusters the k-means algorithm will find for us :)

  7. Use the kmeans algorithm to cluster our data into groups:

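    # Use the k-means algorithm on our data
    
    from sklearn.cluster import KMeans
    
    # k-means needs numeric input, so leave out the text column 'name'
    features = df2[['danceability', 'energy']]
    
    kmeans = KMeans(n_clusters=3)
    kmeans.fit(features)
    y_kmeans = kmeans.predict(features)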
    
    
    #scatter our data with colors
    
    plt.scatter(df2['danceability'], df2['energy'], c=y_kmeans, s=50, cmap='viridis')
    
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

    It should show us the following:

    [Image: scatter plot colored by cluster, with the cluster centers marked]

    As we said before, this data does not split into clear clusters (in this case we told k-means to find 3), but you still get an idea of how it works. If you like, you can try other datasets from Kaggle; maybe you will find data that can be clustered more easily.
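One rough way to put a number on "how well do the points sit around their centers" is the within-cluster sum of squared distances, which scikit-learn exposes as `inertia_` on a fitted KMeans object. The sketch below computes the same quantity by hand in plain Python. The function name `inertia` and all points and centers are made up for illustration; they are not taken from top2018.csv.

```python
import math


def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


# Hypothetical (danceability, energy) pairs.
tight = [(0.10, 0.10), (0.12, 0.10), (0.90, 0.90), (0.88, 0.90)]
spread = [(0.10, 0.10), (0.40, 0.30), (0.60, 0.70), (0.90, 0.90)]
centers = [(0.11, 0.10), (0.89, 0.90)]

print(inertia(tight, centers))   # small: every point is close to a center
print(inertia(spread, centers))  # larger: points are smeared between centers
```

A lower value for the same K means the points are packed more tightly around their centers. The value always shrinks as K grows, so it is only meaningful for comparing datasets or runs with the same K.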


    If you run into any issues while following the tutorial, please leave a comment below or feel free to contact me directly on social media :)
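One step that tends to cause trouble is copying the file id out of the shareable link in the setup step. As a small helper, here is a sketch that pulls the id out of the two link shapes I am aware of (`/file/d/<id>/...` and `open?id=<id>`). The function name `drive_file_id` is made up, and Google may use other link formats that this does not cover.

```python
import re
from urllib.parse import urlparse, parse_qs


def drive_file_id(link):
    """Return the file id from a Google Drive share link, or None."""
    parsed = urlparse(link)
    # Shape 1: https://drive.google.com/file/d/<id>/view?usp=sharing
    match = re.search(r"/file/d/([^/]+)", parsed.path)
    if match:
        return match.group(1)
    # Shape 2: https://drive.google.com/open?id=<id>
    ids = parse_qs(parsed.query).get("id")
    return ids[0] if ids else None


# The id from this tutorial; replace the link with the one for your own file.
link = "https://drive.google.com/open?id=1Casrjx_QyieSrAVuJ1-gScduwvZFGXcr"
print(drive_file_id(link))
```

The printed value is what goes into drive.CreateFile({'id': ...}) in the setup code.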
