Exploring Udemy Courses Trends Using Google Big Query


Introduction

Google Big Query is a secure, accessible, fully-manage, pay-as-you-go, server-less, multi-cloud data warehouse Platform as a Service (PaaS) service provided by Google Cloud Platform that helps to generate useful insights from big data that will help business stakeholders in effective decision-making. Google Big Query provides built-in machine learning capability and SQL query engine to write SQL, which can be used for analyzing large datasets. We can develop a secure and highly accessible data warehouse using Google Big Query.

Udemy is one of the most popular online learning platforms. Udemy provides high-quality learning content in design, marketing, development, finance & accounting, IT & software, photography & video, health & wellness, office productivity, etc. in different languages. Udemy is an important source of information for many students, freelancers, and working professionals. Udemy is one of the best platforms to learn Python and React and to prepare for AWS and Azure certification. However, learners might be interested in taking courses from instructors more aligned to their job titles, courses taken by many users, and certified developers like AWS certified, Salesforce certified, and so on. To address this problem, we will build a data warehouse for exploring Udemy course trends and insights using Google Big Query.

Screenshot 2023 03 31 at 8.10.51 PM Exploring Udemy Courses Trends Using Google Big Query

Almost all major cloud service providers, like Google, Amazon, Microsoft, etc., today provide data warehouse tools. Cloud-based data warehouse tools are highly scalable and provide disaster recovery. Using a data warehouse we can store and analyze a large amount of data and produce useful data insights with the help of data visualizations and reports. Well-designed data warehouses deliver high-quality data and improve query performance by properly defining the type of data, using data mining, artificial intelligence, etc., and helping in making smarter decisions.

This article will discuss the approach of building a data warehouse for exploring Udemy course trends and insights using Google Big Query which will help us to identify things such as classifying courses based on instructor job titles, the average rating of all the courses of an instructor, etc.

Learning Objectives

In this article, we will learn:

  1. How to build a data warehouse using Google Big Query
  2. How to use Google Big Query Sandbox
  3. Gain knowledge about creating datasets and tables in Big Query
  4. Querying Udemy data in Big Query SQL query engine

This article was published as a part of the Data Science Blogathon.

Table of Contents

Project Description

This project aims to develop a data warehouse for exploring Udemy course trends and insights using Google Big Query, which will help us to identify things such as classifying courses based on instructor job titles, the average rating of all the courses of an instructor, etc. We will take the Udemy courses and instructor data from Kaggle and download it into our local machine. The data downloaded from Kaggle is in CSV format.

Now, we will create the table inside the dataset in the Google Cloud Platform SQL query engine from the downloaded data. After creating the table, we will format the table schema and perform data cleaning. We can perform querying on imported data to generate useful insights such as classifying courses based on instructor job titles, identifying courses having maximum ratings, instructors whose courses have good ratings, etc.

Currently, we have data from only one source, and we are importing CSV format data through batch ingestion using the Google Cloud Platform UI interface. We can also import data from multiple sources such as Cloud Storage, Azure Storage Account, etc. Apart from importing data through the Google Cloud Platform UI interface, users can also import data using CLI, and REST APIs, using data pipeline options such as Cloud Dataflow, Cloud Dataproc, etc. Google Big Query also supports file formats such as Parquet, Avro, etc., for data loading and processing. Developers can also save, share and run queries in the SQL query engine at the scheduled time.

"

By querying Udemy data, users can determine which courses they should purchase based on course duration, course ratings, instructor job titles, course popularity, etc. Users can save and share these queries. Users can also save the results of these queries to create dashboards using Power BI, Looker Studio, Tableau, etc. Users can also extract more data from Udemy using web scraping techniques and ingest it in Google Big Query SQL query engine to keep the data updated so that users can get more accurate results.

Problem Statement

In this article, we will be using Udemy Courses Data 2023 dataset from Kaggle to develop a data warehouse for exploring Udemy course trends and insights using Google Big Query, which will help us to identify things such as classifying courses based on instructor job titles, the average rating of all the courses of an instructor, classifying courses based on the number of lectures in the course, identifying recently published and modified courses on Udemy, etc.

As already discussed, we can extract more data from Udemy using web scraping techniques as new courses and instructors keep on growing on the Udemy platform. We will create tables inside the dataset in the Google Cloud Platform SQL query engine to import the courses and instructor data downloaded from Kaggle. After table creation, we will perform data cleaning and table schema formatting.

Problem Statement

We can save, share and run queries in the SQL query engine at the scheduled time. Apart from this, we can also save the results of the query execution so that it can be utilized queries to create dashboards using Power BI, Looker Studio, Tableau, etc. This project aims to develop a data warehouse using Udemy data, querying which users can identify recently published and modified courses on Udemy, classify courses based on course duration and course ratings, identify average ratings of all the courses of an instructor, classify courses based on the number of lectures in the course, etc.

Prerequisites

Below are some prerequisites to undertake this project:

  1. Understanding of Data Warehouse: In this project, we will build a data warehouse to explore Udemy course trends and insights using Google Big Query. Therefore, understanding what a data warehouse is, why a data warehouse is useful, and what the data warehouse provides by various cloud vendors, etc., are important.
  2. Experience with Google Cloud Platform: We will use Google Big Query, a data warehouse service available inside the Google Cloud Platform. So, experience with the Google Cloud Platform is important to easily navigate the platform and understand the resource creation process, roles & access permissions, etc.
  3. Experience with SQL queries: We will be writing queries in the SQL query engine to generate useful insights, such as classifying courses based on instructor job titles, identifying courses having maximum ratings, instructors whose courses have good ratings, etc.
  4. Familiarity with Udemy and Kaggle: Understanding what Kaggle is, how it is useful for downloading datasets, and basic familiarity with the online learning platform Udemy will be helpful while developing the project.
  5. Understanding of Google Big Query: As this project utilizes Google Big Query for creating a data warehouse, it would be beneficial to have an understanding of Google Big Query’s common data operations, concepts, and techniques.

Knowing about the Dataset

In this article, we will be using Udemy Courses Data 2023 dataset from Kaggle. The dataset can be downloaded by visiting https://www.kaggle.com/datasets/ankushbisht005/udemy-courses-data-2023. The goal behind using this dataset is to identify recently published and modified courses on Udemy, classify courses based on course duration and course ratings, identify average ratings of all the courses of an instructor, classify courses based on the number of lectures in the course, etc.

The Udemy Courses Data 2023 dataset has two files named courses.csv and instructors.csv. The courses.csv contains information related to the Udemy courses. The instructors.csv contains the information related to the Udemy instructors. The courses.csv contains 11 columns and 83,105 rows. The instructors.csv contains 10 columns and 32,234 rows. The courses.csv contains the instructors_id column, which gives the id of the instructor of the course. The instructors_id column is used to form the relation between courses.csv and instructors.csv.

Knowing about the dataset

The courses.csv contains the unique id of the course, the course title, course rating, course duration, the number of lectures in the Udemy course, the URL of the course, the creation date of the course, the date on which the course was last modified, number of reviews of the course and id of the course instructor. The instructors.csv contains the unique id of the instructor, the name of the course instructor, the display name of the course instructor, the title of the course instructor, the job title of the course instructor, the instructor class, the URL of the instructor, initials of the course instructor, 50 X 50 image of the instructor and 100 X 100 image of the instructor. To learn more about the dataset, visit https://www.kaggle.com/datasets/ankushbisht005/udemy-courses-data-2023.

Approach to the Project

In this project, we will be using Udemy Courses Data 2023 dataset from Kaggle to develop a data warehouse for exploring Udemy course trends and insights using Google Big Query, which will help us to identify things such as classifying courses based on instructor job titles, the average rating of all the courses of an instructor, classifying courses based on the number of lectures in the course, identifying recently published and modified courses on Udemy, etc.

Follow the below steps to create a data warehouse using Udemy Courses Data 2023 dataset from Kaggle:

Step 1: Create a New Project using Big Query Sandbox

To work with Google Big Query, developers can either create an account on the Google Cloud Platform or utilize the Google Big Query Sandbox. I will use Google Big Query Sandbox in this article to create a data warehouse. The project is used for organizing all the Google cloud resources in GCP. Using Identity and Access Management, we can specify which user is authorized to access which resources in a project.

Visit the below link to use the Google Big Query Sandbox: https://console.cloud.google.com/bigquery

Now, follow the steps described below:

1. Click on NEW PROJECT, then Provide the Project Name as Udemy-Project and Location on the next screen. Click CREATE.

Step 1: Create a new Project using Big Query Sandbox
"

2. Udemy-Project is successfully created. Select the Udemy-Project to view the project and manage user permissions and resources inside the project.

Google Big Query | trends

Step 2: Download the Dataset from Kaggle and Save it on the Local Machine

Visit https://www.kaggle.com/datasets/ankushbisht005/udemy-courses-data-2023 and click Download. After unzipping the downloaded zip file, you will find two CSV files named  courses.csv and instructors.csv. The courses.csv contains information related to the Udemy courses. The instructors.csv contains the information related to the Udemy instructors. The courses.csv contains 11 columns and 83,105 rows. The instructors.csv contains 10 columns and 32,234 rows. The instructors_id column is used to form the relation between courses.csv and instructors.csv.

Google Big Query

Step 3: Creating Dataset Inside Google Big Query Resource

Follow the steps described below to create a dataset inside Google Big Query:

1.     Select the name of the Project -> Big Query in the resources card -> Click Create dataset.

Google Big Query

2.     Provide Udemy_dataset as Dataset ID, choose Region in Location Type, choose Asia-south1 (Mumbai) as Region, and enable table expiration.

"

3.     Click CREATE DATASET

CREATE DATASET

Step 4: Create Tables in the Dataset Inside Google Big Query Resource

Follow the steps described below to create tables in the dataset inside Google Big Query:

1.     Select Udemy_dataset dataset -> Create table

trends

2.     Choose to create table from upload, select the courses.csv file downloaded from Kaggle, select file format as CSV, provide courses as table name, Native table as a table type, choose Auto to detect in the schema, and partition and cluster settings as per our requirements. In the Advance options, provide 1 in the header rows to skip and choose Encryption suitable as per the requirement. Click CREATE TABLE.

Google big Query

3.     Now, again select the Udemy_dataset dataset
-> Create table. Choose to create a table from upload, select the instructors.csv file downloaded from Kaggle, select file format as CSV, provide instructors as table name, Native table as a table type, choose Auto to detect in the schema, and partition and cluster settings as per our requirements. In the Advance options, provide 1 in the header rows to skip and choose Encryption suitable as per the requirement. Click CREATE TABLE.

"

Step 5: Verifying Tables Schema and Previewing Data

Go to the courses table, and cross-verify the field name, type, and mode in the schema tab. View the row access policies of the courses table and edit table schema, if required. View the table info in the DETAILS tab and edit the details in case of corrections. We can also preview, copy, refresh, and share the data. Similarly, go to the instructors’ table, and cross-verify the field name, type, and mode in the schema tab. View the row access policies of the instructors’ table and edit the table schema if required.

"
"

To see 5000 records from the courses table, execute the below query in the SQL query engine:

SELECT * FROM `udemy-project-381211.Udemy_dataset.courses` LIMIT 5000
trends

To see 5000 records from the instructors’ table, execute the below query in the SQL query engine:

SELECT * FROM `udemy-project-381211.Udemy_dataset.instructors` LIMIT 5000
trends

A. Find the title of all courses whose ratings are greater than 4.5 and more than 10000 persons has given the rating for these courses. Display these courses in decreasing order of course ratings and creation date.

SELECT title AS course_title FROM `udemy-project-381211.Udemy_dataset.courses` 
WHERE rating>4.5 and num_reviews>10000
ORDER BY rating DESC, created DESC
"

B. Find the details of the 10 newly created Udemy courses.

SELECT  * FROM `udemy-project-381211.Udemy_dataset.courses` 
ORDER BY created DESC
LIMIT 10
Google big Query

C.    Find the details of the 10 recently modified Udemy courses.

SELECT  * FROM `udemy-project-381211.Udemy_dataset.courses` 
ORDER BY last_update_date DESC
LIMIT 10
"

D. Find the details of the JavaScript courses whose ratings are greater than 4 and more than 20000 persons have given the rating for these courses.

SELECT *  FROM `udemy-project-381211.Udemy_dataset.courses` 
WHERE title LIKE '%JavaScript%' AND
rating>4 AND num_reviews>20000
"

E. Display the title, rating, and number of lectures of the Udemy React courses which has greater than 50-course lectures.

SELECT title AS course_title, rating AS course_rating, num_published_lectures as course_lectures  
FROM `udemy-project-381211.Udemy_dataset.courses` 
WHERE title LIKE '%React%' AND
num_published_lectures>50
"

F. Find the number of courses, and course instructor name developed by the course instructors with course ratings greater than average ratings of the courses.

SELECT COUNT(courses.id), instructors.name
FROM `Udemy_dataset.instructors` instructors
LEFT JOIN `Udemy_dataset.courses` courses
ON instructors.id = courses.instructors_id
WHERE courses.instructors_id IN 
(SELECT instructors_id FROM `Udemy_dataset.courses` 
WHERE rating >(SELECT AVG(rating) FROM `Udemy_dataset.courses`))
GROUP BY instructors.name
"

G. Display the course instructor name and title of the Udemy courses created by people whose job title is a web developer
and whose course ratings are greater than 4.2.

SELECT instructors.display_name, courses.title as course_title
FROM `Udemy_dataset.instructors` instructors
LEFT JOIN `Udemy_dataset.courses` courses
ON instructors.id = courses.instructors_id
WHERE instructors.job_title LIKE '%Web developer%' and courses.rating>4.2
trends

H. Display the course title, course instructor name, ratings, and course duration of the Udemy courses where the course duration is greater than 40 mins, 40 hours, or 40 questions.

SELECT courses.title as course_title, 
instructors.display_name as course_instructor, courses.rating, courses.duration
FROM `Udemy_dataset.instructors` instructors
LEFT JOIN `Udemy_dataset.courses` courses
ON instructors.id = courses.instructors_id
WHERE 
CASE WHEN courses.duration LIKE '%.%'
      THEN CAST(LEFT(courses.duration, STRPOS(courses.duration,'.')-1) AS FLOAT64)>40
    WHEN courses.duration LIKE '%total%'
      THEN CAST(LEFT(courses.duration, STRPOS(courses.duration,'t')-1) AS FLOAT64)>40
    WHEN courses.duration LIKE '%ques%'
      THEN CAST(LEFT(courses.duration, STRPOS(courses.duration,'q')-1) AS FLOAT64)>40
END
"

I.     Display the course instructor name and title of the Udemy courses created by certified developers.

SELECT courses.title as course_title, instructors.display_name as course_instructor
FROM `Udemy_dataset.instructors` instructors 
LEFT JOIN `Udemy_dataset.courses` courses
ON instructors.id = courses.instructors_id
WHERE instructors.job_title LIKE '%certified%'
Google big Query

J.     Find all the distinct job titles of Udemy course instructors.

SELECT DISTINCT instructors.job_title
FROM `Udemy_dataset.instructors` instructors
Google big Query | trends

K. Find the title, ratings, and instructor of all courses whose ratings are greater than 4 and more than 17000 persons have given the rating for these courses. Display these courses in decreasing order of course ratings.

SELECT courses.title as course_title, instructors.display_name as course_instructor, courses.rating
FROM `Udemy_dataset.instructors` instructors
LEFT JOIN `Udemy_dataset.courses` courses
ON instructors.id = courses.instructors_id
WHERE courses.rating > 4 and courses.num_reviews > 17000
ORDER BY courses.rating DESC
Google big Query | trends

L. Find the details of the 20 newly created Azure Udemy courses.

SELECT  * FROM `udemy-project-381211.Udemy_dataset.courses` 
WHERE title LIKE '%Azure%'
ORDER BY created DESC
LIMIT 20
Google big Query

M. Find the details of the 15 newly created AWS Udemy courses.

SELECT  * FROM `udemy-project-381211.Udemy_dataset.courses` 
WHERE title LIKE '%AWS%'
ORDER BY created DESC
LIMIT 15
Google big Query | trends

N. Display all the details of the Udemy SAS courses which have course lectures between 112 and 156 in increasing order of course title.

SELECT *  FROM `udemy-project-381211.Udemy_dataset.courses` 
WHERE title LIKE '%SAS %' AND
num_published_lectures BETWEEN 112 AND 156
ORDER BY title
Google big Query

O. Display the course instructor name, title, ratings, and the course reviews of the top two Udemy Azure Data Factory courses based on course ratings and the number of course reviews.

SELECT courses.title as course_title, 
instructors.display_name as course_instructor, courses.rating, courses.num_reviews
FROM `Udemy_dataset.instructors` instructors
LEFT JOIN `Udemy_dataset.courses` courses
ON instructors.id = courses.instructors_id
WHERE courses.title LIKE '%Azure Data Factory %'
ORDER BY courses.num_reviews DESC, courses.rating DESC  
LIMIT 2
Google big Query

P. Display the course instructor name, title, ratings, and course reviews of the best Udemy Salesforce course based on course ratings and the number of course reviews.

SELECT courses.title as course_title, instructors.display_name as course_instructor, 
courses.rating, courses.num_reviews
FROM `Udemy_dataset.instructors` instructors
LEFT JOIN `Udemy_dataset.courses` courses
ON instructors.id = courses.instructors_id
WHERE courses.title LIKE '%Salesforce %'
ORDER BY courses.num_reviews DESC, courses.rating DESC  
LIMIT 1
Google big Query

From the above, we know how to build a data warehouse for exploring Udemy course trends and insights using Google Big Query. Below are some key trends and insights discovered while exploring the Udemy courses data:

1. The most popular JavaScript courses have an average rating greater than 4.6.

2. Only 34 Udemy courses are created by instructors whose job title is a web developer and whose course ratings are greater than 4.2.

3. Almost 150 Udemy courses are created by AWS, Azure, GCP, or Salesforce-certified developers.

4. Ramesh Retnasamy creates the most popular Azure Data Factory course on Udemy.

5. Recently created Azure and AWS courses are very popular on Udemy.

6. Udemy users prefer to enroll in SAS courses with about 100-150 lectures with good ratings.

Conclusion

In this article, we have seen how to build a data warehouse for exploring Udemy course trends and insights using Google Big Query. A data warehouse stores and analyze a large amount of data and produce useful data insights with the help of data visualizations and reports. We have seen how to create a table by importing data from Kaggle in Google Big Query. We also understand how to create relationships between tables to understand data better. We looked at how to analyze the data with the help of queries to get meaningful insight from the data. Below are the major takeaways from the above article:

  1. We have seen how we can create tables in Google Big Query.
  2. We understood how to query data in the Big Query SQL query engine.
  3. We have also identified details of the Udemy courses created by people whose job title is a web developer and whose course ratings are greater than 4.2.
  4. We have also seen how many courses on Udemy are created by certified developers.
  5. We have figured out the newly created Azure and AWS courses on Udemy basis the trends.
  6. Apart from that, we have also seen other course trends on Udemy by exploring Udemy data inside the SQL query engine.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 



Source link

Leave a Comment