This project analyzes the Google Play Store apps dataset, which is available on Kaggle. The analysis covers data cleaning, exploratory data analysis, and visualization.
The dataset contained missing values and duplicates, which were handled using pandas methods.
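A minimal sketch of this kind of cleanup is shown below. The write-up does not say whether missing values were dropped or imputed, so dropping is an assumption here, and the small DataFrame and its column names (App, Rating, Installs) are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the column names are assumptions.
apps = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.1, 4.1, None, 3.8],
    "Installs": [1000, 1000, 500, 50],
})

# Remove rows that are exact duplicates of an earlier row.
deduped = apps.drop_duplicates()

# Drop rows with a missing rating (assumed strategy; imputing with
# fillna is the usual alternative).
cleaned = deduped.dropna(subset=["Rating"]).reset_index(drop=True)

print(cleaned)
```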
The analysis includes the following findings:
The analysis is presented in two Streamlit apps, which can be accessed using the links provided above.
Model Selection and Hyperparameter Tuning
We chose the Support Vector Machine (SVM) algorithm as our model because it handles high-dimensional data well and is effective for classification tasks. We implemented the model with scikit-learn’s SVC class, which lets us specify the kernel function and other hyperparameters.
To optimize the performance of our SVM model, we employed a pipeline that consists of several steps:
1. Custom Transformer: A user-defined step that transforms the data before scaling. We passed the parameter “5” to the Custom Transformer class to configure the transformation.
2. StandardScaler: This step scales the data to have zero mean and unit variance, which is an important step for many machine learning algorithms to work effectively.
3. GridSearchCV: This step performs grid search with cross-validation to find the best hyperparameters for the SVM model. We searched over the hyperparameters ‘C’, ‘kernel’, and ‘gamma’, using the ranges of values specified in the ‘param_grid’ dictionary. Each combination was evaluated with 5-fold cross-validation, and the best hyperparameters were selected by the mean score on the held-out folds (mean_test_score).
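The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the CustomTransformer shown here is a hypothetical stand-in that keeps the first n columns, the data is synthetic, and the values in param_grid are placeholders.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class CustomTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: keeps the first `n_features` columns."""

    def __init__(self, n_features=5):
        self.n_features = n_features

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.asarray(X)[:, : self.n_features]


# Synthetic data in place of the real features and labels.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Placeholder search ranges; the original values are not given.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", 0.1],
}

pipeline = Pipeline([
    ("custom", CustomTransformer(5)),
    ("scaler", StandardScaler()),
    # The grid search is the final step, as described in the list above.
    ("grid", GridSearchCV(SVC(), param_grid, cv=5, return_train_score=True)),
])
pipeline.fit(X_train, y_train)

grid = pipeline.named_steps["grid"]
print("Held-out accuracy:", pipeline.score(X_test, y_test))
```

With the search as the last pipeline step, the scaler is fitted once on the full training set before the folds are split. Wrapping the whole pipeline in GridSearchCV instead would refit the scaler inside each fold.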
After defining the pipeline, we fit it on the training data using the fit method. We then printed the results for each combination of hyperparameters tested in the grid search using a for loop. The loop iterated over the mean_train_score, mean_test_score, and params values in the cv_results_ attribute of the fitted GridSearchCV object. Note that mean_train_score is only recorded when GridSearchCV is created with return_train_score=True.
Finally, we printed the best score and best hyperparameters for the SVM model found during the grid search. This allowed us to select the optimal hyperparameters for our SVM model, which helped to improve its performance on the test data.
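The reporting described above might look like the sketch below. It is self-contained and therefore uses synthetic data, omits the custom transformer, and uses a small placeholder grid, so the printed numbers say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data in place of the real training set.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

search = GridSearchCV(
    SVC(),
    {"C": [1, 10], "kernel": ["linear", "rbf"], "gamma": ["scale"]},
    cv=5,
    return_train_score=True,  # required for mean_train_score
)
pipeline = Pipeline([("scaler", StandardScaler()), ("grid", search)])
pipeline.fit(X, y)

grid = pipeline.named_steps["grid"]
results = grid.cv_results_

# One line per hyperparameter combination tried by the search.
for train, test, params in zip(
    results["mean_train_score"], results["mean_test_score"], results["params"]
):
    print(f"train={train:.3f}  cv={test:.3f}  {params}")

print("Best CV score:", grid.best_score_)
print("Best hyperparameters:", grid.best_params_)
```

Comparing the train and cross-validation columns side by side is a quick way to spot combinations that overfit.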
Overall, grid search with cross-validation is a common and effective way to tune a model's hyperparameters.
This analysis provides insights into the Google Play Store apps dataset and can be used by developers and businesses to make data-driven decisions.