Cross-validation used in the project

Cross-validation is invaluable in machine learning and model development because it gives a more robust assessment of a model's performance than a single evaluation on the training data. By repeatedly partitioning the dataset into training and testing subsets, it helps detect issues like overfitting and checks that the model generalizes to unseen data. The technique also supports hyperparameter tuning and model selection, and it estimates how well the model is likely to perform in real-world applications. Averaging over several splits makes performance metrics more reliable and less dependent on one particular split, which makes cross-validation an essential practice for building effective and trustworthy machine learning models.
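To make the repeated partitioning concrete, here is a minimal sketch using scikit-learn's KFold. The toy array X_toy and the choice of five folds are assumptions for illustration only and are not taken from the project data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples stand in for the rows of a real dataset (illustrative only).
X_toy = np.arange(10).reshape(-1, 1)

# Five shuffled folds: each sample lands in the test subset exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how often each sample is used for testing across all folds.
test_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy), start=1):
    test_counts[test_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every sample is held out once and used for training in the other four folds, so the final performance estimate draws on the whole dataset.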
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns # Import Seaborn

# Feature Engineering
df['Interaction'] = df['% OBESE'] * df['% INACTIVE']
df['% OBESE_squared'] = df['% OBESE'] ** 2
df['% INACTIVE_squared'] = df['% INACTIVE'] ** 2

# Define dependent variable and independent variables
X = df[['% OBESE', '% INACTIVE', 'Interaction', '% OBESE_squared', '% INACTIVE_squared']]  # Add new features
y = df['% DIABETIC']  # Replace with the actual column name

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical features (scaling)
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Create and fit the Linear Regression model in a pipeline
linear_model = Pipeline(steps=[
    ('preprocessor', numerical_transformer),
    ('model', LinearRegression())
])

# Fit the model on the training data
linear_model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = linear_model.predict(X_test)

# Calculate the Mean Squared Error (MSE) on the testing set
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")

# Calculate the R-squared value
r2 = r2_score(y_test, y_pred)

print(f"R-squared: {r2:.2f}")

# Create a DataFrame containing X_test and predictions for visualization
results_df = pd.DataFrame({'% OBESE': X_test['% OBESE'],
                           '% INACTIVE': X_test['% INACTIVE'],
                           'Prediction': y_pred})

# Create a heatmap to visualize the relationships
heatmap_data = results_df.pivot_table(index='% OBESE', columns='% INACTIVE', values='Prediction')
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Heatmap of Predictions (Linear Regression)')
plt.xlabel('% INACTIVE')
plt.ylabel('% OBESE')
plt.show()
Mean Squared Error: 0.43
R-squared: 0.36
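After using cross-validation (here, evaluating on a held-out 20% test split), the R-squared value actually decreased, to 0.36. This suggests that the earlier, higher score partly reflected overfitting to the data, and that the held-out score is a more realistic estimate of how well the model generalizes.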
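The code above scores the model on a single 80/20 split. A k-fold version of the same evaluation could be sketched roughly as follows. The synthetic data (X_demo, y_demo), the five folds, and the name demo_model are assumptions for illustration, so the numbers it prints say nothing about the project's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): two predictors and a noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 0.5 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Same kind of pipeline as in the post: scaling followed by linear regression.
demo_model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])

# Five shuffled folds; the scaler is refit on each training fold,
# so no information from the test fold leaks into preprocessing.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
r2_scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring='r2')

print("R-squared per fold:", np.round(r2_scores, 2))
print(f"Mean R-squared: {r2_scores.mean():.2f} (std {r2_scores.std():.2f})")
```

Reporting the mean and spread across folds shows how much the score depends on which rows happen to fall in the test set, which a single split cannot reveal.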
