import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np
# Feature Engineering
# df is assumed to be a pandas DataFrame that already contains the
# '% OBESE', '% INACTIVE', and '% DIABETIC' columns
df['Interaction'] = df['% OBESE'] * df['% INACTIVE']
df['% OBESE_squared'] = df['% OBESE'] ** 2
df['% INACTIVE_squared'] = df['% INACTIVE'] ** 2
# Define your dependent variable and independent variables
X = df[['% OBESE', '% INACTIVE', 'Interaction', '% OBESE_squared', '% INACTIVE_squared']]  # Add new features
y = df['% DIABETIC']  # Replace with the actual column name
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for numerical features (scaling)
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
# Create and fit the model in a pipeline
model = Pipeline(steps=[
    ('preprocessor', numerical_transformer),
    ('model', LinearRegression())
])
# Estimate the MSE with 5-fold cross-validation
# (scikit-learn reports the negative MSE, so negate it to get the MSE)
cv_mse_scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mse_cv_mean = cv_mse_scores.mean()
print(f"Cross-Validation Mean Squared Error: {mse_cv_mean:.2f}")
# Fit the model on the training data
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Calculate the MSE on the testing set
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
# Visualize the results for each independent variable
independent_variables = ['% OBESE', '% INACTIVE', 'Interaction', '% OBESE_squared', '% INACTIVE_squared']
scaler = model.named_steps['preprocessor'].named_steps['scaler']
coefficients = model.named_steps['model'].coef_
intercept = model.named_steps['model'].intercept_
for i, variable in enumerate(independent_variables):
    # Scatter plot of the actual test data
    plt.scatter(X_test[variable], y_test, color='blue', label='Actual Data')
    # Partial regression line: vary this feature over its observed range and
    # hold the other features at their training means (zero after scaling)
    x_range = np.linspace(X_test[variable].min(), X_test[variable].max(), num=100)
    x_scaled = (x_range - scaler.mean_[i]) / scaler.scale_[i]
    y_range = intercept + coefficients[i] * x_scaled
    plt.plot(x_range, y_range, color='red', label='Regression Line')
    plt.xlabel(variable)
    plt.ylabel('% DIABETIC')
    plt.legend()
    plt.show()
The cross-validation mean squared error (CV MSE) is 0.36, and the mean squared error (MSE) on the testing set is 0.43. Both values measure how closely the model predicts the ‘% DIABETIC’ target variable, and lower values indicate better predictive performance.
Here’s how you can interpret these results:
Cross-Validation MSE (CV MSE): This metric averages the MSE over the 5 cross-validation folds, so it depends less on any single split and gives a more robust estimate of the model’s performance. A CV MSE of 0.36 means that the squared difference between the predicted and actual ‘% DIABETIC’ values averages 0.36 across the folds. Note that the code runs cross-validation on the full dataset (X, y), so the testing rows are included in this estimate.
MSE on the Testing Set: This metric shows how well the model generalizes to data it did not see during fitting (the 20% held out as the testing set). An MSE of 0.43 means that the squared prediction error on the testing set averages 0.43.
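Because MSE is expressed in squared units, it can be easier to read after taking the square root, which gives the root mean squared error (RMSE) in the same units as ‘% DIABETIC’ (percentage points). A minimal sketch using the two values reported above; the names rmse_cv and rmse_test are only illustrative:

```python
import math

# MSE values reported above (in squared percentage points)
mse_cv_mean = 0.36
mse_test = 0.43

# RMSE is on the same scale as the '% DIABETIC' target
rmse_cv = math.sqrt(mse_cv_mean)
rmse_test = math.sqrt(mse_test)

print(f"Cross-Validation RMSE: {rmse_cv:.2f}")  # prints 0.60
print(f"Test RMSE: {rmse_test:.2f}")            # prints 0.66
```

In other words, a typical prediction is off by roughly 0.6 to 0.7 percentage points of ‘% DIABETIC’.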
In both cases, a lower MSE is preferred, as it indicates more accurate predictions. To try to reduce the MSE further, you can explore other algorithms, tune hyperparameters, or collect more data if feasible.
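As one possible way to try a different algorithm with hyperparameter tuning, the sketch below compares the LinearRegression pipeline against a Ridge pipeline whose alpha is chosen by grid search. Ridge, the alpha grid, and the synthetic demo DataFrame are all assumptions made for illustration and are not part of the original analysis; replace demo with your own df to get meaningful numbers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data so the example runs on its own
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    '% OBESE': rng.uniform(10, 40, n),
    '% INACTIVE': rng.uniform(10, 40, n),
})
demo['% DIABETIC'] = (
    2 + 0.15 * demo['% OBESE'] + 0.10 * demo['% INACTIVE'] + rng.normal(0, 0.6, n)
)
X_demo = demo[['% OBESE', '% INACTIVE']]
y_demo = demo['% DIABETIC']

# Baseline: scaler + ordinary least squares, scored by 5-fold CV
baseline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
baseline_mse = -cross_val_score(
    baseline, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error'
).mean()

# Alternative: scaler + Ridge, with alpha tuned by grid search (5-fold CV)
ridge = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', Ridge()),
])
search = GridSearchCV(
    ridge,
    param_grid={'model__alpha': [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X_demo, y_demo)
ridge_mse = -search.best_score_

print(f"LinearRegression CV MSE: {baseline_mse:.2f}")
print(f"Ridge CV MSE: {ridge_mse:.2f} (best alpha: {search.best_params_['model__alpha']})")
```

Scoring both candidates with the same cross-validation setup keeps the comparison fair; a lower CV MSE for one of them is a reason to prefer it, and the held-out testing set can then be used once for a final check.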