Project 2 Report
project analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Extract year and month from the 'date' column for further analysis
data['date'] = pd.to_datetime(data['date'])
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
# Plotting the temporal trends
plt.figure(figsize=(20, 7))
# Plotting yearly trends
plt.subplot(1, 2, 1)
sns.countplot(data=data, x='year', palette="viridis")
plt.title("Yearly Trend of Police Shootings")
# Plotting monthly trends (aggregated over all years)
plt.subplot(1, 2, 2)
sns.countplot(data=data, x='month', palette="viridis")
plt.title("Monthly Distribution of Police Shootings (All Years)")
# Display the plots
plt.tight_layout()
plt.show()
Here’s the analysis of police shootings based on the days of the week:
Incidents are fairly evenly distributed across the days of the week. A slight increase is observed on Wednesdays and Fridays, while Sundays tend to have slightly fewer incidents compared to other days.
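The day-of-week breakdown itself is not produced by the snippet above, which only extracts year and month. A minimal sketch of how the day-of-week counts could be computed with pandas, assuming the same data frame and the 'date' column parsed above:

# Count incidents by day of the week (sketch; assumes data['date'] is already datetime)
data['day_of_week'] = data['date'].dt.day_name()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
print(data['day_of_week'].value_counts().reindex(day_order))

plt.figure(figsize=(10, 5))
sns.countplot(data=data, x='day_of_week', order=day_order, palette="viridis")
plt.title("Police Shootings by Day of the Week")
plt.show()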
project analysis
# Filling missing values based on the mentioned strategies
# Filling 'flee' with 'unknown'
data['flee'].fillna('unknown', inplace=True)
# Filling 'age' with the median
data['age'].fillna(data['age'].median(), inplace=True)
# Filling 'name' with 'Unknown'
data['name'].fillna('Unknown', inplace=True)
# Filling 'armed' with 'undetermined'
data['armed'].fillna('undetermined', inplace=True)
# Filling 'city' with 'Unknown_city'
data['city'].fillna('Unknown_city', inplace=True)
# Filling 'gender' with 'undetermined'
data['gender'].fillna('undetermined', inplace=True)
# Checking if all missing values are filled
missing_after_fill = data.isnull().sum()
missing_after_fill[missing_after_fill > 0]
strategies used for missing values in project
project analysis
# Distribution of armed status based on race
plt.figure(figsize=(15, 8))
sns.countplot(data=data, y='armed', hue='race', order=data['armed'].value_counts().iloc[:10].index, palette="viridis")
plt.title("Distribution of Armed Status Based on Race (Top 10 Armed Categories)")
plt.xlabel("Number of Shootings")
plt.ylabel("Armed Status")
plt.legend(title="Race", loc="lower right")
plt.show()
Guns: Predominantly, individuals from the White, Black, and Hispanic racial categories are armed with guns.
Knives: A significant number of White individuals are armed with knives, followed by Black and Hispanic individuals.
Unarmed: A concerning observation is the number of unarmed Black individuals, which is notably high compared to other racial categories.
Other categories: Similar trends are observed across other categories, with White individuals being the predominant group, followed by Black and Hispanic individuals.
This visualization provides insights into the disparities and patterns related to the armed status of individuals across different racial categories.
project analysis
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
# Fitting the Simple Exponential Smoothing model
ses_model = SimpleExpSmoothing(monthly_counts['count']).fit()
# Forecasting for the next 12 months
ses_forecast = ses_model.forecast(steps=12)
# Dates for the forecasted period
forecast_dates_ses = pd.date_range(monthly_counts.index[-1] + pd.DateOffset(months=1), periods=12, freq='M')
# Plotting the results
plt.figure(figsize=(15, 7))
plt.plot(monthly_counts.index, monthly_counts['count'], label='Actual', color='blue')
plt.plot(forecast_dates_ses, ses_forecast, label='Forecast', color='green')
plt.title('Simple Exponential Smoothing Forecast of Monthly Counts of Police Shootings for the Next 12 Months')
plt.xlabel('Date')
plt.ylabel('Number of Shootings')
plt.legend()
plt.grid(True)
plt.show()
Here’s the forecast for the next 12 months using the Simple Exponential Smoothing (SES) model:
The blue line represents the actual monthly counts of police shootings up to the present. The green line represents the forecasted monthly counts for the next 12 months. As you can see, the SES model provides a flat forecast, which suggests that the future monthly counts will remain relatively stable and close to the recent observations. This is expected, as SES is particularly suitable for data with no clear trend or seasonality.
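The monthly_counts frame used above is not defined in this snippet; one way it could have been built (an assumption on my part) is by resampling the parsed dates to monthly totals:

# Sketch: build monthly_counts from the parsed 'date' column (assumes it is already datetime)
monthly_counts = (
    data.set_index('date')
        .resample('M')
        .size()
        .to_frame(name='count')
)
print(monthly_counts.tail())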
project analysis
# Distribution of police shootings based on race and signs of mental illness
plt.figure(figsize=(15, 7))
sns.countplot(data=data, x='race', hue='signs_of_mental_illness', palette="viridis")
plt.title("Distribution of Police Shootings Based on Race and Signs of Mental Illness")
plt.ylabel("Number of Shootings")
plt.xlabel("Race")
plt.legend(title="Signs of Mental Illness")
plt.show()
Here’s the distribution of police shootings based on race and the presence or absence of signs of mental illness:
For most racial categories, a larger number of individuals did not show signs of mental illness compared to those who did. The disparity between individuals showing signs of mental illness and those who did not is particularly pronounced in the White and Black categories. This analysis provides insights into the intersection of race and mental health in the context of police shootings.
Wednesday 10/25/23 project analysis
# Distribution of threat level based on race
plt.figure(figsize=(15, 7))
sns.countplot(data=data, x='threat_level', hue='race', palette="viridis")
plt.title("Distribution of Threat Level Based on Race")
plt.ylabel("Number of Shootings")
plt.xlabel("Threat Level")
plt.legend(title="Race")
plt.show()
Attack: Most incidents across all racial categories are labeled as “attack”. Within this category, White individuals dominate, followed by Black and Hispanic individuals.
Other: The “other” category follows a similar distribution to the “attack” category, with White individuals being the predominant group, followed by Black and Hispanic individuals.
Undetermined: The number of incidents labeled as “undetermined” is relatively low across all racial categories.
This visualization provides insights into the perceived threat levels associated with different racial categories in police shooting incidents.
wednesday project2 analysis
import matplotlib.pyplot as plt
import seaborn as sns
# Filter data for White and Black individuals
white_age = data_cleaned[data_cleaned['race'] == 'W']['age']
black_age = data_cleaned[data_cleaned['race'] == 'B']['age']
# Visualization
plt.figure(figsize=(12, 6))
sns.histplot(white_age, color='skyblue', label='White', kde=True, bins=30)
sns.histplot(black_age, color='salmon', label='Black', kde=True, bins=30)
plt.title('Combined Age Distribution for White and Black Individuals')
plt.xlabel('Age')
plt.ylabel('Count')
plt.legend()
plt.show()
Both White and Black individuals have a peak around the age range of 20-40, which is consistent with the overall age distribution.
The Black individuals’ distribution appears slightly more right-skewed than the White individuals’ distribution.
The extremely small p-value (far below the common significance level of 0.05) indicates that we can reject the null hypothesis. This means there’s a statistically significant difference in the mean ages of White and Black individuals in this dataset.
The Monte Carlo simulation results are visualized above:
The histogram represents the distribution of mean differences when randomly shuffling ages without regard to race (White or Black). The red dashed line indicates the observed difference in means between the actual White and Black age samples. From the visualization, we can see that the observed difference is extreme compared to the simulated differences, which suggests a genuine difference in the age distributions between White and Black individuals in the dataset.
The Monte Carlo p-value is 0.0, which indicates that none of the 10,000 simulations produced a difference as extreme as the observed difference, further supporting the conclusion from the t-test.
In summary, both the t-test and the Monte Carlo simulation provide strong evidence that there’s a significant difference in the age distributions of White and Black individuals in this dataset.
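The t-test and Monte Carlo code are not shown above; below is a minimal sketch of the permutation-style Monte Carlo test described here, assuming the white_age and black_age series from the earlier snippet:

import numpy as np
from scipy import stats

white = white_age.dropna().values
black = black_age.dropna().values

# Welch's t-test on the two age samples
t_stat, p_value = stats.ttest_ind(white, black, equal_var=False)
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.2e}")

# Monte Carlo permutation test: shuffle ages without regard to race
observed_diff = white.mean() - black.mean()
combined = np.concatenate([white, black])
n_white = len(white)

rng = np.random.default_rng(42)
n_sims = 10_000
sim_diffs = np.empty(n_sims)
for i in range(n_sims):
    shuffled = rng.permutation(combined)
    sim_diffs[i] = shuffled[:n_white].mean() - shuffled[n_white:].mean()

# Two-sided Monte Carlo p-value: share of shuffled differences at least as extreme as the observed one
mc_p = np.mean(np.abs(sim_diffs) >= abs(observed_diff))
print(f"Observed difference in means: {observed_diff:.2f}, Monte Carlo p-value: {mc_p:.4f}")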
project analysis
# Calculating the number of people killed according to race
race_counts = data['race'].value_counts()
# Calculating the proportion of people killed according to race
race_proportions = race_counts / len(data)
race_proportions_sorted = race_proportions.sort_values(ascending=False)
race_proportions_sorted
us_population_proportions = {
    'W': 0.62,    # White
    'H': 0.19,    # Hispanic
    'B': 0.13,    # Black
    'A': 0.055,   # Asian
    'N': 0.01,    # Native American
    'O': 0.025,   # Other
    'Unknown': 0  # Unknown (we'll assume 0 since we don't have data for this)
}
# Convert the U.S. population proportions dictionary to a Series for correct operations
us_population_proportions_series = pd.Series(us_population_proportions)
# Calculate the proportion of individuals shot by race relative to their estimated population percentage
shooting_proportion_relative = race_proportions / us_population_proportions_series
shooting_proportion_relative_sorted = shooting_proportion_relative.sort_values(ascending=False)
shooting_proportion_relative_sorted
result:
B 1.697653
N 1.312172
H 0.766914
W 0.665156
A 0.293109
O 0.094976
Unknown NaN
dtype: float64
conclusion:
Here are the proportions of individuals shot by race relative to their estimated population percentage in the U.S.:
Black (B): The proportion of Black individuals shot is approximately 1.70 times their representation in the U.S. population.
Native American (N): The proportion of Native American individuals shot is approximately 1.31 times their representation in the U.S. population.
Hispanic (H): The proportion of Hispanic individuals shot is approximately 0.77 times their representation in the U.S. population.
White (W): The proportion of White individuals shot is approximately 0.67 times their representation in the U.S. population.
Asian (A): The proportion of Asian individuals shot is approximately 0.29 times their representation in the U.S. population.
Other (O): The proportion of individuals from other racial categories shot is approximately 0.09 times their representation in the U.S. population.
11 october post
Geospatial Data Sources:
Online APIs: we can access geospatial data through various APIs, such as Google Maps API, OpenStreetMap, Mapbox, and more.
Government Agencies: Many government agencies provide geospatial data for free or at a low cost. For example, in the United States, you can access data from the US Geological Survey (USGS) or the National Oceanic and Atmospheric Administration (NOAA).
Geospatial Libraries:
GDAL/OGR: The Geospatial Data Abstraction Library (GDAL) and the OGR Simple Feature Library (OGR) are powerful libraries for reading and writing various geospatial file formats.
GeoPandas: GeoPandas extends the capabilities of Pandas to allow for easy manipulation of geospatial data using GeoDataFrames.
Fiona: A Python library for reading and writing vector data (e.g., shapefiles) that is built on top of GDAL.
Shapely: Shapely is a Python library for manipulation and analysis of planar geometric objects, particularly for working with polygons and shapes.
Web Scraping and Data Extraction:
You can use libraries like Beautiful Soup or Scrapy to scrape geospatial data from websites or web services.
Data Repositories:
Many geospatial datasets are available on data repositories like Data.gov, Natural Earth, and others.
Python Modules for Geocoding and Reverse Geocoding:
Libraries like geopy provide geocoding and reverse geocoding capabilities to convert between addresses and geographic coordinates.
Interactive Maps:
Libraries like Folium allow you to create interactive maps with Python.
Here’s a basic example of how you might use GeoPandas to load and work with geospatial data from a shapefile:
import geopandas as gpd
# Load a shapefile
gdf = gpd.read_file("path/to/shapefile.shp")
# Perform geospatial operations
# For example, you can plot the data
gdf.plot()
# Filter and manipulate the data
filtered_data = gdf[gdf['population'] > 100000]
# Save the modified data
filtered_data.to_file("filtered_data.shp")
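In the same spirit, here is a small sketch of geocoding and reverse geocoding with geopy; the address and coordinates below are purely illustrative:

from geopy.geocoders import Nominatim

# Geocode an address to coordinates (example address is illustrative only)
geolocator = Nominatim(user_agent="geospatial_example")
location = geolocator.geocode("285 Old Westport Rd, Dartmouth, MA")
if location is not None:
    print(location.latitude, location.longitude)

# Reverse geocode coordinates back to a human-readable address
address = geolocator.reverse((41.63, -71.00))
print(address)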
I learned about DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm used in data mining and machine learning. It’s particularly useful for identifying clusters of data points in a dataset that have high density, while also being capable of detecting and handling noisy data.
Density-Based Clustering:
DBSCAN is based on the idea of density. It defines clusters as dense regions of data points that are separated by areas of lower density. It doesn’t assume that clusters are necessarily globular or convex in shape, making it suitable for various types of data.
Core Points, Border Points, and Noise:
DBSCAN classifies each data point into one of three categories:
Core Points: A data point is considered a core point if there are at least a specified number of data points (minPts) within a certain distance (epsilon or ε) from it. Core points are at the heart of clusters.
Border Points: A data point is considered a border point if it is within ε distance of a core point but does not have enough core points in its own neighborhood. Border points are on the edge of clusters.
Noise Points: Data points that are neither core nor border points are classified as noise points.
Cluster Formation:
The algorithm starts by selecting an arbitrary data point. If that point is a core point, it creates a new cluster and adds all the core points in its ε-neighborhood to the cluster. This process continues until no more core points can be added to the cluster. Then, the algorithm selects another unvisited point and repeats the process.
Border Points:
Border points can be part of multiple clusters if they are within the ε-distance of multiple core points. They are assigned to the cluster of the first core point they encounter.
Noise Points:
Noise points are data points that are not part of any cluster.
Advantages:
DBSCAN can identify clusters of various shapes and sizes.
It doesn’t require specifying the number of clusters in advance.
It can handle noisy data effectively by classifying outliers as noise points.
Parameters:
The key parameters in DBSCAN are the ε (epsilon) distance threshold and the minPts value. These need to be chosen carefully, as they determine the cluster formation. Tuning these parameters can be a bit challenging in some cases.
Limitations:
DBSCAN may struggle with datasets of varying densities.
It is sensitive to the order in which data points are processed.
Determining the appropriate values for ε and minPts can be tricky.
In summary, DBSCAN is a powerful clustering algorithm that can automatically identify clusters in data based on the density of data points. It is widely used in various fields, such as geospatial analysis, image processing, and anomaly detection.
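As a concrete illustration, here is a minimal scikit-learn sketch on synthetic data; the eps and min_samples values are arbitrary and would need tuning for real data:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Synthetic, non-convex clusters with some noise
X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

# eps is the neighborhood radius; min_samples plays the role of minPts
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_  # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")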
Different methods we can use for missing data, since the fatal shooting data has a lot of missing values
Handling missing data is a crucial step in data preprocessing and analysis. There are several methods to deal with missing data, each with its own advantages and disadvantages.
Removal of Missing Data:
Listwise Deletion (Complete Case Analysis): In this method, you simply remove any rows or observations that contain missing values. This is a straightforward approach but can result in a loss of valuable data, especially if a large portion of your data is missing.
Imputation:
Imputation involves filling in missing values with estimated or predicted values. There are several techniques for imputing missing data:
Mean, Median, or Mode Imputation: Replace missing values with the mean, median, or mode of the available values in the variable. This is a simple method but may not be suitable for variables with a skewed distribution.
Constant Value Imputation: Replace missing values with a predetermined constant value, such as zero. This method is straightforward but may introduce bias if the missing data is not missing at random.
Regression Imputation: Use regression analysis to predict missing values based on the relationships between the variable with missing data and other relevant variables. This is a more sophisticated method but requires a strong correlation between variables.
K-Nearest Neighbors (KNN) Imputation: Replace missing values with values from the K-nearest neighbors in the dataset. This method considers the similarity between observations.
Multiple Imputation: This involves creating multiple datasets with imputed values and averaging the results to reduce imputation uncertainty. It’s a more advanced technique and is often preferred when dealing with complex missing data patterns.
Interpolation:
Interpolation methods are used for time-series or sequential data to estimate missing values based on the trend and patterns in the available data.
Linear Interpolation: Estimate missing values by creating a linear relationship between adjacent data points.
Time-Series Methods: Use time-series forecasting techniques like ARIMA or exponential smoothing to predict missing values.
Domain-Specific Methods:
Depending on the specific domain or type of data you’re working with, there may be custom methods for handling missing data. For example, in healthcare, there are specialized imputation methods for medical data.
Data Augmentation:
In machine learning, data augmentation techniques can be used to generate synthetic data points that are similar to the observed data. This can be particularly useful when dealing with image and text data.
Indicator Variables:
Create binary indicator variables to flag the presence or absence of missing data for each variable. This allows you to incorporate information about the missingness in your analysis.
Model-Based Methods:
Model-based imputation involves using machine learning models, such as decision trees or random forests, to predict missing values based on the relationships within the data.
Collect More Data:
In some cases, collecting more data can help reduce the impact of missing values. However, this may not always be feasible.
It’s important to note that the choice of method should be guided by the characteristics of the data and the goals of your analysis. Additionally, understanding the nature of the missing data (missing completely at random, missing at random, or missing not at random) is crucial in selecting the appropriate imputation technique. Multiple imputation and sensitivity analysis can be used to account for missing data mechanisms and assess the robustness of your results.
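To make a couple of these methods concrete, here is a small scikit-learn sketch on toy data; the column names and values are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy frame with missing values (columns are illustrative only)
df_toy = pd.DataFrame({
    'age': [25, np.nan, 40, 31, np.nan, 52],
    'income': [40000, 52000, np.nan, 61000, 45000, np.nan],
})

# Indicator variables flagging where values were missing
indicators = df_toy.isna().add_suffix('_missing')

# Median imputation
median_imputed = pd.DataFrame(
    SimpleImputer(strategy='median').fit_transform(df_toy), columns=df_toy.columns)

# KNN imputation: fill each missing value using the 2 most similar rows
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df_toy), columns=df_toy.columns)

print(knn_imputed.join(indicators))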
Dbscan algorithm
DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm commonly used in data mining and machine learning. It’s designed to group together data points that are close to each other based on their density in the feature space. Here are the key components of DBSCAN:
Core Points: These are data points that have at least a specified number of data points (minPts) within a certain distance (eps) from them. Core points are at the heart of clusters.
Border Points: These points are within the epsilon distance of a core point but do not have enough neighbors to be considered core points themselves. They are on the fringes of clusters.
Noise Points: Data points that are neither core points nor border points are considered noise points. They don’t belong to any cluster.
The DBSCAN algorithm works as follows:
Randomly select an unvisited data point.
If it’s a core point, create a new cluster and add it to the cluster. Then expand the cluster by adding all directly reachable core points and their neighbors to the cluster.
Repeat steps 1 and 2 until all data points have been visited.
Any unvisited data points at this stage are classified as noise.
DBSCAN is effective at discovering clusters of arbitrary shapes and is robust to noise in the data. It doesn’t require specifying the number of clusters beforehand, which makes it a valuable tool for cluster analysis. However, setting the appropriate values for “eps” and “minPts” can be challenging, and the algorithm may not perform well in high-dimensional spaces due to the curse of dimensionality.
project 1
project report snippet
# skew and kurtosis from scipy.stats (kurtosis uses Fisher's definition: a normal distribution scores 0)
from scipy.stats import skew, kurtosis

# Calculate skewness and kurtosis for % DIABETIC
skewness_diabetic = skew(df_diabetic['% DIABETIC'])
kurtosis_diabetic = kurtosis(df_diabetic['% DIABETIC'])
# Calculate skewness and kurtosis for % OBESE
skewness_obese = skew(df_obese['% OBESE'])
kurtosis_obese = kurtosis(df_obese['% OBESE'])
# Calculate skewness and kurtosis for % INACTIVE
skewness_inactive = skew(df_inactive['% INACTIVE'])
kurtosis_inactive = kurtosis(df_inactive['% INACTIVE'])
# Print the results
print(f'Skewness of % DIABETIC: {skewness_diabetic:.2f}')
print(f'Kurtosis of % DIABETIC: {kurtosis_diabetic:.2f}')
print(f'Skewness of % OBESE: {skewness_obese:.2f}')
print(f'Kurtosis of % OBESE: {kurtosis_obese:.2f}')
print(f'Skewness of % INACTIVE: {skewness_inactive:.2f}')
print(f'Kurtosis of % INACTIVE: {kurtosis_inactive:.2f}')
results:
Skewness of % DIABETIC: 0.97
Kurtosis of % DIABETIC: 1.03
Skewness of % OBESE: -2.69
Kurtosis of % OBESE: 12.32
Skewness of % INACTIVE: -0.34
Kurtosis of % INACTIVE: -0.55
Explanation:
% DIABETIC:
Skewness: 0.97
Interpretation: The skewness value of 0.97 indicates that the distribution of % DIABETIC data is moderately right-skewed. In practical terms, this means there is a longer tail on the right side of the distribution, with most data points concentrated towards the lower end of the scale and a smaller number of unusually high values.
Kurtosis: 1.03
Interpretation: Because scipy's kurtosis function reports excess kurtosis (a normal distribution scores 0), a value of 1.03 means the % DIABETIC data has somewhat heavier tails and a somewhat sharper peak than a normal distribution. This suggests that while there may be some outliers, the distribution is still relatively close to normal in terms of tailedness and peakedness.
% OBESE:
Skewness: -2.69
Interpretation: The skewness value of -2.69 indicates that the distribution of % OBESE data is heavily left-skewed. In practical terms, this means there is a long tail on the left side of the distribution, with most data points clustered towards the higher end of the scale and a small number of unusually low values pulling the tail to the left.
Kurtosis: 12.32
Interpretation: An excess kurtosis of 12.32 implies that the % OBESE data has very heavy tails and a pronounced peak. This high kurtosis indicates that the distribution has far more outliers or extreme values than a normal distribution (excess kurtosis of 0). The distribution is leptokurtic, meaning it has heavier tails and is more peaked than a normal distribution.
% INACTIVE:
Skewness: -0.34
Interpretation: The skewness value of -0.34 suggests that the distribution of % INACTIVE data is slightly left-skewed. In practical terms, this means there is a slight tail on the left side of the distribution, but the majority of data points are concentrated towards the higher end of the scale.
Kurtosis: -0.55
Interpretation: An excess kurtosis of -0.55 indicates that the % INACTIVE data has lighter tails and a less pronounced peak than a normal distribution. This suggests that the distribution is relatively flat and has fewer outliers than a typical normal distribution.
In summary, the skewness and kurtosis values provide insights into the shape and characteristics of the data distributions. % DIABETIC is moderately right-skewed with a distribution relatively close to normal. % OBESE is heavily left-skewed with very heavy tails and a pronounced peak. % INACTIVE is slightly left-skewed with a relatively flat, light-tailed distribution.
more about Bias variance trade off
Why is there a Bias-Variance Tradeoff?
If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So we need to find the right balance without overfitting or underfitting the data.
This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can't be more complex and less complex at the same time.
Total Error
To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.
Total Error = Variance + Bias² + Irreducible Error
An optimal balance of bias and variance would never overfit or underfit the model.
Therefore understanding bias and variance is critical for understanding the behavior of prediction models.
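A quick sketch of the tradeoff in practice, fitting polynomial regressions of increasing degree to noisy synthetic data (the data and degrees are illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 80)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 80)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error').mean()
    # Very low degrees tend to underfit (high bias); very high degrees tend to overfit (high variance)
    print(f"degree={degree:2d}  cross-validated MSE={mse:.3f}")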
yesterday I learned about K-fold cross validation
K-fold cross-validation is a technique used to assess the performance and reliability of a machine learning model. It involves dividing a dataset into K equal-sized subsets or “folds.” The model is trained and evaluated K times, with each iteration using a different fold as the validation set and the remaining K-1 folds for training. This process helps ensure that the model is tested on various portions of the data, reducing the risk of overfitting and providing a more robust evaluation of its generalization ability.
After each iteration, performance metrics (e.g., accuracy, error) are recorded. The final evaluation is typically the average of these metrics across all K iterations. K-fold cross-validation helps in estimating how well a model will perform on unseen data and aids in hyperparameter tuning. Common values for K are 5 or 10, but the choice depends on the size of the dataset and computational resources. Overall, K-fold cross-validation is a valuable tool for improving model reliability and making more informed decisions in machine learning.
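A minimal sketch of 5-fold cross-validation with scikit-learn on synthetic regression data, just to show the mechanics:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# 5 folds: each fold serves once as the validation set while the other 4 train the model
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring='r2')

print("R-squared per fold:", scores.round(3))
print("Mean R-squared:", scores.mean().round(3))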
cross validation used in project
Cross-validation is invaluable in machine learning and model development as it provides a robust and unbiased assessment of a model’s performance. By systematically partitioning the dataset into training and testing subsets multiple times, it helps detect issues like overfitting and ensures that the model generalizes well to unseen data. This technique aids in hyperparameter tuning, model selection, and estimating how well the model will perform in real-world applications. It enhances the reliability of performance metrics, reduces the risk of data-driven model biases, and offers a more accurate representation of a model’s true capabilities, making it an essential practice for building effective and trustworthy machine learning models.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns  # Import Seaborn
# Feature Engineering
df['Interaction'] = df['% OBESE'] * df['% INACTIVE']
df['% OBESE_squared'] = df['% OBESE'] ** 2
df['% INACTIVE_squared'] = df['% INACTIVE'] ** 2
# Define dependent variable and independent variables
X = df[['% OBESE', '% INACTIVE', 'Interaction', '% OBESE_squared', '% INACTIVE_squared']]  # Add new features
y = df['% DIABETIC']  # Replace with the actual column name
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for numerical features (scaling)
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
# Create and fit the Linear Regression model in a pipeline
linear_model = Pipeline(steps=[
    ('preprocessor', numerical_transformer),
    ('model', LinearRegression())
])
# Fit the model on the training data
linear_model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = linear_model.predict(X_test)
# Calculate the Mean Squared Error (MSE) on the testing set
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
# Calculate the R-squared value
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")
# Create a DataFrame containing X_test and predictions for visualization
results_df = pd.DataFrame({'% OBESE': X_test['% OBESE'],
                           '% INACTIVE': X_test['% INACTIVE'],
                           'Prediction': y_pred})
# Create a heatmap to visualize the relationships
heatmap_data = results_df.pivot_table(index='% OBESE', columns='% INACTIVE', values='Prediction')
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Heatmap of Predictions (Linear Regression)')
plt.xlabel('% INACTIVE')
plt.ylabel('% OBESE')
plt.show()
Mean Squared Error: 0.43
R-squared: 0.36
After using cross-validation we reduce the chance of overfitting in our model; in fact, the decrease in the R-squared value suggests we are indeed reducing the overfit to the data.
project analysis using heatmap
In the heatmap:
The x-axis represents the ‘% INACTIVE’ feature.
The y-axis represents the ‘% OBESE’ feature.
The color intensity at each point in the heatmap represents the predicted values for ‘% DIABETIC’ for the corresponding combination of ‘% OBESE’ and ‘% INACTIVE’.
From the heatmap above, we can see that counties with higher % OBESE and higher % INACTIVE tend to have higher predicted % DIABETIC.
analysis of premolt and postmolt of crab data
After analyzing the crab data, below are the findings I found intriguing.
To support the recommendation that a minimum shell size limit of at least 145 mm should be imposed for female Dungeness Crabs in fishing regulations, you can perform a t-test to analyze the statistical significance of the difference in molt status between crabs above and below the 145 mm shell size threshold. This will help determine if the majority of crabs above 145 mm indeed did not molt.
Null Hypothesis (H0): There is no significant difference in molt status between female Dungeness Crabs with shell sizes above and below 145 mm.
Alternative Hypothesis (H1): There is a significant difference in molt status between female Dungeness Crabs with shell sizes above and below 145 mm
conclusion:
If the p-value is less than your chosen significance level (e.g., 0.05), you can conclude that there is a significant difference in molt status between the two groups.
If the t-test shows a significant difference and supports the claim that the majority of female Dungeness Crabs with shell sizes above 145 mm did not molt, it would provide statistical evidence to support the recommendation of imposing a minimum shell size limit for fishing regulations.
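A sketch of how this test might be run; the data frame name (crabs) and the column names ('size' and 'molted') are assumptions for illustration, since the actual crab data frame is not shown here:

from scipy import stats

# Hypothetical columns: 'size' = shell size in mm, 'molted' = 1 if the crab molted, 0 otherwise
above = crabs.loc[crabs['size'] >= 145, 'molted']
below = crabs.loc[crabs['size'] < 145, 'molted']

# Welch's t-test on molt status (i.e., molt proportions) between the two groups
t_stat, p_value = stats.ttest_ind(above, below, equal_var=False)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: molt status differs significantly across the 145 mm threshold")
else:
    print("Fail to reject H0: no significant difference detected")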
project analysis using cross validation
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np
# Feature Engineering
df['Interaction'] = df['% OBESE'] * df['% INACTIVE']
df['% OBESE_squared'] = df['% OBESE'] ** 2
df['% INACTIVE_squared'] = df['% INACTIVE'] ** 2
# Define your dependent variable and independent variables
X = df[['% OBESE', '% INACTIVE', 'Interaction', '% OBESE_squared', '% INACTIVE_squared']]  # Add new features
y = df['% DIABETIC']  # Replace with the actual column name
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for numerical features (scaling)
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
# Create and fit the model in a pipeline
model = Pipeline(steps=[
    ('preprocessor', numerical_transformer),
    ('model', LinearRegression())
])
# Calculate the negative MSE using cross-validation
cv_mse_scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mse_cv_mean = cv_mse_scores.mean()
print(f"Cross-Validation Mean Squared Error: {mse_cv_mean:.2f}")
# Fit the model on the training data
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Calculate the MSE on the testing set
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
# Visualize the results for each independent variable
independent_variables = ['% OBESE', '% INACTIVE', 'Interaction', '% OBESE_squared', '% INACTIVE_squared']
coefficients = model.named_steps['model'].coef_
intercept = model.named_steps['model'].intercept_
X_test_scaled = model.named_steps['preprocessor'].transform(X_test)
for variable in independent_variables:
    idx = independent_variables.index(variable)
    # Scatter plot of the actual data
    plt.scatter(X_test[variable], y_test, color='blue', label='Actual Data')
    # Partial effect of this variable alone (intercept + its coefficient applied to the scaled feature),
    # plotted in sorted order so the line is not scrambled
    order = np.argsort(X_test[variable].values)
    y_partial = intercept + coefficients[idx] * X_test_scaled[:, idx]
    plt.plot(X_test[variable].values[order], y_partial[order], color='red', label='Partial Effect')
    plt.xlabel(variable)
    plt.ylabel('% DIABETIC')
    plt.legend()
    plt.show()
The cross-validation mean squared error (CV MSE) is 0.36, and the mean squared error (MSE) on the testing set is 0.43. These values represent the model’s performance in terms of how well it predicts the ‘% DIABETIC’ target variable. Lower values of MSE indicate better predictive performance, so a lower MSE is desirable.
Here’s how you can interpret these results:
Cross-Validation MSE (CV MSE): This metric is calculated using cross-validation, which provides a more robust estimate of the model’s performance. Here, the CV MSE of 0.36 means that, averaged across the folds, the squared prediction error is about 0.36; a lower CV MSE indicates better model performance.
MSE on the Testing Set: This metric reflects how well the model generalizes to new, unseen data (the testing set). An MSE of 0.43 means that the average squared prediction error on the testing set is about 0.43.
In both cases, a lower MSE would be preferred, as it indicates that the model is making more accurate predictions. Additionally, you can further explore model improvement strategies, such as trying different algorithms, hyperparameter tuning, or collecting more data if feasible, to potentially reduce the MSE even further.
project analysis
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define your dependent variable and independent variables
X = df[['% OBESE', '% INACTIVE', 'FIPS']]  # Replace with the actual column names
y = df[['% DIABETIC']]  # Replace with the actual column name
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and fit the multivariate linear regression model
mlr_model = LinearRegression()
mlr_model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = mlr_model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
# Interpret the coefficients
coefficients = mlr_model.coef_[0]
intercept = mlr_model.intercept_[0]
print("Coefficients:")
for i, dep_var in enumerate(['%DIABETIC']):
    print(f"Dependent Variable: {dep_var}")
    print(f"Coefficients: {coefficients}")
    print(f"Intercept: {intercept:.2f}")
I have done some multivariable linear regression on the CDC data.
Overall, with an improved R-squared value of 0.40, your multivariate linear regression model is doing a better job of explaining the variation in %DIABETIC based on the selected independent variables. The coefficients indicate the strength and direction of the relationships between these variables and %DIABETIC. However, as the coefficient for FIPS is very close to zero, it suggests that FIPS may not be a significant predictor in your model.
sep 13 class discussion
Today in class I learned what a p-value is and how it is interpreted.
Imagine you’re flipping a coin to see if it’s fair (meaning it has an equal chance of landing on heads or tails). If you flip it a few times and it comes up heads most of the time, you might start to wonder if the coin is biased. The p-value is like a tool that helps you decide if the coin is likely biased or if the results could have happened due to random chance.
If the p-value is very low (usually below a certain threshold, like 0.05), it suggests that the results are unlikely to be due to chance, and you might conclude that the coin is probably biased. On the other hand, if the p-value is high (greater than 0.05), it suggests that the results could easily happen by chance alone, so you wouldn’t have strong evidence that the coin is biased.
Lower p-values indicate stronger evidence against randomness, while higher p-values suggest that the observed results might be due to chance.
Scenario: You have a coin, and you want to determine if it’s a fair coin (meaning it has an equal chance of landing on heads or tails). You decide to conduct an experiment by flipping the coin 100 times and recording the results.
Null Hypothesis (H0): The coin is fair; the probability of getting heads or tails is 50% (0.5).
Alternative Hypothesis (Ha): The coin is biased; the probability of getting heads or tails is not 50%.
After conducting your experiment, you observe that the coin landed on heads 60 times and on tails 40 times.
Calculating the p-value: Using a statistical test (like a chi-squared test), you calculate the p-value. In this case, let’s say the p-value is 0.03.
Interpretation: If your chosen significance level (alpha) is 0.05, which is a common threshold, the p-value of 0.03 is less than alpha. This means there’s only a 3% chance of observing such an extreme result (60 heads) if the coin were truly fair (null hypothesis). Because the p-value is less than alpha, you reject the null hypothesis and conclude that the coin may be biased.
In simple terms, the p-value helps you decide if your coin is likely fair (p-value > 0.05) or if there’s evidence that it’s biased (p-value ≤ 0.05) based on your experiment’s results.
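To make the arithmetic concrete, here is a small sketch of the same coin example with scipy; note that the exact p-values from these tests come out slightly higher than the illustrative 0.03 used above:

from scipy import stats

# 60 heads in 100 flips, tested against a fair coin (p = 0.5)
result = stats.binomtest(60, n=100, p=0.5, alternative='two-sided')
print(f"Two-sided binomial p-value: {result.pvalue:.4f}")  # roughly 0.057

# A chi-squared goodness-of-fit test on the same counts
chi2, p = stats.chisquare([60, 40], f_exp=[50, 50])
print(f"Chi-squared = {chi2:.2f}, p-value = {p:.4f}")  # chi2 = 4.0, p roughly 0.046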
In our regression model, we are examining the relationship between the predictor variable “%Inactivity” and the response variable “%Diabetic.” Specifically, we are interested in checking for homoscedasticity, which means that the variability of the errors (residuals) in our model should be consistent across different levels of “%Inactivity.” In simpler terms, the spread of the data points around the regression line should be roughly the same for all values of “%Inactivity.”
Here’s what the results of the Breusch-Pagan test are indicating:
LM Statistic: The LM statistic is a measure used in the Breusch-Pagan test to assess whether there is heteroscedasticity (varying levels of error variance) in your model. A very low p-value (close to 0) for the LM statistic suggests strong evidence against the null hypothesis of homoscedasticity. In our case, the p-value is extremely close to zero (3.607×10^-13), indicating that there is significant evidence that the errors in your model do not have consistent variances across different levels of “%Inactivity.”
F-test: The F-test associated with the Breusch-Pagan test is used to support the LM statistic. In our case, it yields a p-value of 1, which is unusual. Normally, a p-value of 1 would suggest homoscedasticity (consistent error variances). However, you shouldn’t solely rely on the F-test in this situation because the LM test has provided strong evidence against homoscedasticity.
In conclusion, even though the F-test may not suggest heteroscedasticity, the LM test’s extremely low p-value indicates strong evidence that your model does indeed have varying error variances across different levels of “%Inactivity.” As a result, it’s recommended to consider the presence of heteroscedastic errors in your regression model when interpreting the results and making any necessary adjustments or transformations to address this issue.
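For reference, a minimal sketch of how the Breusch-Pagan test can be run with statsmodels, assuming the same df with the '% INACTIVE' and '% DIABETIC' columns:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simple regression of % DIABETIC on % INACTIVE (assumes the project's df)
X = sm.add_constant(df['% INACTIVE'])
ols_model = sm.OLS(df['% DIABETIC'], X).fit()

# Breusch-Pagan test on the residuals; a small LM p-value suggests heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_model.resid, ols_model.model.exog)
print(f"LM statistic = {lm_stat:.2f}, LM p-value = {lm_pvalue:.3e}")
print(f"F statistic = {f_stat:.2f}, F p-value = {f_pvalue:.3e}")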
CDC data analysis insights from today's class (from the PDF)
Today we analyzed the CDC data in class. One of the things I found interesting is that the data fans out at extreme points. Below is the explanation.
Median: the median is 7.45, which suggests that approximately half of the data points in the dataset are below 7.45, and the other half are above it.
Mean (Average): the mean is 7.62883, which is very close to the median. When the mean and median are close, it indicates that the data is likely symmetrically distributed.
Standard Deviation (Stdev): The standard deviation measures the dispersion or spread of the data. In this case, a standard deviation of 1.01628 suggests that the data is relatively tightly clustered around the mean.
Skewness: Skewness measures the asymmetry of the data distribution. A positive skewness value (like 0.658616) suggests that the data is skewed to the right, meaning it has a longer tail on the right side of the distribution. In this case, the data distribution is slightly right-skewed.
Kurtosis: Kurtosis measures the “tailedness” of the data distribution. A higher kurtosis value (like 4.13026) indicates that the distribution has heavier tails and is more peaked around the mean compared to a normal distribution (which has a kurtosis of 3). This suggests that the data has heavier tails, which means it may have more extreme values.
Based on these statistics, it appears that your data is relatively symmetric with a slight right skew and has heavier tails than a normal distribution, possibly indicating the presence of outliers or extreme values in your dataset.
2)Correlation[DiabetesShort〚All, 2〛, Inactivity〚All, 2〛]
Out[242]= 0.441706
a correlation coefficient of 0.441706 suggests a moderately positive linear relationship between “DiabetesShort” and “Inactivity.” This means that as the values of “DiabetesShort” increase, the values of “Inactivity” tend to increase as well, and vice versa, but not perfectly so. The strength of this relationship is moderate, as the coefficient is not close to 1.
3)Heteroscedasticity refers to a pattern of non-constant variance in the errors or residuals of a regression model. In other words, the spread of the residuals systematically changes as you move along the independent variable(s). Detecting heteroscedasticity is essential in regression analysis, as it can violate one of the key assumptions that residuals have constant variance, potentially leading to biased parameter estimates and incorrect inference.
4)The “fanning out” of the residuals as the fitted values increase is a clear indication of heteroscedasticity, which is a significant warning sign that the linear model may not be reliable.