project analysis

# Filling missing values based on the mentioned strategies

# Filling 'flee' with 'unknown'
data['flee'].fillna('unknown', inplace=True)

# Filling 'age' with the median
data['age'].fillna(data['age'].median(), inplace=True)

# Filling 'name' with 'Unknown'
data['name'].fillna('Unknown', inplace=True)

# Filling 'armed' with 'undetermined'
data['armed'].fillna('undetermined', inplace=True)

# Filling 'city' with 'Unknown_city'
data['city'].fillna('Unknown_city', inplace=True)

# Filling 'gender' with 'undetermined'
data['gender'].fillna('undetermined', inplace=True)

# Checking if all missing values are filled
missing_after_fill = data.isnull().sum()
missing_after_fill[missing_after_fill > 0]

Strategies used for missing values in the project

project analysis

# Distribution of armed status based on race

plt.figure(figsize=(15, 8))
sns.countplot(data=data, y='armed', hue='race', order=data['armed'].value_counts().iloc[:10].index, palette="viridis")
plt.title("Distribution of Armed Status Based on Race (Top 10 Armed Categories)")
plt.xlabel("Number of Shootings")
plt.ylabel("Armed Status")
plt.legend(title="Race", loc="lower right")
plt.show()

Guns: Predominantly, individuals from the White, Black, and Hispanic racial categories are armed with guns.
Knives: A significant number of White individuals are armed with knives, followed by Black and Hispanic individuals.
Unarmed: A concerning observation is the number of unarmed Black individuals, which is notably high compared to other racial categories.
Other categories: Similar trends are observed across other categories, with White individuals being the predominant group, followed by Black and Hispanic individuals.
This visualization provides insights into the disparities and patterns related to the armed status of individuals across different racial categories.

project analysis

from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Fitting the Simple Exponential Smoothing model
ses_model = SimpleExpSmoothing(monthly_counts['count']).fit()

# Forecasting for the next 12 months
ses_forecast = ses_model.forecast(steps=12)

# Dates for the forecasted period
forecast_dates_ses = pd.date_range(monthly_counts.index[-1] + pd.DateOffset(months=1), periods=12, freq='M')

# Plotting the results
plt.figure(figsize=(15, 7))
plt.plot(monthly_counts.index, monthly_counts['count'], label='Actual', color='blue')
plt.plot(forecast_dates_ses, ses_forecast, label='Forecast', color='green')
plt.title('Simple Exponential Smoothing Forecast of Monthly Counts of Police Shootings for the Next 12 Months')
plt.xlabel('Date')
plt.ylabel('Number of Shootings')
plt.legend()
plt.grid(True)
plt.show()

Here’s the forecast for the next 12 months using the Simple Exponential Smoothing (SES) model:

The blue line represents the actual monthly counts of police shootings up to the present. The green line represents the forecasted monthly counts for the next 12 months. As you can see, the SES model provides a flat forecast, which suggests that the future monthly counts will remain relatively stable and close to the recent observations. This is expected, as SES is particularly suitable for data with no clear trend or seasonality.

project analysis

# Distribution of police shootings based on race and signs of mental illness

plt.figure(figsize=(15, 7))
sns.countplot(data=data, x='race', hue='signs_of_mental_illness', palette="viridis")
plt.title("Distribution of Police Shootings Based on Race and Signs of Mental Illness")
plt.ylabel("Number of Shootings")
plt.xlabel("Race")
plt.legend(title="Signs of Mental Illness")
plt.show()

Here’s the distribution of police shootings based on race and the presence or absence of signs of mental illness:

For most racial categories, a larger number of individuals did not show signs of mental illness compared to those who did. The disparity between individuals showing signs of mental illness and those who did not is particularly pronounced in the White and Black categories. This analysis provides insights into the intersection of race and mental health in the context of police shootings.

Wednesday 10/25/23 project analysis

# Distribution of threat level based on race

plt.figure(figsize=(15, 7))
sns.countplot(data=data, x='threat_level', hue='race', palette="viridis")
plt.title("Distribution of Threat Level Based on Race")
plt.ylabel("Number of Shootings")
plt.xlabel("Threat Level")
plt.legend(title="Race")
plt.show()

Attack: Most incidents across all racial categories are labeled as "attack". Within this category, White individuals dominate, followed by Black and Hispanic individuals.
Other: The "other" category follows a similar distribution to the "attack" category, with White individuals being the predominant group, followed by Black and Hispanic individuals.
Undetermined: The number of incidents labeled as "undetermined" is relatively low across all racial categories.
This visualization provides insights into the perceived threat levels associated with different racial categories in police shooting incidents.

Wednesday project 2 analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Filter data for White and Black individuals
white_age = data_cleaned[data_cleaned['race'] == 'W']['age']
black_age = data_cleaned[data_cleaned['race'] == 'B']['age']

# Visualization
plt.figure(figsize=(12, 6))
sns.histplot(white_age, color='skyblue', label='White', kde=True, bins=30)
sns.histplot(black_age, color='salmon', label='Black', kde=True, bins=30)
plt.title('Combined Age Distribution for White and Black Individuals')
plt.xlabel('Age')
plt.ylabel('Density')
plt.legend()
plt.show()

Both White and Black individuals have a peak around the age range of 20-40, which is consistent with the overall age distribution.
The Black individuals’ distribution appears slightly more right-skewed than the White individuals’ distribution.

The extremely small p-value (far below the common significance level of 0.05) indicates that we can reject the null hypothesis. This means there's a statistically significant difference in the mean ages of White and Black individuals in this dataset.

The Monte Carlo simulation results are visualized above:

The histogram represents the distribution of mean differences when randomly shuffling ages without regard to race (White or Black). The red dashed line indicates the observed difference in means between the actual White and Black age samples. From the visualization, we can see that the observed difference is extreme compared to the simulated differences, which suggests a genuine difference in the age distributions between White and Black individuals in the dataset.

The Monte Carlo p-value is 0.0, which indicates that none of the 10,000 simulations produced a difference as extreme as the observed difference, further supporting the conclusion from the t-test.

In summary, both the t-test and the Monte Carlo simulation provide strong evidence that there's a significant difference in the age distributions of White and Black individuals in this dataset.
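
For reference, here is a minimal sketch of how the t-test and the Monte Carlo (permutation-style) simulation described above could be implemented. It reuses the white_age and black_age series from the earlier snippet; the random seed and other implementation details are illustrative rather than the project's exact code.

import numpy as np
from scipy import stats

# Welch's t-test on the two age samples (dropping any remaining NaNs)
white = white_age.dropna().to_numpy()
black = black_age.dropna().to_numpy()
t_stat, p_value = stats.ttest_ind(white, black, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")

# Monte Carlo simulation: shuffle ages without regard to race and
# recompute the difference in means 10,000 times
observed_diff = white.mean() - black.mean()
combined = np.concatenate([white, black])
n_white = len(white)
rng = np.random.default_rng(42)  # illustrative seed

sim_diffs = np.empty(10_000)
for i in range(sim_diffs.size):
    rng.shuffle(combined)
    sim_diffs[i] = combined[:n_white].mean() - combined[n_white:].mean()

# Two-sided Monte Carlo p-value: fraction of simulated differences
# at least as extreme as the observed one
mc_p = np.mean(np.abs(sim_diffs) >= np.abs(observed_diff))
print(f"Observed difference: {observed_diff:.2f} years, Monte Carlo p = {mc_p:.4f}")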

project analysis

# Calculating the number of people killed according to race
race_counts = data['race'].value_counts()

# Calculating the proportion of people killed according to race
race_proportions = race_counts / len(data)

race_proportions_sorted = race_proportions.sort_values(ascending=False)
race_proportions_sorted

us_population_proportions = {
    'W': 0.62,    # White
    'H': 0.19,    # Hispanic
    'B': 0.13,    # Black
    'A': 0.055,   # Asian
    'N': 0.01,    # Native American
    'O': 0.025,   # Other
    'Unknown': 0  # Unknown (we'll assume 0 since we don't have data for this)
}

# Convert the U.S. population proportions dictionary to a Series for correct operations
us_population_proportions_series = pd.Series(us_population_proportions)

# Calculate the proportion of individuals shot by race relative to their estimated population percentage
shooting_proportion_relative = race_proportions / us_population_proportions_series

shooting_proportion_relative_sorted = shooting_proportion_relative.sort_values(ascending=False)
shooting_proportion_relative_sorted

result:
B 1.697653
N 1.312172
H 0.766914
W 0.665156
A 0.293109
O 0.094976
Unknown NaN
dtype: float64

conclusion:
Here are the proportions of individuals shot by race relative to their estimated population percentage in the U.S.:

Black (B): The proportion of Black individuals shot is approximately 1.70 times their representation in the U.S. population.
Native American (N): The proportion of Native American individuals shot is approximately 1.31 times their representation in the U.S. population.
Hispanic (H): The proportion of Hispanic individuals shot is approximately 0.77 times their representation in the U.S. population.
White (W): The proportion of White individuals shot is approximately 0.67 times their representation in the U.S. population.
Asian (A): The proportion of Asian individuals shot is approximately 0.29 times their representation in the U.S. population.
Other (O): The proportion of individuals from other racial categories shot is approximately 0.09 times their representation in the U.S. population.

11 October post

Geospatial Data Sources:

Online APIs: we can access geospatial data through various APIs, such as Google Maps API, OpenStreetMap, Mapbox, and more.
Government Agencies: Many government agencies provide geospatial data for free or at a low cost. For example, in the United States, you can access data from the US Geological Survey (USGS) or the National Oceanic and Atmospheric Administration (NOAA).
Geospatial Libraries:

GDAL/OGR: The Geospatial Data Abstraction Library (GDAL) and the OGR Simple Feature Library (OGR) are powerful libraries for reading and writing various geospatial file formats.
GeoPandas: GeoPandas extends the capabilities of Pandas to allow for easy manipulation of geospatial data using GeoDataFrames.
Fiona: A Python library for reading and writing vector data (e.g., shapefiles) that is built on top of GDAL.
Shapely: Shapely is a Python library for manipulation and analysis of planar geometric objects, particularly for working with polygons and shapes.
Web Scraping and Data Extraction:

You can use libraries like Beautiful Soup or Scrapy to scrape geospatial data from websites or web services.
Data Repositories:

Many geospatial datasets are available on data repositories like Data.gov, Natural Earth, and others.
Python Modules for Geocoding and Reverse Geocoding:

Libraries like geopy provide geocoding and reverse geocoding capabilities to convert between addresses and geographic coordinates.
Interactive Maps:

Libraries like Folium allow you to create interactive maps with Python.
Here’s a basic example of how you might use GeoPandas to load and work with geospatial data from a shapefile:

import geopandas as gpd

# Load a shapefile
gdf = gpd.read_file("path/to/shapefile.shp")

# Perform geospatial operations
# For example, you can plot the data
gdf.plot()

# Filter and manipulate the data
filtered_data = gdf[gdf['population'] > 100000]

# Save the modified data
filtered_data.to_file("filtered_data.shp")
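
Along the same lines, here is a minimal, hypothetical sketch of the geocoding and interactive-map libraries mentioned above (geopy's Nominatim geocoder and Folium); the address, user agent, and output file name are placeholders.

import folium
from geopy.geocoders import Nominatim

# Geocode a placeholder address to latitude/longitude
geolocator = Nominatim(user_agent="geo_demo")  # Nominatim requires a user_agent string
location = geolocator.geocode("Boston, MA")

# Build an interactive map centered on the geocoded point and save it to HTML
m = folium.Map(location=[location.latitude, location.longitude], zoom_start=12)
folium.Marker([location.latitude, location.longitude], popup="Boston, MA").add_to(m)
m.save("boston_map.html")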

I learned about DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm used in data mining and machine learning. It’s particularly useful for identifying clusters of data points in a dataset that have high density, while also being capable of detecting and handling noisy data.

Density-Based Clustering:
DBSCAN is based on the idea of density. It defines clusters as dense regions of data points that are separated by areas of lower density. It doesn’t assume that clusters are necessarily globular or convex in shape, making it suitable for various types of data.

Core Points, Border Points, and Noise:
DBSCAN classifies each data point into one of three categories:

Core Points: A data point is considered a core point if there are at least a specified number of data points (minPts) within a certain distance (epsilon or ε) from it. Core points are at the heart of clusters.
Border Points: A data point is considered a border point if it is within ε distance of a core point but does not have enough core points in its own neighborhood. Border points are on the edge of clusters.
Noise Points: Data points that are neither core nor border points are classified as noise points.
Cluster Formation:
The algorithm starts by selecting an arbitrary data point. If that point is a core point, it creates a new cluster and adds all the core points in its ε-neighborhood to the cluster. This process continues until no more core points can be added to the cluster. Then, the algorithm selects another unvisited point and repeats the process.

Border Points:
Border points can be part of multiple clusters if they are within the ε-distance of multiple core points. They are assigned to the cluster of the first core point they encounter.

Noise Points:
Noise points are data points that are not part of any cluster.

Advantages:

DBSCAN can identify clusters of various shapes and sizes.
It doesn’t require specifying the number of clusters in advance.
It can handle noisy data effectively by classifying outliers as noise points.
Parameters:
The key parameters in DBSCAN are the ε (epsilon) distance threshold and the minPts value. These need to be chosen carefully, as they determine the cluster formation. Tuning these parameters can be a bit challenging in some cases.

Limitations:

DBSCAN may struggle with datasets of varying densities.
It is sensitive to the order in which data points are processed.
Determining the appropriate values for ε and minPts can be tricky.
In summary, DBSCAN is a powerful clustering algorithm that can automatically identify clusters in data based on the density of data points. It is widely used in various fields, such as geospatial analysis, image processing, and anomaly detection.
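
As a quick illustration of these ideas, here is a minimal sketch using scikit-learn's DBSCAN on synthetic crescent-shaped data; the dataset and the eps/min_samples values are illustrative, not tuned for the project data. Points labeled -1 correspond to the noise category described above.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two crescent-shaped clusters with some noise (non-convex shapes suit DBSCAN)
X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)
X = StandardScaler().fit_transform(X)

# eps is the neighborhood radius (epsilon), min_samples corresponds to minPts
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")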

Different methods we can use for missing data, since the fatal shooting data has a lot of missing values

Handling missing data is a crucial step in data preprocessing and analysis. There are several methods to deal with missing data, each with its own advantages and disadvantages.

Removal of Missing Data:

Listwise Deletion (Complete Case Analysis): In this method, you simply remove any rows or observations that contain missing values. This is a straightforward approach but can result in a loss of valuable data, especially if a large portion of your data is missing.
Imputation:
Imputation involves filling in missing values with estimated or predicted values. There are several techniques for imputing missing data:

Mean, Median, or Mode Imputation: Replace missing values with the mean, median, or mode of the available values in the variable. This is a simple method but may not be suitable for variables with a skewed distribution.

Constant Value Imputation: Replace missing values with a predetermined constant value, such as zero. This method is straightforward but may introduce bias if the missing data is not missing at random.

Regression Imputation: Use regression analysis to predict missing values based on the relationships between the variable with missing data and other relevant variables. This is a more sophisticated method but requires a strong correlation between variables.

K-Nearest Neighbors (KNN) Imputation: Replace missing values with values from the K-nearest neighbors in the dataset. This method considers the similarity between observations.

Multiple Imputation: This involves creating multiple datasets with imputed values and averaging the results to reduce imputation uncertainty. It’s a more advanced technique and is often preferred when dealing with complex missing data patterns.

Interpolation:
Interpolation methods are used for time-series or sequential data to estimate missing values based on the trend and patterns in the available data.

Linear Interpolation: Estimate missing values by creating a linear relationship between adjacent data points.

Time-Series Methods: Use time-series forecasting techniques like ARIMA or exponential smoothing to predict missing values.

Domain-Specific Methods:
Depending on the specific domain or type of data you’re working with, there may be custom methods for handling missing data. For example, in healthcare, there are specialized imputation methods for medical data.

Data Augmentation:
In machine learning, data augmentation techniques can be used to generate synthetic data points that are similar to the observed data. This can be particularly useful when dealing with image and text data.

Indicator Variables:
Create binary indicator variables to flag the presence or absence of missing data for each variable. This allows you to incorporate information about the missingness in your analysis.

Model-Based Methods:
Model-based imputation involves using machine learning models, such as decision trees or random forests, to predict missing values based on the relationships within the data.

Collect More Data:
In some cases, collecting more data can help reduce the impact of missing values. However, this may not always be feasible.

It’s important to note that the choice of method should be guided by the characteristics of the data and the goals of your analysis. Additionally, understanding the nature of the missing data (missing completely at random, missing at random, or missing not at random) is crucial in selecting the appropriate imputation technique. Multiple imputation and sensitivity analysis can be used to account for missing data mechanisms and assess the robustness of your results.
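
To make a few of these strategies concrete, here is a minimal sketch on a small, made-up DataFrame (the column names and values are purely illustrative): it flags missingness with an indicator variable, applies median and constant-value imputation, and uses scikit-learn's KNNImputer for the remaining numeric gaps.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy DataFrame with missing values (hypothetical data)
df = pd.DataFrame({
    "age": [23, np.nan, 41, 35, np.nan, 29],
    "income": [40_000, 52_000, np.nan, 61_000, 45_000, 38_000],
    "armed": ["gun", None, "knife", "gun", "unarmed", None],
})

# Indicator variable flagging missingness before any imputation
df["age_missing"] = df["age"].isna().astype(int)

# Median imputation for a numeric column
df["age"] = df["age"].fillna(df["age"].median())

# Constant-value imputation for a categorical column
df["armed"] = df["armed"].fillna("undetermined")

# KNN imputation for the remaining numeric gaps
knn = KNNImputer(n_neighbors=2)
df[["age", "income"]] = knn.fit_transform(df[["age", "income"]])

print(df)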

DBSCAN algorithm

DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm commonly used in data mining and machine learning. It’s designed to group together data points that are close to each other based on their density in the feature space. Here are the key components of DBSCAN:

Core Points: These are data points that have at least a specified number of data points (minPts) within a certain distance (eps) from them. Core points are at the heart of clusters.

Border Points: These points are within the epsilon distance of a core point but do not have enough neighbors to be considered core points themselves. They are on the fringes of clusters.

Noise Points: Data points that are neither core points nor border points are considered noise points. They don’t belong to any cluster.

The DBSCAN algorithm works as follows:

Randomly select an unvisited data point.

If it is a core point, create a new cluster and add the point to it. Then expand the cluster by adding all directly reachable core points and their neighbors.

Repeat steps 1 and 2 until all data points have been visited.

Any unvisited data points at this stage are classified as noise.

DBSCAN is effective at discovering clusters of arbitrary shapes and is robust to noise in the data. It doesn’t require specifying the number of clusters beforehand, which makes it a valuable tool for cluster analysis. However, setting the appropriate values for “eps” and “minPts” can be challenging, and the algorithm may not perform well in high-dimensional spaces due to the curse of dimensionality.
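
Because choosing eps is usually the hardest part, here is a minimal sketch of the common k-distance heuristic for picking it; the synthetic data, variable names, and minPts value are illustrative only.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Illustrative feature array (any prepared (n_samples, n_features) array works)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
min_pts = 5  # illustrative minPts

# Distance from each point to its k-th nearest neighbor (k = minPts)
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# The "elbow" of this sorted curve is a common heuristic for eps
plt.plot(k_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {min_pts}th nearest neighbor")
plt.title("k-distance plot for choosing eps")
plt.show()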

project report snippet

from scipy.stats import skew, kurtosis

# Calculate skewness and kurtosis for % DIABETIC
skewness_diabetic = skew(df_diabetic['% DIABETIC'])
kurtosis_diabetic = kurtosis(df_diabetic['% DIABETIC'])

# Calculate skewness and kurtosis for % OBESE
skewness_obese = skew(df_obese['% OBESE'])
kurtosis_obese = kurtosis(df_obese['% OBESE'])

# Calculate skewness and kurtosis for % INACTIVE
skewness_inactive = skew(df_inactive['% INACTIVE'])
kurtosis_inactive = kurtosis(df_inactive['% INACTIVE'])

# Print the results
print(f'Skewness of % DIABETIC: {skewness_diabetic:.2f}')
print(f'Kurtosis of % DIABETIC: {kurtosis_diabetic:.2f}')

print(f'Skewness of % OBESE: {skewness_obese:.2f}')
print(f'Kurtosis of % OBESE: {kurtosis_obese:.2f}')

print(f'Skewness of % INACTIVE: {skewness_inactive:.2f}')
print(f'Kurtosis of % INACTIVE: {kurtosis_inactive:.2f}')
results:
Skewness of % DIABETIC: 0.97
Kurtosis of % DIABETIC: 1.03
Skewness of % OBESE: -2.69
Kurtosis of % OBESE: 12.32
Skewness of % INACTIVE: -0.34
Kurtosis of % INACTIVE: -0.55
Explanation:

% DIABETIC:

Skewness: 0.97
Interpretation: A skewness of 0.97 indicates that the distribution of % DIABETIC is moderately right-skewed. In practical terms, there is a tail on the right side of the distribution, and most data points are concentrated towards the lower end of the scale.

Kurtosis: 1.03
Interpretation: Since scipy's kurtosis function reports excess kurtosis (a normal distribution scores 0), a value of 1.03 means the % DIABETIC data has somewhat heavier tails and a slightly sharper peak than a normal distribution, though it is still reasonably close to normal.

% OBESE:

Skewness: -2.69
Interpretation: A skewness of -2.69 indicates that the distribution of % OBESE is heavily left-skewed. There is a long tail on the left side of the distribution: most data points are clustered towards the higher end of the scale, with a smaller number of unusually low values pulling the tail to the left.

Kurtosis: 12.32
Interpretation: An excess kurtosis of 12.32 means the % OBESE data has very heavy tails and a pronounced peak. The distribution is strongly leptokurtic, with more outliers or extreme values than a normal distribution.

% INACTIVE:

Skewness: -0.34
Interpretation: A skewness of -0.34 suggests that the distribution of % INACTIVE is slightly left-skewed. There is a slight tail on the left side of the distribution, with the majority of data points concentrated towards the higher end of the scale.

Kurtosis: -0.55
Interpretation: An excess kurtosis of -0.55 indicates that the % INACTIVE data has lighter tails and a flatter peak than a normal distribution, with relatively few outliers.

In summary, the skewness and kurtosis values describe the shape of each distribution: % DIABETIC is moderately right-skewed and fairly close to normal; % OBESE is heavily left-skewed with very heavy tails and a pronounced peak; and % INACTIVE is slightly left-skewed, flatter, and less peaked than normal.

More about the bias-variance tradeoff

Why is there a bias-variance tradeoff?

If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it is likely to have high variance and low bias. So we need to find the right balance, without overfitting or underfitting the data.

This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.

Total Error

To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.

Total Error = Bias^2 + Variance + Irreducible Error
An optimal balance of bias and variance would never overfit or underfit the model.

Therefore understanding bias and variance is critical for understanding the behavior of prediction models.
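
A minimal sketch of the tradeoff in action, using polynomial regression on noisy synthetic data (the degrees and noise level are illustrative): a very low degree underfits (high bias), a very high degree overfits (high variance), and a moderate degree tends to minimize the cross-validated error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Noisy sine data: the added noise plays the role of the irreducible error
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)

# Compare underfitting, a balanced fit, and overfitting
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: cross-validated MSE = {mse:.3f}")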