Project Overview
The goal of this project was to analyze crash data in New York from September 1, 2019, to August 31, 2021, for four vehicle brands—Nissan, BMW, Audi, and Subaru. The objective was to identify trends over time, examine seasonal patterns, and analyze differences in accident frequency across vehicle types. By investigating these trends, we aimed to understand potential factors contributing to variations in crash occurrences.
Tools and Libraries Used
To achieve this, we used the following libraries:
Pandas: For data cleaning, manipulation, and analysis
Regular Expressions (re): For pattern matching to clean brand names and manage data discrepancies
Matplotlib and Seaborn: For data visualization, enabling clear graphical representation of trends
Python Data Structures (Dictionaries, Lists, DataFrames): For efficient data storage, retrieval, and manipulation
Methodology
Data Cleaning and Preprocessing
Whitespace and Name Matching: Brand names were inconsistently formatted, most often with trailing spaces after the name (e.g., "BMW "). Using regular expressions and data-cleaning functions, we standardized the brand names so that every record was assigned to the correct brand.
Dataframe Structuring: Because each query needed its own subset of the data, we split the records into one DataFrame per vehicle brand and used a consistent naming convention for easy reference and manipulation.
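The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column name brand, the helper normalize_brand, and the sample rows are assumptions, since the dataset's schema is not shown in this report.

```python
import re

import pandas as pd

# Hypothetical column name and sample rows (the real schema is not shown).
raw = pd.DataFrame({"brand": ["BMW ", " nissan", "Audi", "SUBARU  ", "bmw", "Toyota"]})

BRANDS = ["Nissan", "BMW", "Audi", "Subaru"]


def normalize_brand(value: str) -> str:
    """Collapse runs of whitespace, trim the ends, and upper-case the name."""
    return re.sub(r"\s+", " ", value).strip().upper()


raw["brand_clean"] = raw["brand"].map(normalize_brand)

# One DataFrame per brand, stored in a dictionary keyed by the display name.
# Rows for brands outside the study (here "Toyota") are simply left out.
frames = {
    name: raw[raw["brand_clean"] == name.upper()].reset_index(drop=True)
    for name in BRANDS
}
```

Keeping the per-brand frames in a dictionary, rather than in separately named variables, is one way to get a predictable naming convention: frames["BMW"] always refers to the cleaned BMW subset.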
Query-Specific Analysis
We conducted analyses based on the following queries:
Query 1: Monthly Average Crash Counts per Brand
We computed the average monthly crash count for each brand and compared it across years. The averages showed a marked increase in crashes for all brands in 2020, despite pandemic restrictions. Nissan had the highest average monthly crash count, while BMW and Subaru had comparatively fewer crashes. Advanced safety features are one possible explanation for the gap, but the crash data alone does not establish it.
Query 2: Yearly Trend Analysis (2019-2021)
We analyzed crash trends by brand across each year, revealing:
A notable drop in crashes in April 2020, coinciding with COVID-19 lockdown measures.
Seasonal spikes in July, August, and October, which are traditionally accident-prone months.
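The monthly counts behind Queries 1 and 2 can be sketched as below. The column names brand and crash_date and the toy rows are assumptions made for illustration; only the study window (September 1, 2019 to August 31, 2021) comes from the project description.

```python
import pandas as pd

# Hypothetical columns and toy rows; the last row falls outside the study window.
crashes = pd.DataFrame({
    "brand": ["Nissan", "Nissan", "Nissan", "BMW", "BMW", "Nissan", "BMW"],
    "crash_date": pd.to_datetime([
        "2019-09-03", "2019-09-17", "2019-10-05",
        "2019-09-09", "2020-04-12", "2020-04-20", "2021-09-02",
    ]),
})

# Keep only crashes inside the study window (both endpoints inclusive).
start, end = pd.Timestamp("2019-09-01"), pd.Timestamp("2021-08-31")
crashes = crashes[crashes["crash_date"].between(start, end)].assign(
    year=lambda d: d["crash_date"].dt.year,
    month=lambda d: d["crash_date"].dt.month,
)

# Number of crashes per brand for each calendar month.
monthly = (
    crashes.groupby(["brand", "year", "month"])
    .size()
    .rename("crashes")
    .reset_index()
)

# Query 1: average monthly count per brand and year.
# Only months with at least one recorded crash enter the average.
monthly_avg = monthly.groupby(["brand", "year"])["crashes"].mean()

# Query 2: month-by-year table for one brand. Months with no rows,
# including those outside the study window, show up as NaN.
nissan = monthly[monthly["brand"] == "Nissan"].pivot(
    index="month", columns="year", values="crashes"
)
```

Because 2019 and 2021 are partial years in this window, the NaN cells in the pivoted table mark months without data, which should not be read as months with zero crashes.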
Query 3: Crash Distribution by Vehicle Type
A pie chart of crashes by vehicle type showed that Sedans and SUVs accounted for the majority of crashes, with Sedans comprising 48.1% and SUVs 37.9% of incidents. Visibility and handling differences between vehicle types are a likely contributor, although the chart itself shows only each type's share of crashes.
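Percentage shares like these can be computed directly from the vehicle-type column before plotting. The sketch below assumes a column named vehicle_type and uses made-up rows, so its numbers are not the project's results.

```python
import pandas as pd

# Hypothetical column name and toy data (not the project's records).
vehicles = pd.Series(
    ["Sedan", "SUV", "Sedan", "Pickup", "SUV", "Sedan", "Sedan", "SUV"],
    name="vehicle_type",
)

# Share of crashes per vehicle type, as a percentage rounded to one decimal.
shares = vehicles.value_counts(normalize=True).mul(100).round(1)
```

With Matplotlib installed, shares.plot.pie(autopct="%.1f%%") would draw the corresponding pie chart with the percentages printed on each slice.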
Visualization and Trend Analysis
We visualized trends using line graphs for each brand across 2019, 2020, and 2021, highlighting recall events and seasonal accident patterns.
A pie chart illustrated the proportional distribution of crashes by vehicle type, showcasing differences in accident susceptibility among vehicle categories.
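A per-year line graph of this kind might be produced as follows. The function plot_yearly_trends, the layout (one panel per year), and the numbers are all assumptions for illustration; the report does not show the plotting code that was actually used.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

nan = float("nan")

# Hypothetical month-by-year crash counts for one brand.
# NaN marks months that fall outside the study window.
counts = pd.DataFrame(
    {2019: [nan, nan, 5, 7], 2020: [2, 6, 6, 8], 2021: [3, 4, nan, nan]},
    index=[4, 7, 9, 10],
)


def plot_yearly_trends(counts: pd.DataFrame, brand: str):
    """Draw one panel per year, with calendar month on the x-axis."""
    fig, axes = plt.subplots(
        1, len(counts.columns), figsize=(10, 3), sharey=True, squeeze=False
    )
    for ax, year in zip(axes[0], counts.columns):
        ax.plot(counts.index, counts[year], marker="o")
        ax.set_title(f"{brand} {year}")
        ax.set_xlabel("Month")
        ax.set_xticks(list(counts.index))
    axes[0][0].set_ylabel("Crashes")
    fig.tight_layout()
    return fig


fig = plot_yearly_trends(counts, "Nissan")
```

Calling fig.savefig("nissan_by_year.png") would write the figure to disk. Sharing the y-axis across panels keeps the years visually comparable even though 2019 and 2021 cover only part of the calendar.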
Insights and Conclusions
Brand-Specific Crash Trends: Nissan recorded the highest number of crashes, which we attribute partly to various recalls and to how commonly the brand is driven. In contrast, Subaru, BMW, and Audi, which are known for their safety features, showed lower accident frequencies.
Impact of Pandemic and Seasonal Patterns: Despite fewer cars on the road in 2020, accident rates were elevated, likely influenced by seasonal factors and increased traffic during the summer months.
Vehicle Type Analysis: Sedans were involved in the highest proportion of crashes, a finding that underscores the safety challenges of smaller vehicles compared to SUVs and trucks in urban traffic.
Challenges Encountered
Whitespace in Brand Names: Required custom regular expression handling to identify and standardize entries.
Data Consistency Across Years: To accommodate a non-continuous timeline (partial years), we plotted separate graphs for each year, facilitating clearer comparisons.
Handling Groupby and Count Functions: We combined pandas groupby operations with counters to tally how often each category occurred, which gave us the counts needed for the trend analysis.
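One possible reading of the groupby-and-counter approach is sketched below: the same tally computed once with pandas and once with the standard library's Counter. The column names brand and vehicle_type and the rows are hypothetical.

```python
from collections import Counter

import pandas as pd

# Hypothetical columns and toy rows.
crashes = pd.DataFrame({
    "brand": ["Nissan", "BMW", "Nissan", "Audi", "Nissan", "BMW"],
    "vehicle_type": ["Sedan", "Sedan", "SUV", "SUV", "Sedan", "SUV"],
})

# pandas: number of rows for each (brand, vehicle_type) pair.
by_group = crashes.groupby(["brand", "vehicle_type"]).size()

# Standard library: the same tally, keyed by (brand, vehicle_type) tuples.
by_counter = Counter(zip(crashes["brand"], crashes["vehicle_type"]))
```

Both structures hold identical counts, so comparing them is a cheap consistency check; the groupby result is the more convenient of the two for further pandas operations such as pivoting or plotting.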
Summary
This project provided valuable insights into accident trends in New York, highlighting how factors such as vehicle type, brand-specific safety features, and external events (e.g., recalls, the COVID-19 pandemic) influence crash rates. The analysis demonstrates the importance of clean, structured data and effective visualizations in identifying trends and communicating findings.



