The most robust publicly available crime dataset in the US is the annual Uniform Crime Reporting (UCR) Program data produced by the FBI. The UCR Program is a nationwide, cooperative statistical effort in which more than 18,000 city, university and college, county, state, tribal, and federal law enforcement agencies voluntarily report data on crimes brought to their attention. The UCR administrative records are the most nationally representative small-area crime data that are publicly available. Limitations of the UCR data include possible coverage bias in both the number of agencies reporting and the number of crimes reported.
The CrimeIndex (US) methodology has evolved to include crime incident data collated from law enforcement agencies (referred to in this document as incident data). Open data sources now cover 40 agencies. With these granular data, a foundational database of crime incident data was built to gain further insight into crime patterns in urban areas. Please refer to the Notes section for information on making comparisons with previous versions of CrimeIndex (US).
Precisely data scientists combined the UCR data, incident data, and Precisely Location Intelligence data and employed a multi-level statistical model to estimate crime rates by crime type at the block group level.
First, the UCR data were extracted from the FBI website and the incident data were sourced from multiple agencies. Second, the names associated with the UCR crime reporting entities were matched to Precisely’s inventory of geographic data using exact and probabilistic matching techniques. Disparate incident datasets were analyzed and combined, and incident descriptions were matched to UCR crime types using multiple string-matching techniques. Once matched, the incident data were aggregated to the block group level. Outlier analysis of both the UCR and incident datasets was carried out to detect and remove erroneous data.
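The matching and aggregation steps above can be illustrated with a minimal sketch. It is not the production method: the list of crime types is a hypothetical subset, the similarity cutoff of 0.6 is an arbitrary assumption, and Python's standard `difflib` module stands in for the unspecified string-matching techniques.

```python
import difflib
from collections import Counter

# Hypothetical subset of UCR crime types, used for illustration only.
UCR_TYPES = [
    "aggravated assault",
    "burglary",
    "larceny-theft",
    "motor vehicle theft",
    "robbery",
]


def match_ucr_type(description, cutoff=0.6):
    """Return the closest UCR crime type for a free-text description, or None."""
    cleaned = description.strip().lower()
    if cleaned in UCR_TYPES:  # exact match first
        return cleaned
    # Fall back to approximate matching; the cutoff value is an assumption.
    candidates = difflib.get_close_matches(cleaned, UCR_TYPES, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


def aggregate_by_block_group(incidents):
    """Count matched incidents per (block_group_id, crime_type) pair.

    `incidents` is an iterable of (block_group_id, description) tuples.
    Descriptions that match no crime type are skipped.
    """
    counts = Counter()
    for block_group_id, description in incidents:
        crime_type = match_ucr_type(description)
        if crime_type is not None:
            counts[(block_group_id, crime_type)] += 1
    return counts
```

In this sketch a misspelled description such as "burglry" is still counted as a burglary, while a description that resembles no crime type is dropped rather than forced into a category.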
After these data were linked to Precisely’s Location Intelligence data and erroneous data were removed, crime rates per capita were calculated. Next, statistical techniques were used to impute crime rates for a small number of areas of the USA where UCR crime statistics were unavailable.
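As a rough illustration of the rate calculation and imputation described above, the sketch below computes a rate per 1,000 residents and fills missing areas with the median of the observed rates. Both the per-1,000 scaling and the median fill are assumptions chosen for simplicity; the source does not state which imputation technique was used.

```python
from statistics import median


def rate_per_1000(crime_count, population):
    """Crimes per 1,000 residents, or None when the population is not positive."""
    if population <= 0:
        return None
    return 1000.0 * crime_count / population


def impute_missing(rates):
    """Fill missing (None) rates with the median of the observed rates.

    `rates` maps an area identifier to a rate or None. The median fill is a
    placeholder for whatever statistical technique is actually applied.
    """
    observed = [r for r in rates.values() if r is not None]
    if not observed:
        return dict(rates)
    fill = median(observed)
    return {area: (fill if r is None else r) for area, r in rates.items()}
```

A median fill is robust to extreme values, which matters because crime rates tend to be strongly skewed.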
The prepared data were next combined with Precisely proprietary datasets. These datasets were used to understand and predict the relationship between crime (using UCR aggregate data and incident data) and geodemographic location data. A series of regression models were developed to determine the most relevant features for different crime types and different geographies. These models were then deployed at the block group level to predict where crime rates are likely to be higher or lower.
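The idea of fitting a model on areas with known crime rates and then applying it to block groups can be sketched with a one-feature least-squares fit. This is only a toy example: the real models use multiple proprietary features and are not necessarily simple linear regressions, and the function names here are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("feature has no variation")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted crime rate for a block group with feature value x."""
    return intercept + slope * x
```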
No Census-based race, ethnicity, gender or language datasets were used to develop the CrimeIndex (US).
The final estimates were derived by combining the macro-level UCR crime statistics with the regression models built on both the UCR and incident data. The final composite crime score is a linear combination of the state-specific crime distributions by crime type from the UCR data and the combined macro and regression model outputs. Because crime rates are not normally distributed, the qualitative categories were derived from the final quantitative scores using the percentile distributions above and below the national average. State-level indices have also been calculated so that users can compare block groups within a state.
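The sketch below shows one way a linear combination and percentile-based categories could be computed. The weight, the four category labels, and the 50th-percentile split on each side of the national average are all assumptions made for illustration; the actual weights and category boundaries are not given in this document.

```python
from bisect import bisect_left


def composite_score(macro_rate, model_rate, weight=0.5):
    """Linear combination of a macro-level rate and a model-based rate."""
    return weight * macro_rate + (1.0 - weight) * model_rate


def percentile_rank(sorted_values, value):
    """Fraction of values in `sorted_values` that are strictly below `value`."""
    return bisect_left(sorted_values, value) / len(sorted_values)


def categorize(scores, national_average):
    """Label each score using separate percentile distributions for the
    scores below and the scores at or above the national average."""
    below = sorted(s for s in scores if s < national_average)
    above = sorted(s for s in scores if s >= national_average)
    labels = []
    for s in scores:
        if s < national_average:
            low_half = percentile_rank(below, s) < 0.5
            labels.append("lowest" if low_half else "below average")
        else:
            low_half = percentile_rank(above, s) < 0.5
            labels.append("above average" if low_half else "highest")
    return labels
```

Ranking the two sides separately keeps a long right tail of very high scores from compressing the categories assigned to below-average areas.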