#separator:tab
#html:true
#tags column:3
classification: Natural Breaks When natural clusters in data are present
Minimizes variance within classes, maximizes variance between classes	Week01
classification: Equal Interval "You tell # of classes
-: distribution of values not spread equally
+: evenly distributed data (on a histogram)
+: comparing data sets" Week01
classification: Defined interval You tell the width of classes
-: distribution of values also not spread equally across all classes
+: comparing data sets Week01
classification: Quantile Equal number of observations per class
+: symmetrical (normally distributed) data or mild/moderate skew (on histogram)
+: top/bottom percentile of values
-: doesn't consider natural gaps Week01
classification: Standard deviation divides data, categorizes into intervals of standard deviations above/below mean
+: to highlight how far values deviate from average
-: shows as z-scores
-: hard to understand Week01
classification: Geometric class Multiplicatively vary class widths
when data has a highly skewed distribution Week01
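The break-setting logic behind two of the classification methods above can be sketched in a few lines of Python (a toy illustration, not the GIS tools' implementations; the function names are my own):

```python
# Equal Interval vs Quantile classification breaks on a flat list of numbers.

def equal_interval_breaks(values, n_classes):
    """Split the data range into n_classes bins of equal width ("you tell # of classes")."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    # Upper bound of each class
    return [lo + width * i for i in range(1, n_classes + 1)]

def quantile_breaks(values, n_classes):
    """Put an (approximately) equal number of observations in each class."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (n * i) // n_classes)]
            for i in range(1, n_classes + 1)]

data = [1, 2, 2, 3, 4, 10, 20, 40, 80, 100]   # skewed sample
print(equal_interval_breaks(data, 4))   # [25.75, 50.5, 75.25, 100.0]
print(quantile_breaks(data, 4))         # [2, 10, 40, 100]
```

On this skewed sample, equal interval piles most observations into the first class, while quantile keeps class counts even, matching the +/- notes on the cards.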
central tendency: mode - categorical data
- highest frequency value
central tendency: Qualitative ordinal data, use Median
central tendency: Mean "
- ""typical"" score, best for normally distributed data
- outliers have strong influence
- ex bimodal data /\_......_/\, mean not useful
- ex location, point data only for LA, NYC, mean not useful
- ex calculated value may not be in dataset
"
central tendency: Weighted mean center	pulls the mean centre towards features with higher weights
central tendency: Median center location representing the shortest total distance to all other features
more robust to outliers
central tendency: Central feature Chooses an *existing feature* in dataset that has shortest total distance to ALL other features
dispersion: normally distributed data ~68% within 1 std deviation from mean,
~95% within 2,
~99.7% within 3
dispersion: Standard distance	square root of the average squared distance of each feature from the mean center
then use that distance as a radius centered on mean center
(Spatial equiv of std deviation)
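As a rough sketch of the two spatial measures above (planar coordinates assumed; `mean_center` and `standard_distance` are illustrative names, not a GIS tool's API):

```python
# Mean center = spatial average; standard distance = root-mean-square
# distance of features from that center (spatial equivalent of std deviation).
import math

def mean_center(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def standard_distance(points):
    cx, cy = mean_center(points)
    n = len(points)
    # sqrt of the mean squared deviation in x and y combined
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

points = [(0, 0), (0, 2), (2, 0), (2, 2)]
print(mean_center(points))        # (1.0, 1.0)
print(standard_distance(points))  # sqrt(2), usable as a radius on the map
```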
Standard deviational ellipse	Like standard distance,
but calculates the standard deviation of the x and y coordinates separately relative to the mean center (captures directional trend)
Dispersion How spread out/ compact a dataset is around its location of central tendency
Central tendency "Single location to summarize a set of locations, ""typical""/ ""average""/ most representative"
Defining neighbours: Number of neighbours nbh defined by a specified number of features closest to the focal feature
distances can vary depending on density of features
Defining neighbours: Fixed distance all features that fall within specified distance of focal features
num of neighbours depends on density of features in area
Defining neighbours: Network distance Travel routes around focal feature
Fixed distance (no bridge) vs realistic
Contiguity: Raster/Vector polygons vs points Only to polygons, because points have no edges
Delaunay triangulation Contiguity but for points
Generates Thiessen polygons on the points, then uses the contiguity edges corners method
Defining neighbours: Contiguity edges (Rook) shared border w/ focal feature considered a neighbour
(directly next to each other)
Defining neighbours: Contiguity edges corners (Queen) border or corner shared feature considered neighbour
Spatial analysis issues - MAUP, neighbourhood definition
- Boundary problem
- Spatial sampling
- Tobler
Cluster analysis finding areas with unexpectedly high values, or finding groups of features with similar characteristics/values/locations, or finding point patterns in the landscape
GIS ML tool types: - prediction
- classification
- clustering
density-based clustering grouping of observations based on feature locations
Tobler's first law of geog Everything is related to everything else,
but near things are more related than distant things
Modifiable Areal Unit Problem (MAUP) Combined effects of scale and aggregation
how things are zoned AND the scale of geographic unit (then aggregated) can change outcomes
explore alternate zoning effects and hierarchical models
Spatial weights matrix File that quantifies spatial relationships (neighbourhood/s) among a set of features
Traditional statistical tests can often be applied to spatial data, BUT... doesn't account for Tobler's first law
and other spatial relationships
also, datasets w/ very different distributions can produce the same summary statistics
central tendency: Median """Middle"" value, often better choice for skewed data
- common for sociodemographic/ socioeconomic data
- exact centre of distribution
"
central tendency: What do outliers do?	Pull the mean strongly toward them; the median is more robust
Dispersion: range	difference between min and max values in a distribution
Heavily impacted by outliers (frequency of values not considered)
Dispersion: Standard deviation	sqrt(variance)
variance: sum of (observation - mean)^2, all divided by n
z-scores (Standard score) standard deviations above/below mean
( observation - mean ) / stddev
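The variance, standard deviation, and z-score formulas from the cards above, as a small self-contained Python check (population variance, i.e. dividing by n):

```python
# variance = mean squared deviation from the mean; z = (x - mean) / stddev
import math

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def z_scores(xs):
    m = sum(xs) / len(xs)
    sd = math.sqrt(variance(xs))
    return [(x - m) / sd for x in xs]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5
print(variance(data))     # 4.0, so stddev = 2
print(z_scores(data)[0])  # (2 - 5) / 2 = -1.5, i.e. 1.5 std devs below the mean
```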
central tendency: mean centre	spatial average (mean of x coords, mean of y coords)
AI ability of a machine to perform tasks traditionally requiring human intelligence
Machine learning set of tools, algos, and techniques to allow computers to learn patterns in data and acquire info w/o human explicitly programming the process
Deep learning Using trainable algos in the form of artificial neural networks (inspired by how human brain works)
DBSCAN	"min # of features to be considered a cluster, plus a max search distance (FIXED SEARCH DISTANCE); finds clusters of similar densities
- fastest computationally
"
HDBSCAN uses series of nested clusters and chooses levels that create stable clusters having as many members as possible
- can find clusters of varying densities
- most data-driven
OPTICS Uses reachability plot for distances between neighbours, peaks = big spatial jump, separates clusters
- ex 2 peaks in a row: noise point
- can adjust sensitivity
- most computationally-intensive
DBSCAN vs HDBSCAN DBSCAN struggles w/ different densities unlike HDBSCAN, where search distances can vary
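A toy, pure-Python DBSCAN showing how its two parameters interact (a simplified sketch, not the ArcGIS tool or scikit-learn; `eps` is the fixed search distance, `min_samples` the minimum features per cluster, `-1` marks noise):

```python
# Minimal DBSCAN: grow a cluster outward from each dense ("core") point.
import math

def dbscan(points, eps, min_samples):
    NOISE = -1
    labels = [None] * len(points)
    cluster = 0

    def neighbors(i):
        # All points within the fixed search distance (incl. the point itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE           # not dense enough (may be reclaimed later)
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # reclaim noise as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                seeds.extend(j_nbrs)    # core point: keep expanding the cluster
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2, min_samples=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Because `eps` is fixed, both clusters here must have similar density; HDBSCAN removes that limitation by considering many distance levels.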
multivariate clustering GROUPING of observations based on feature attributes
trad vs. spatially constrained multivariate clustering traditional:
- not explicitly spatial, but can be applied to spatial problems/ data
- Features in group more alike, but groups may not be spatially contiguous
SC:
- Explicitly incorporate geography
- AKA spatially contiguous groups but maybe less alike
MC: k-means Finds k groups in data based on feature attributes
- think number of variables, plotted on a n-D space, find clusters from there
- boxplot + map (but not inherently spatial)
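A minimal k-means sketch in 2-attribute space, matching the "plot on n-D space, find clusters" idea (fixed starting centroids for reproducibility; real tools use smarter seeding such as k-means++):

```python
# Toy k-means: alternate between assigning points to the nearest centroid
# and moving each centroid to the mean of its assigned points.
import math

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: nearest centroid in attribute space
        groups = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            groups[k].append(p)
        # Update step: centroid = mean of its group (keep old one if empty)
        centroids = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
cents, groups = kmeans(pts, centroids=[(0, 0), (10, 10)])
print(cents)   # two centroids, one per attribute-space cluster
```

Note the groups are alike in attribute values only; nothing forces them to be spatially contiguous, which is the "traditional" column of the card above.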
SC-MvC: Minimum spanning tree features laid out in data / 'spatial' space, connected based on how far they are based on location and attribute values
links are THEN broken in ways that keep clusters as distinct as possible
complete spatial randomness reference spatial distribution, simulates random pattern
you can compare observations to CSR
negative spatial autocorrelation similar values scattered across space, things closer together likely to have diff values
underlying process causes REGULAR DISPERSION
random pattern mix of clustering and dispersion
positive spatial autocorrelation similar values clustered in space
things closer together likely to have similar values
underlying process leads to clustering
p-value probability that the pattern seen is the result of a random process
(low is high certainty that it's not)
Spatial autocorrelation: TEST TYPES - Global: whole study area, generalizes as summary statistic, NO MAP
- Local: the global statistic calculated for each feature relative to its neighbourhood (a subset of the study area)
- Scan Statistics: search multiple subsets, return where clustering/dispersal
Global Moran's I Clustering, dispersion, both?
Clustered +ve
Dispersal -ve
z-score: high/low suggests pattern unlikely to be random
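Global Moran's I can be hand-rolled to see how the summary statistic behaves (binary weights matrix, toy 4-area example, no significance testing; a sketch, not the ArcGIS implementation):

```python
# Global Moran's I: (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
# w is a binary spatial weights matrix (1 = neighbours).

def morans_i(values, w):
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    W = sum(sum(row) for row in w)
    return (n / W) * (num / den)

# Four areas in a line; neighbours = adjacent areas
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(morans_i([1, 1, 10, 10], w))   # positive: similar values cluster
print(morans_i([1, 10, 1, 10], w))   # negative: alternating values, dispersed
```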
Getis-Ord G Clustering?
G: Spatial density; high values indicate high-value clustering, low values indicate low-value clustering
Getis-Ord Local Gi* (Scan statistic) ID Hot and cold spots (clusters) relative to mean ACROSS study area at different confidence levels (p-value)
is local nbh average (z-score) significantly different from global average?
low-value features can still be included in a hot spot
Anselin Local Moran's I (LISA) (Scan statistic) Clusters: high-high, low-low
Outliers: high-low, low-high (a feature doesn't match w/ rest of nbh)
Is nbh average significantly different from global average, and
is feature value significantly different from nbh average?
Bivariate Moran's I Clusters across 2 variables
high-high, low-low, high-low, low-high clusters
better than comparing visually
linear mean: orientation vs direction orientation: angle only
direction: average length and angle
Spatiotemporal pattern mining Finding patterns in space and through time by analyzing snapshots of data
Problem w/ thematic/ hotspot map for each period of time Effectiveness of visual analysis decreases as volume of data increases
- difficult to visually quantify trends/change
- assumes each time period is completely independent (Jan doesn't affect Feb)
Time-series analysis looking at how spatial patterns have changed over time
Space-time cube 3D cube
x,y: location grid or polygon aggregation
z: time
Space-time cube: Bin individual unit: unique spatiotemporal extent
Space-time cube: location column of bins: same location, different temporal extents
Space-time cube: Time Slice row of bins that share the same temporal extent (imagine off the top: most recent)
standalone analysis: 1 slice
Space-time cube: Aggregation Polygons: probably should standardize
Bins (same size: no need to standardize):
- fishnet / square grid: quicker than hexagon
- hexagon: more edges = more neighbours, closest approximation to circle that fits nicely
Space-time cube: Modifiable temporal unit problem Similar to MAUP but over time: How we aggregate time can affect our analysis
Data aggregated to months results != years
Consider: seasonality of data, but also more broad patterns
Also: Feb (fewer days = fewer car crashes)
Ensure aggregation divides evenly among data or chop off the oldest bit
Emerging hot spot analysis Getis-Ord Gi*: Spatial + temporal (+-1) neighbours, compares to whole study area for hot/cold spots
Difference: z-score, Significance: p-value
Results:
3D: clustering/significance of clustering at each bin over time for entire space-time cube
2D: Top is summarized
17 categories (8+8+1): consider combining/removing categories depending on audience
can also examine a category (why is there oscillation?)
Local Outlier Analysis "Extension of Anselin Local Moran's I: IDs clusters and local outliers in space + time (not as detailed as emerging hotspot analysis)
3D entire output: bin value -- neighbourhood value
2D summary output: no indication of changes to significance of clusters/outliers
- ""only"": has only ever been that
- multiple types, never significant"
Inferential Statistics: categories Parametric statistics: key assumption: data matches some theoretical distribution (ex normal curve)
Non-parametric: no key assumption
Pearson correlation: def Number representing linear correlation between 2 sets of data
Pearson correlation: assumptions - Variables are either interval or ratio (numeric w/ consistent steps, interval: 0 also meaningful)
- variables approx. a normal distribution
- Linear relationship between variables
- No extreme outliers in either dataset
Pearson correlation: outputs	Correlation coefficient (-1 to +1)
- 0: none, ±1: perfectly linear
- 0.01-0.30: Weak
- 0.31-0.70: Moderate
- 0.71 - 1.0: Strong
p-value: * = 0.05, **=0.01, ***=0.001 (GeoDa)
R^2: coefficient of determination	how much variation is explained by the model
(proportion of variance in the dependent var accounted for by the indep var(s))
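Pearson's r and R² from the cards above, computed directly (an illustrative function, not GeoDa; population-style means, toy data):

```python
# r = covariance(x, y) / (stddev(x) * stddev(y)); R^2 = r squared.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]           # perfectly linear: y = 2x
print(pearson_r(xs, ys))        # 1.0 ("strong" end of the card's scale)
print(pearson_r(xs, ys) ** 2)   # R^2 = 1.0: all variation explained
```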
model: fitting data	"- Underfitting: model too general, doesn't capture relationships in training data, can't generate useful predictions
- Overfitting: model too complex, fits the training data too closely, so it can't generalize or predict new data
"
Ordinary least squares regression (non-spatial)	"- linear regression model (eqn)
- cause/effect unidirectional
- y = intercept + coefficient*x_var + ... + residual
"
OLS: Assumptions - linear relationship between variables (Pearsons)
- Relationship is homoscedastic (no skew on scatterplot of both)
- No multicollinearity (no correlation between independent variables), less applicable when only 2. Why? Affects model reliability, risk of over/underfitting. Test: use Pearson's correlation for each pair
- From output: Residuals are normally distributed
- From output: Residuals are not spatially autocorrelated (run global Moran's I)
OLS Residuals "Spatial autocorrelation, spatial non-stationarity: map OLS Residuals
"
Spatial stationarity	property of a process being the same throughout space
(a change in one variable will result in a consistent change in another variable no matter where you are in study area)
spatial non-stationarity large clusters of over/under-predictions
indicator that a process operates differently in different parts of study area
Spatial dependency error terms/ variables somehow correlated
relationship between IV/DV not independent at each location
(violates traditional stats, see Tobler)
GWR Geographically weighted regression	OLS is global (one equation for the whole study area); GWR creates a regression equation for each feature
benefit: accounts for Tobler
gives mappable output (local coefficients)
GWR Considerations - Multicollinearity
- because coefficients are weighted, this can be hard to interpret
- criticized by traditional statisticians
- OLS: baseline, GWR tends to overfit
AICc	used to compare quality of a set of models; smaller is better out of the set
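For least-squares models, AICc is commonly computed as AIC plus a small-sample correction term (a formula sketch; assumes the RSS-based AIC form, with n observations, k parameters, and residual sum of squares RSS):

```python
# AICc = AIC + 2k(k+1)/(n-k-1); for least squares, AIC = n*ln(RSS/n) + 2k.
# Smaller is better only when comparing models of the same data.
import math

def aicc(n, k, rss):
    aic = n * math.log(rss / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# Same data, same # of parameters: the model with lower RSS wins
print(aicc(30, 3, 10) < aicc(30, 3, 20))   # True
```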
OLS Additional assumptions - multicollinearity condition number: concern if above 20
(won't say which pair, use pearson)
- Jarque-Bera: normality of errors: p < 0.05 indicates a lack of normality
- Breusch-Pagan, Koenker-Bassett: heteroskedasticity: p < 0.05 indicates unequal variance, may be a sign of spatial dependence in data
SLM lag: adjacent values are systematically related (autocorrelated), produces model bias
dependent variable in a location is affected by the dependent variable in neighbouring locations; this diffusion causes spatial lag
Lag coefficient (rho): degree of spatial dependence in data
- if likelihood ratio test still significant, still issue of spatial dependence
SEM nuisance (error): problem w/ model residuals only, not variables. correct by including a spatial error term in the model
- lag coefficient lambda: degree of spatial dependence in the residuals
- still spatial dependence (likelihood ratio test)