#separator:tab
#html:true
classification: Natural Breaks When natural clusters in data are present
Minimizes variance within classes, maximizes variance between classes
classification: Equal Interval "You tell # of classes
-: values not spread evenly across the range (some classes end up nearly empty)
+: evenly distributed data (on a histogram)
+: comparing data sets"
classification: Defined interval You tell the width of classes
-: values are likewise not spread evenly across all classes
+: comparing data sets
classification: Quantile Equal number of observations per class
+: symmetrical (normally distributed) data or mild/moderate skew (on histogram)
+: top/bottom percentile of values
-: doesn't consider natural gaps
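Worked sketch of the Equal Interval and Quantile ideas, to show how the class breaks differ on skewed data. The function names, the rank-based quantile rule, and the sample values are assumptions for illustration, not the method of any particular GIS package.

```python
def equal_interval_breaks(values, k):
    """Upper bounds of k classes of equal width spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def quantile_breaks(values, k):
    """Upper bounds of k classes holding (roughly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Last observation of each class, using a simple rank-based rule (assumption).
    return [ordered[(n * i) // k - 1] for i in range(1, k + 1)]


# Hypothetical skewed sample: mostly small values plus one large outlier.
data = [1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 12, 100]
ei = equal_interval_breaks(data, 4)  # breaks stretched by the outlier
qt = quantile_breaks(data, 4)        # three observations per class
print(ei, qt)
```

With this sample, the equal-interval breaks put 11 of the 12 values in the first class, while the quantile breaks ignore the large gap before 100.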
classification: Standard deviation	Divides data into classes at intervals of standard deviations above/below the mean
+: to highlight how far values deviate from average
-: shows as z-scores
-: hard to understand
classification: Geometric class Multiplicatively vary class widths
when data has a highly skewed distribution
central tendency: mode - categorical data
- highest frequency value
central tendency: Qualitative ordinal data, use Median
central tendency: Mean "
- ""typical"" score, best for normally distributed data
- outliers have strong influence
- ex bimodal data /\_......_/\, mean not useful
- ex location, point data only for LA, NYC, mean not useful
- ex calculated value may not be in dataset
"
central tendency: Weighted mean center	pulls the mean centre towards features with higher weights
central tendency: Median center location representing the shortest total distance to all other features
more robust to outliers
central tendency: Central feature Chooses an *existing feature* in dataset that has shortest total distance to ALL other features
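Minimal sketch of the Central feature idea: pick the existing point with the smallest total distance to all other points. Straight-line distance and the sample coordinates are assumptions for illustration.

```python
import math


def central_feature(points):
    """Return the existing point with the smallest total distance to all points."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)


# Hypothetical points on a line, with one far outlier at x = 20.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (20, 0)]
centre = central_feature(pts)
mean_x = sum(x for x, _ in pts) / len(pts)
print(centre, mean_x)
```

Here the central feature is an actual point in the data, while the mean x coordinate is pulled towards the outlier and matches no feature.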
dispersion: normally distributed data ~68% within 1 std deviation from mean,
~95% within 2,
~99.7% within 3
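Quick check of the percentages above, assuming a perfectly normal distribution. The share within k standard deviations of the mean is erf(k / sqrt(2)); `share_within` is a hypothetical helper name.

```python
import math


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```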
dispersion: Standard distance	root-mean-square distance of features from the mean center (roughly their average distance)
then use that distance as a radius centered on mean center
(Spatial equiv of std deviation)
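Sketch of mean center and standard distance for unweighted points. It assumes the common root-mean-square formula (squared distances to the mean center, averaged over n, then square-rooted); the sample square is made up.

```python
import math


def mean_center(points):
    """Average x and average y of all points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def standard_distance(points):
    """Root-mean-square distance of the points from their mean center."""
    cx, cy = mean_center(points)
    n = len(points)
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)


# Hypothetical corners of a 4 x 4 square.
pts = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(mean_center(pts), standard_distance(pts))
```

The standard distance would then be used as the radius of a circle drawn around the mean center.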
Standard deviational ellipse Like standard distance,
but calculates the standard distance separately in the x and y directions from the mean center (shows directional trend)
Dispersion How spread out/ compact a dataset is around its location of central tendency
Central tendency "Single location to summarize a set of locations, ""typical""/ ""average""/ most representative"
Defining neighbours: Number of neighbours nbh defined by a specified number of features closest to the focal feature
distances can vary depending on density of features
Defining neighbours: Fixed distance	all features that fall within a specified distance of the focal feature
num of neighbours depends on density of features in area
Defining neighbours: Network distance Travel routes around focal feature
Fixed distance (no bridge) vs realistic
Contiguity: Raster/Vector polygons vs points	Applies only to polygons, because points have no edges
Delaunay triangulation Contiguity but for points
Generates Thiessen polygons on points, then uses the contiguity edges corners method
Defining neighbours: Contiguity edges (Rook) shared border w/ focal feature considered a neighbour
(directly next to each other)
Defining neighbours: Contiguity edges corners (Queen) border or corner shared feature considered neighbour
Spatial analysis issues - MAUP, neighbourhood definition
- Boundary problem
- Spatial sampling
- Tobler
Cluster analysis finding areas with unexpectedly high values, or finding groups of features with similar characteristics/values/locations, or finding point patterns in the landscape
GIS ML tool types: - prediction
- classification
- clustering
density-based clustering grouping of observations based on feature locations
Tobler's first law of geog Everything is related to everything else,
but near things are more related than distant things
Modifiable Areal Unit Problem (MAUP) Combined effects of scale and aggregation
how things are zoned AND the scale of geographic unit (then aggregated) can change outcomes
explore alternate zoning effects and hierarchical models
Spatial weights matrix File that quantifies spatial relationships (neighbourhood/s) among a set of features
Traditional statistical tests can often be applied to spatial data, BUT...	they don't account for Tobler's first law
and other spatial relationships
also, datasets w/ very different distributions can produce the same summary statistics
central tendency: Median """Middle"" value, often better choice for skewed data
- common for socio{demographic,economic} data
- exact centre of distribution
"
central tendency: What do outliers do?
Dispersion: range	difference between min and max values in a distribution
Heavily impacted by outliers (frequency of values not considered)
Dispersion: Standard deviation	sqrt(variance)
variance: sum of (observation - mean)^2, all / number of observations
z-scores (Standard score) standard deviations above/below mean
( observation - mean ) / stddev
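Worked sketch of variance, standard deviation, and z-scores. It assumes the population form (divide by n); sample variance would divide by n - 1 instead. The scores are made up.

```python
import math


def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def z_scores(values):
    """(observation - mean) / standard deviation for every observation."""
    m = sum(values) / len(values)
    sd = math.sqrt(variance(values))
    return [(v - m) / sd for v in values]


# Hypothetical scores with mean 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(scores), math.sqrt(variance(scores)), z_scores(scores))
```

For these scores the variance is 4 and the standard deviation is 2, so the value 9 sits two standard deviations above the mean.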
central tendency: mean centre	spatial average: mean of all x coordinates and mean of all y coordinates
AI ability of a machine to perform tasks traditionally requiring human intelligence
Machine learning set of tools, algos, and techniques to allow computers to learn patterns in data and acquire info w/o human explicitly programming the process
Deep learning Using trainable algos in the form of artificial neural networks (inspired by how human brain works)
DBSCAN	"Needs min # of features to be considered a cluster and a max search distance (FIXED SEARCH DISTANCE); finds clusters of similar densities
- fastest computationally
"
HDBSCAN uses series of nested clusters and chooses levels that create stable clusters having as many members as possible
- can find clusters of varying densities
- most data-driven
OPTICS Uses reachability plot for distances between neighbours, peaks = big spatial jump, separates clusters
- ex 2 peaks in a row: noise point
- can adjust sensitivity
- most computationally-intensive
DBSCAN vs HDBSCAN DBSCAN struggles w/ different densities unlike HDBSCAN, where search distances can vary
multivariate clustering GROUPING of observations based on feature attributes
trad vs. spatially constrained multivariate clustering traditional:
- not explicitly spatial, but can be applied to spatial problems/ data
- Features in group more alike, but groups may not be spatially contiguous
SC:
- Explicitly incorporate geography
- AKA spatially contiguous groups but maybe less alike
MC: k-means Finds k groups in data based on feature attributes
- think number of variables, plotted in an n-D space, find clusters from there
- boxplot + map (but not inherently spatial)
SC-MvC: Minimum spanning tree features laid out in data / 'spatial' space, connected based on how far they are based on location and attribute values
links are THEN broken in ways that keep clusters as distinct as possible
complete spatial randomness reference spatial distribution, simulates random pattern
you can compare observations to CSR
negative spatial autocorrelation similar values scattered across space, things closer together likely to have diff values
underlying process causes REGULAR DISPERSION
random pattern mix of clustering and dispersion
positive spatial autocorrelation similar values clustered in space
things closer together likely to have similar values
underlying process leads to clustering
p-value probability that the pattern seen is the result of a random process
(low p-value = high confidence that it's not)
Spatial autocorrelation: TEST TYPES - Global: whole study area, generalizes as summary statistic, NO MAP
- Local: the global statistic applied to a subset of the study area
- Scan Statistics: search multiple subsets, return where clustering/dispersal
Moran's I Clustering, dispersion, both?
Clustered +ve
Dispersal -ve
z-score: high/low suggests pattern unlikely to be random
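Sketch of the sign of global Moran's I on a tiny example. It assumes the standard formula I = n * sum_ij w_ij (x_i - mean)(x_j - mean) / (S0 * sum_i (x_i - mean)^2) with binary weights, where S0 is the sum of all weights. The four-cell row and its values are made up, and no z-score or p-value is computed.

```python
def morans_i(values, weights):
    """Global Moran's I for values and a square weights matrix (list of lists)."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    denom = sum(d * d for d in dev)
    return n * cross / (s0 * denom)


# Hypothetical row of 4 cells; each cell neighbours the cells next to it.
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 5, 5], w)  # similar values side by side
dispersed = morans_i([1, 5, 1, 5], w)  # alternating values
print(clustered, dispersed)
```

The clustered row gives a positive I and the alternating row a negative I, matching the +ve/-ve reading above.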
Getis-Ord G Clustering?
G: Spatial density, high values indicate high value clustering, low ...
Getis-Ord Local Gi* (Scan statistic) ID Hot and cold spots (clusters) relative to mean ACROSS study area at different confidence levels (p-value)
is local nbh average (z-score) significantly different from global average?
could have low features included in hot spot
Anselin Local Moran's I (LISA) (Scan statistic) Clusters: high-high, low-low
Outliers: high-low, low-high (a feature doesn't match w/ rest of nbh)
Is nbh average significantly different from global average, and
is feature value significantly different from nbh average?
Bivariate Moran's I Clusters across 2 variables
high-high, low-low, high-low, low-high clusters
better than comparing visually
linear mean: orientation vs direction orientation: angle only
direction: average length and angle
Spatiotemporal pattern mining Finding patterns in space and through time by analyzing snapshots of data
Problem w/ thematic/ hotspot map for each period of time Effectiveness of visual analysis decreases as volume of data increases
- difficult to visually quantify trends/change
- assumes each time period is completely independent (Jan doesn't affect Feb)
Time-series analysis looking at how spatial patterns have changed over time
Space-time cube 3D cube
x,y: location grid or polygon aggregation
z: time
Space-time cube: Bin individual unit: unique spatiotemporal extent
Space-time cube: location column of bins: same location, different temporal extents
Space-time cube: Time Slice row of bins that share the same temporal extent (imagine off the top: most recent)
standalone analysis: 1 slice
Space-time cube: Aggregation Polygons: probably should standardize
Bins (same size: no need to standardize):
- fishnet / square grid: quicker than hexagon
- hexagon: more edges = more neighbours, closest approximation to circle that fits nicely
Space-time cube: Modifiable temporal unit problem Similar to MAUP but over time: How we aggregate time can affect our analysis
Data aggregated to months results != years
Consider: seasonality of data, but also more broad patterns
Also: Feb (fewer days = fewer car crashes)
Ensure aggregation divides evenly among data or chop off the oldest bit
Emerging hot spot analysis Getis-Ord Gi*: Spatial + temporal (+-1) neighbours, compares to whole study area for hot/cold spots
Difference: z-score, Significance: p-value
Results:
3D: clustering/significance of clustering at each bin over time for entire space-time cube
2D: Top is summarized
17 categories (8+8+1): consider combining/removing categories depending on audience
can also examine a category (why is there oscillation?)
Local Outlier Analysis "Extension of Anselin Local Moran's I: IDs clusters and local outliers in space + time (not as detailed as emerging hotspot analysis)
3D entire output: bin value -- neighbourhood value
2D summary output: no indication of changes to significance of clusters/outliers
- ""only"": has only ever been that
- multiple types, never significant"