#separator:tab #html:true #tags column:3 classification: Natural Breaks When natural clusters in data are present
Minimizes variance within classes, maximizes variance between classes Week01 classification: Equal Interval "You tell # of classes
-: distribution of values not spread equally
+: evenly distributed data (on a histogram)
+: comparing data sets" Week01 classification: Defined interval You tell the width of classes
-: distribution of values also not spread equally across all classes
+: comparing data sets Week01 classification: Quantile Equal number of observations per class

+: symmetrical (normally distributed) data or mild/moderate skew (on histogram)
+: top/bottom percentile of values
-: doesn't consider natural gaps Week01 classification: Standard deviation divides data, categorizes into intervals of standard deviations above/below mean

+: to highlight how far values deviate from average
-: shows as z-scores
-: hard to understand Week01 classification: Geometric class Multiplicatively vary class widths

when data has a highly skewed distribution Week01 central tendency: mode - categorical data
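The equal-interval and quantile schemes above can be sketched as follows (a minimal sketch with hypothetical values; `statistics.quantiles` is Python stdlib):

```python
import statistics

def equal_interval_breaks(values, n_classes):
    """Upper class limits when every class has the same width (you tell # of classes)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * i for i in range(1, n_classes + 1)]

def quantile_breaks(values, n_classes):
    """Upper class limits when every class holds ~the same number of observations."""
    return statistics.quantiles(values, n=n_classes, method="inclusive") + [max(values)]

values = [1, 2, 2, 3, 4, 10, 11, 12, 40, 100]  # hypothetical skewed data
print(equal_interval_breaks(values, 4))  # [25.75, 50.5, 75.25, 100.0] -- middle classes sit empty
print(quantile_breaks(values, 4))        # [2.25, 7.0, 11.75, 100] -- breaks follow the data
```

On skewed data the equal-interval classes are mostly empty while the quantile breaks track the distribution, matching the +/- notes above.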
- highest frequency value central tendency: Qualitative ordinal data, use Median central tendency: Mean "" central tendency: Weighted mean weights pull the mean centre towards features with higher weight values central tendency: Median center location representing the shortest total distance to all other features

more robust to outliers central tendency: Central feature Chooses an *existing feature* in dataset that has shortest total distance to ALL other features dispersion: normally distributed data ~68% within 1 std deviation from mean,
~95% within 2,
~99% within 3 dispersion: Standard distance average distance of each feature to mean center
then use that distance as a radius centered on mean center
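A minimal sketch of mean centre and standard distance (hypothetical coordinates; standard distance computed as the root-mean-square distance to the mean centre):

```python
import math

def mean_center(points):
    """Spatial average: mean of the x's and mean of the y's."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def standard_distance(points):
    """Root-mean-square distance of features from the mean center;
    use it as a radius centered on the mean center (spatial equiv of std dev)."""
    cx, cy = mean_center(points)
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / len(points))

pts = [(0, 0), (4, 0), (0, 4), (4, 4)]  # hypothetical feature locations
print(mean_center(pts))        # (2.0, 2.0)
print(standard_distance(pts))  # ~2.83 -- radius of the standard distance circle
```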

(Spatial equiv of std deviation) Standard deviational ellipse Like standard distance,

but calculates the standard deviation of the x and y coordinates separately, relative to the mean center (captures directional trend) Dispersion How spread out/ compact a dataset is around its location of central tendency Central tendency "Single location to summarize a set of locations, ""typical""/ ""average""/ most representative" Defining neighbours: Number of neighbours nbh defined by a specified number of features closest to the focal feature

distances can vary depending on density of features Defining neighbours: Fixed distance all features that fall within specified distance of focal features

num of neighbours depends on density of features in area Defining neighbours: Network distance Travel routes around focal feature

Fixed distance ignores barriers (e.g. a missing bridge); network distance is more realistic Contiguity: Raster/Vector polygons vs points Only applies to polygons, because points have no edges Delaunay triangulation Contiguity but for points

Generates Thiessen polygons on points, then uses the contiguity edges corners method Defining neighbours: Contiguity edges (Rook) shared border w/ focal feature considered a neighbour

(directly next to each other) Defining neighbours: Contiguity edges corners (Queen) feature sharing a border or corner considered a neighbour Spatial analysis issues - MAUP, neighbourhood definition
- Boundary problem
- Spatial sampling
- Tobler Cluster analysis finding areas with unexpectedly high values, or finding groups of features with similar characteristics/values/locations, or finding point patterns in the landscape GIS ML tool types: - prediction
- classification
- clustering density-based clustering grouping of observations based on feature locations Tobler's first law of geog Everything is related to everything else,
but near things are more related than distant things Modifiable Areal Unit Problem (MAUP) Combined effects of scale and aggregation

how things are zoned AND the scale of geographic unit (then aggregated) can change outcomes

explore alternate zoning effects and hierarchical models Spatial weights matrix File that quantifies spatial relationships (neighbourhood/s) among a set of features Traditional statistical tests can often be applied to spatial data, BUT... doesn't account for Tobler's first law
and other spatial relationships

also, datasets w/ very different distributions can produce the same summary statistics central tendency: Median """Middle"" value, often better choice for skewed data
" central tendency: What do outliers do? TBA Dispersion: variance difference between min and max values in a distribution
Heavily impacted by outliers (frequency of values not considered) Dispersion: Standard deviation sqrt(variance)

variance: sum of (observation - mean)^2 over all observations / n (number of observations) z-scores (Standard score) standard deviations above/below mean

( observation - mean ) / stddev central tendency: mean centre spatial average (mean of the x coordinates, mean of the y coordinates) AI ability of a machine to perform tasks traditionally requiring human intelligence Machine learning set of tools, algos, and techniques that allow computers to learn patterns in data and acquire info w/o humans explicitly programming the process Deep learning Using trainable algos in the form of artificial neural networks (inspired by how the human brain works) DBSCAN "# of features to be considered a cluster, max search distance" HDBSCAN uses a series of nested clusters and chooses the levels that create stable clusters with as many members as possible
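The variance, standard deviation, and z-score formulas above, sketched with hypothetical observations:

```python
import math

def variance(xs):
    """Population variance: mean of squared deviations from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    """sqrt(variance)."""
    return math.sqrt(variance(xs))

def z_score(x, xs):
    """Standard deviations above/below the mean: (observation - mean) / stddev."""
    m = sum(xs) / len(xs)
    return (x - m) / std_dev(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations (mean 5)
print(variance(data))    # 4.0
print(std_dev(data))     # 2.0
print(z_score(9, data))  # 2.0 -> two std deviations above the mean
```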

OPTICS Uses reachability plot for distances between neighbours, peaks = big spatial jump, separates clusters
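A minimal DBSCAN sketch (hypothetical points; the two parameters are exactly the ones on the DBSCAN card: min # of features for a cluster, max search distance). Real tools (ArcGIS, scikit-learn) are preferable in practice:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: clusters grow from core points that have at least
    min_pts neighbours within eps; unreachable points are noise (-1)."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbours(i)) < min_pts:
            labels[i] = -1          # noise for now (may later join a cluster as a border point)
            continue
        cluster += 1                # start a new cluster from this core point
        labels[i] = cluster
        queue = neighbours(i)
        while queue:
            j = queue.pop()
            if labels[j] == -1:     # border point previously marked noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours(j)) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(neighbours(j))
    return labels

# hypothetical points: two tight groups and one lone outlier
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The single fixed `eps` is why DBSCAN struggles when cluster densities differ, which is the motivation for HDBSCAN below.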

DBSCAN vs HDBSCAN DBSCAN struggles w/ different densities unlike HDBSCAN, where search distances can vary multivariate clustering GROUPING of observations based on feature attributes trad vs. spatially constrained multivariate clustering traditional: groups on attribute values only
SC: clusters must also be spatially contiguous
MC: k-means Finds k groups in data based on feature attributes
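A minimal k-means sketch in attribute space, i.e. the traditional (non-spatially-constrained) version, with hypothetical points:

```python
import math
import random

def kmeans(points, k, iters=20, seed=42):
    """Minimal k-means: assign each observation to its nearest centroid,
    then move each centroid to the mean of its members; repeat."""
    rng = random.Random(seed)          # fixed seed for a repeatable sketch
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return centroids, groups

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # two obvious attribute groups
centroids, groups = kmeans(pts, k=2)
print(sorted(centroids))  # one centroid per group
```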

SC-MvC: Minimum spanning tree features laid out in data / 'spatial' space, connected based on how close they are in location and attribute values

links are THEN broken in ways that keep clusters as distinct as possible complete spatial randomness reference spatial distribution, simulates random pattern

you can compare observations to CSR negative spatial autocorrelation similar values scattered across space, things closer together likely to have diff values

underlying process causes REGULAR DISPERSION random pattern mix of clustering and dispersion positive spatial autocorrelation similar values clustered in space
things closer together likely to have similar values

underlying process leads to clustering p-value probability that the pattern seen is the result of a random process

(low is high certainty that it's not) Spatial autocorrelation: TEST TYPES Global Moran's I Clustering, dispersion, both?

Clustered +ve
Dispersal -ve
z-score: high/low suggests pattern unlikely to be random Getis-Ord G Clustering?
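Global Moran's I above can be sketched as follows (hypothetical values on a line of 4 features with a rook-style spatial weights matrix):

```python
def morans_i(values, weights):
    """Global Moran's I: (n / sum of weights) * (weighted cross-product of
    deviations / sum of squared deviations). Positive -> clustered,
    negative -> dispersed, near 0 -> random."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    return (n / w_sum) * (num / den)

# hypothetical: 4 features in a row, each a neighbour of the adjacent one(s)
weights = [[0, 1, 0, 0],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]
print(morans_i([1, 1, 9, 9], weights))  # positive -> similar values cluster
print(morans_i([1, 9, 1, 9], weights))  # negative -> regular dispersion
```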

G: Spatial density; high values indicate clustering of high values, low values indicate clustering of low values Getis-Ord Local Gi* (Scan statistic) ID Hot and cold spots (clusters) relative to mean ACROSS study area at different confidence levels (p-value)

is local nbh average (z-score) significantly different from global average?
could have low features included in hot spot Anselin Local Moran's I (LISA) (Scan statistic) Clusters: high-high, low-low
Outliers: high-low, low-high (a feature doesn't match w/ rest of nbh)

Is nbh average significantly different from global average, and
is feature value significantly different from nbh average? Bivariate Moran's I Clusters across 2 variables

high-high, low-low, high-low, low-high clusters

better than comparing visually linear mean: orientation vs direction orientation: angle only
direction: average length and angle Spatiotemporal pattern mining Finding patterns in space and through time by analyzing snapshots of data Problem w/ thematic/ hotspot map for each period of time Effectiveness of visual analysis decreases as volume of data increases

- difficult to visually quantify trends/change
- assumes each time period is completely independent (Jan doesn't affect Feb) Time-series analysis looking at how spatial patterns have changed over time Space-time cube 3D cube
x,y: location grid or polygon aggregation
z: time Space-time cube: Bin individual unit: unique spatiotemporal extent Space-time cube: location column of bins: same location, different temporal extents Space-time cube: Time Slice row of bins that share the same temporal extent (imagine off the top: most recent)

standalone analysis: 1 slice Space-time cube: Aggregation Polygons: probably should standardize

Bins (same size: no need to standardize):
- fishnet / square grid: quicker than hexagon
- hexagon: more edges = more neighbours, closest approximation to circle that fits nicely Space-time cube: Modifiable temporal unit problem Similar to MAUP but over time: How we aggregate time can affect our analysis
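A minimal sketch of aggregating (x, y, t) events into fishnet space-time bins (hypothetical cell size and time step; each key is one bin's unique spatiotemporal extent, and keys sharing (col, row) are one location's column of bins):

```python
from collections import Counter

def space_time_bins(events, cell_size, time_step):
    """Count (x, y, t) events per fishnet space-time bin (col, row, time slice)."""
    return Counter(
        (int(x // cell_size), int(y // cell_size), int(t // time_step))
        for x, y, t in events
    )

# hypothetical events: x, y in metres, t in days
events = [(10, 10, 1), (12, 11, 2), (10, 95, 1), (12, 96, 40)]
bins = space_time_bins(events, cell_size=50, time_step=30)
print(bins)  # e.g. bin (0, 0, 0) holds the two early events near the origin
```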

Data aggregated to months results != years
Consider: seasonality of data, but also more broad patterns

Also: Feb (fewer days = fewer car crashes)
Ensure aggregation divides evenly among data or chop off the oldest bit Emerging hot spot analysis Getis-Ord Gi*: Spatial + temporal (+-1) neighbours, compares to whole study area for hot/cold spots
Difference: z-score, Significance: p-value

Results:
3D: clustering/significance of clustering at each bin over time for entire space-time cube
2D: Top is summarized

17 categories (8+8+1): consider combining/removing categories depending on audience
can also examine a category (why is there oscillation?) Local Outlier Analysis "Extension of Anselin Local Moran's I: IDs clusters and local outliers in space + time (not as detailed as emerging hotspot analysis)

3D entire output: bin value -- neighbourhood value
2D summary output: no indication of changes to significance of clusters/outliers
- ""only"": has only ever been that
- multiple types, never significant" Inferential Statistics: categories Parametric statistics: key assumption: data matches some theoretical distribution (ex normal curve)

Non-parametric: no such distributional assumption Pearson correlation: def Number representing linear correlation between 2 sets of data Pearson correlation: assumptions
  1. Variables are either interval or ratio (numeric w/ consistent steps; ratio: 0 also meaningful)
  2. variables approx. a normal distribution
  3. Linear relationship between variables
  4. No extreme outliers in either dataset
Pearson correlation: outputs Correlation coefficient (-1 <= r <= +1)
p-value: * = 0.05, **=0.01, ***=0.001 (GeoDa)
R^2: coefficient of determination how much variation is explained by the model
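A minimal Pearson r sketch (hypothetical variables; covariance scaled by both standard deviations, with r squared giving the share of variation explained):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: covariance / (std dev x * std dev y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]           # hypothetical interval/ratio variable
ys = [2, 4, 6, 8, 10]          # perfect linear relationship with xs
print(pearson_r(xs, ys))       # ~1.0
print(pearson_r(xs, ys) ** 2)  # R^2: share of variation explained
```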

how much variance in the dependent var is explained by the indep var model: fitting data "
" Ordinary least squares regression (non-spatial) "" OLS: Assumptions OLS Residuals "Spatial autocorrelation, spatial non-stationarity: map OLS Residuals
" Spatial stationarity probability of a process being the same throughout space

(a change in one variable will result in a consistent change in another variable no matter where you are in study area) spatial non-stationarity large clusters of over/under-predictions

indicator that a process operates differently in different parts of study area Spatial dependency error terms/ variables somehow correlated

relationship between IV/DV not independent at each location

(violates traditional stats, see Tobler) GWR Geographically weighted regression OLS is global (1 equation for the whole study area), GWR creates a regression equation for each feature

benefit: accounts for tobler
gives map output? GWR Considerations - Multicollinearity
- because coefficients are weighted, this can be hard to interpret
- crit by trad stats ppl
- OLS: baseline, GWR tends to overfit AICc used to compare quality of a set of models, smaller is better out of the set OLS Additional assumptions - multicollinearity condition number: concern if above 20
(won't say which pair, use Pearson)
- Jarque-Bera: normality of errors: p < 0.05 indicates a lack of normality
- Breusch-Pagan, Koenker-Bassett: heteroskedasticity: p < 0.05 indicates unequal variance, may be a sign of spatial dependence in data
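A minimal OLS sketch with one predictor (hypothetical data); mapping the residuals is how you check for the spatial autocorrelation / non-stationarity noted above:

```python
def ols_fit(xs, ys):
    """Ordinary least squares for one predictor: the slope and intercept
    that minimise the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4]   # hypothetical independent variable
ys = [3, 5, 7, 9]   # dependent variable: exactly y = 2x + 1
slope, intercept = ols_fit(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept)  # 2.0 1.0
print(residuals)         # map these: clusters of over/under-prediction = non-stationarity
```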

dependent variable in a location is affected by IV of another location, this diffusion can cause spatial lag

Lag coefficient (rho): degree of spatial dependence in data
- if likelihood ratio test still significant, still issue of spatial dependence

    - lag coefficient lambda: degree of spatial dependence in the residuals
    - still spatial dependence (likelihood ratio test)