#separator:tab
#html:true
#tags column:3
classification:&nbsp;Natural Breaks	When natural clusters in data are present<br>Minimizes variance within classes, max variance between	Week01
classification:&nbsp;Equal Interval	"You tell # of classes<br>-: distribution of values not spread equally<br>+: evenly distributed data (on a histogram)<br>+: comparing data sets"	Week01
classification: Defined interval	You tell the width of classes<br>-: distribution of values also not spread equally across all classes<br>+: comparing data sets	Week01
classification:&nbsp;Quantile	Equal number of observations per class<br><br>+: symmetrical (normally distributed) data or mild/moderate skew (on histogram)<br>+: top/bottom percentile of values<br>-: doesn't consider natural gaps	Week01
classification: Standard deviation	divides data, categorizes into intervals of standard deviations above/below mean<br><br>+: to highlight how far values deviate from average<br>-: shows as z-scores<br>-: hard to understand	Week01
classification: Geometric class	Multiplicatively vary class widths<br><br>when data has a highly skewed distribution	Week01
central tendency: mode	- categorical data<br>- highest frequency value	
central tendency: Qualitative ordinal data, use	Median	
central tendency: Mean	"<ul><li>""typical"" score, best for normally distributed data</li><li>outliers have strong influence</li><li>ex bimodal data /\_......_/\, mean not useful</li><li>ex location, point data only for LA, NYC, mean not useful</li><li>ex calculated value may not be in dataset</li></ul>"	
central tendency: Weighted mean center	pulls mean centre towards higher value weights	
central tendency: Median center	location representing the shortest total distance to all other features<br><br>more robust to outliers	
central tendency: Central feature	Chooses an *existing feature* in dataset that has shortest total distance to ALL other features	
dispersion: normally distributed data	~68% within 1 std deviation from mean,<br>~95% within 2,<br>~99% within 3	
dispersion: Standard distance	average distance of each feature to mean center<br>then use that distance as a radius centered on mean center<br><br>(Spatial equiv of std deviation)	
Standard deviational ellipse	Like standard distance,<br><br>but calculate x and y coordinates to those of mean center	
Dispersion	How spread out/ compact a dataset is around its location of central tendency	
Central tendency	"Single location to summarize a set of locations, ""typical""/ ""average""/ most representative"	
Defining neighbours: Number of neighbours	nbh defined by a specified number of features closest to the focal feature<br><br>distances can vary depending on density of features	
Defining neighbours:&nbsp;Fixed distance	all features that fall within specified distance of focal features<br><br>num of neighbours depends on density of features in area	
Defining neighbours:&nbsp;Network distance	Travel routes around focal feature<br><br>Fixed distance (no bridge) vs realistic	
Contiguity: Raster/Vector polygons vs points	Only to polygons, because points have no edges	
Delaunay triangulation	Contiguity but for points<br><br>Generates thiessen polygons on points, then use contiguity edges corners method	
Defining neighbours: Contiguity edges (Rook)	shared border w/ focal feature considered a neighbour<br><br>(directly next to each other)	
Defining neighbours:&nbsp;Contiguity edges corners (Queen)	border or corner shared feature considered neighbour	
Spatial analysis issues	- MAUP, neighbourhood definition<br>- Boundary problem<br>- Spatial sampling<br>- Tobler	
Cluster analysis	finding areas with unexpectedly high values, or finding groups of features with similar characteristics/values/locations, or finding point patterns in the landscape	
GIS ML tool types:	- prediction<br>- classification<br>- clustering	
density-based clustering	grouping of observations based on feature locations	
Tobler's first law of geog	Everything is related to everything else,<br>but near things are more related than distant things	
Modifiable Areal Unit Problem (MAUP)	Combined effects of scale and aggregation<br><br>how things are zoned AND the scale of geographic unit (then aggregated) can change outcomes<br><br>explore alternate zoning effects and hierarchical models	
Spatial weights matrix	File that quantifies spatial relationships (neighbourhood/s) among a set of features	
Traditional statistical tests can often be applied to spatial data, BUT...	doesn't account for Tobler's first law<br>and other spatial relationships<br><br>also, datasets w/ very different distributions can produce the same summary statistics	
central tendency: Median	"""Middle"" value, often better choice for skewed data<br><ul><li>common for socio{demographic,economic} data</li><li>exact centre of distribution</li></ul>"	
central tendency: What do outliers do?		TBA
Dispersion: variance	difference between min and max values in a distribution<br>Heavily impacted by outliers (frequency of values not considered)	
Dispersion: Standard deviation	sqr(variance)<br><br>variance: sum of (observations - mean)<sup>2</sup>&nbsp;all&nbsp; /&nbsp; mean	
z-scores (Standard score)	standard deviations above/below mean<br><br>( observation - mean ) / stddev	
central tendency: mean centre	spatial average (add up)	
AI	ability of a machine to perform tasks traditionally requiring human intelligence	
Machine learning	set of tools, algos, and techniques to allow computers to learn patterns in data and acquire info w/o human explicitly programming the process	
Deep learning	Using trainable algos in the form of artificial neural networks (inspired by how human brain works)	
DBSCAN	"# of features to be considered a cluster, max search distance<ul><li>FIXED SEARCH DISTANCE, to find clusters of similar densities</li><li>fastest computationally</li></ul>"	
HDBSCAN	uses series of nested clusters and chooses levels that create stable clusters having as many members as possible<br><br><ul><li>can find clusters of varying densities</li><li>most data-driven</li></ul>	
OPTICS	Uses reachability plot for distances between neighbours, peaks = big spatial jump, separates clusters<br><br><ul><li>ex 2 peaks in a row: noise point</li><li>can adjust sensitivity</li><li>most computationally-intensive</li></ul>	
DBSCAN vs HDBSCAN	DBSCAN struggles w/ different densities unlike HDBSCAN, where search distances can vary	
multivariate clustering	GROUPING of observations based on feature attributes	
trad vs. spatially constrained multivariate clustering	traditional:<br><ul><li>not explicitly spatial, but can be applied to spatial problems/ data</li><li>Features in group more alike, but groups may not be spatially contiguous</li></ul>SC:<br><ul><li>Explicitly incorporate geography</li><li>AKA spatially contiguous groups but maybe less alike</li></ul>	
MC: k-means	Finds <i>k</i>&nbsp;groups in data based on feature attributes<br><br><ul><li>think number of variables, plotted on a n-D space, find clusters from there</li><li>boxplot + map (but not inherently spatial)</li></ul>	
SC-MvC: Minimum spanning tree	features laid out in data / 'spatial' space, connected based on how far they are based on location and attribute values<br><br>links are THEN broken in ways that keep clusters as distinct as possible	
complete spatial randomness	reference spatial distribution, simulates random pattern<br><br>you can compare observations to CSR	
negative spatial autocorrelation	similar values scattered across space, things closer together likely to have diff values<br><br>underlying process causes REGULAR DISPERSION	
random pattern	mix of clustering and dispersion	
positive spatial autocorrelation	similar values clustered in space<br>things closer together likely to have similar values<br><br>underlying process leads to clustering	
p-value	probability that the pattern seen is the result of a random process<br><br>(low is high certainty that it's not)	
Spatial autocorrelation: TEST TYPES	<ul><li>Global: whole study area, generalizes as summary statistic, NO MAP</li><li>Local: global on a subset of the study area</li><li>Scan Statistics: search multiple subsets, return where clustering/dispersal</li></ul>	
Global Moran's I	Clustering, dispersion, both?<br><br>Clustered +ve<br>Dispersal -ve<br>z-score: high/low suggests pattern unlikely to be random	
Getis-Ord G	Clustering?<br><br>G: Spatial density, high values indicate high value clustering, low ...	
Getis-Ord Local Gi* (Scan statistic)	ID Hot and cold spots (clusters) relative to mean ACROSS study area at different confidence levels (p-value)<br><br>is local nbh average (z-score) significantly different from global average?<br>could have low features included in hot spot	
Anselin Local Moran's I (LISA) (Scan statistic)	Clusters: high-high, low-low<br>Outliers: high-low, low-high (a feature doesn't match w/ rest of nbh)<br><br>Is nbh average significantly different from global average, and<br>is feature value significantly different from nbh average?	
Bivariate Moran's I	Clusters across 2 variables<br><br>high-high, low-low, high-low, low-high clusters<br><br>better than comparing visually	
linear mean: orientation vs direction	orientation: angle only<br>direction: average length and angle	
Spatiotemporal pattern mining	Finding patterns in space and through time by analyzing snapshots of data	
Problem w/ thematic/ hotspot map for each period of time	Effectiveness of visual analysis decreases as volume of data increases<br><br>- difficult to visually quantify trends/change<br>- assumes each time period is completely independent (Jan doesn't affect Feb)	
Time-series analysis	looking at how spatial patterns have changed over time	
Space-time cube	3D cube<br>x,y: location grid or polygon aggregation<br>z: time	
Space-time cube: Bin	individual unit: unique spatiotemporal extent	
Space-time cube: location	column of bins: same location, different temporal extents	
Space-time cube: Time Slice	row of bins that share the same temporal extent (imagine off the top: most recent)<br><br>standalone analysis: 1 slice	
Space-time cube: Aggregation	Polygons: probably should standardize<br><br>Bins (same size: no need to standardize):<br>- fishnet / square grid: quicker than hexagon<br>- hexagon: more edges = more neighbours, closest approximation to circle that fits nicely	
Space-time cube: Modifiable temporal unit problem	Similar to MAUP but over time: How we aggregate time can affect our analysis<br><br>Data aggregated to months results != years<br>Consider: seasonality of data, but also more broad patterns<br><br>Also: Feb (less days = less car crashes)<br>Ensure aggregation divides evenly among data or chop off the oldest bit	
Emerging hot spot analysis	Getis-Ord Gi*: Spatial + temporal (+-1) neighbours, compares to whole study area for hot/cold spots<br>Difference: z-score, Significance: p-value<br><br>Results:<br>3D: clustering/significance of clustering at each bin over time for entire space-time cube<br>2D: Top is summarized<br><br>17 categories (8+8+1): consider combining/removing categories depending on audience<br>can also examine a category (why is there oscillation?)	
Local Outlier Analysis	"Extension of Anselin Local Moran's I: IDs clusters and local outliers in space + time (not as detailed as emerging hotspot analysis)<br><br>3D entire output:&nbsp;<i>bin value --&nbsp;neighbourhood value</i><br>2D summary output: no indication of <b>changes</b> to significance of clusters/outliers<br>- ""only"": has only ever been that<br>- multiple types, never significant"	
Inferential Statistics: categories	Parametric statistics: key assumption: data matches some theoretical distribution (ex normal curve)<br><br>Non-parametric: no key assumption	
Pearson correlation: def	Number representing linear correlation between 2 sets of data	
Pearson correlation: assumptions	<ol><li>Variables are either interval or ratio (numeric w/ consistent steps, interval: 0 also meaningful)</li><li>variables approx. a normal distribution</li><li>Linear relationship between variables</li><li>No extreme outliers in either dataset</li></ol>	
Pearson correlation: outputs	Correlation coefficient (-1 &lt; 0 &gt; +1)<br><ul><li>0: none, 1: perfectly linear</li><li>0.01-0.30: Weak</li><li>0.31-0.70: Moderate</li><li>0.71 - 1.0: Strong</li></ul><div>p-value: * = 0.05, **=0.01, ***=0.001 (geoDa)</div>	
R^2: coefficient of variation	how much variation is explained by figure<br><br>variance in dependence var is caused by indep var???	
model: fitting data	"<img src=""paste-da5f6325ad7847f54cba78df37550a1f39640cb7.jpg""><br><ul><li>Underfitting, too general, doesn't capture relationships in training data, can't generate useful predictions</li><li>Overfitting: model too complex, cannot generalize so cannot predict</li></ul>"	
Ordinary least squares regression (non-spatial)	"<ul><li>linear regression model (eqn)</li><li>cause/effect unidirectional</li><li>y = strength + x_var(coefficient) + .... + residual</li><li><img src=""paste-e4656c0fe0d242e1e516aaac91b8bf814814454c.jpg""><br></li></ul>"	
OLS: Assumptions	<ul><li>linear relationship between variables (Pearsons)</li><li>Relationship is homoscedastic (no skew on scatterplot of both)</li><li>No multicollinearity (no correlation between independent variables), less applicable when only 2. why? affects model reliability, risk of *-fitting &lt;--. Test: use pearsons correlation for each pair</li><li>From output: Residuals are normally distributed</li><li>From output: Residals are not spatially autocorrelated (run global Moran's I)</li></ul>	
OLS Residuals	"Spatial autocorrelation, spatial non-stationarity: map OLS Residuals<br><img src=""paste-e4656c0fe0d242e1e516aaac91b8bf814814454c.jpg"">"	
Spatial stationarity	probability of a process being the same throughout space<br><br>(a change in one variable will result in a consistent change in another variable no matter where you are in study area)	
spatial non-stationarity	large clusters of over/under-predictions<br><br>indicator that a process operates differently in different parts of study area	
Spatial dependency	error terms/ variables somehow correlated<br><br>relationship between IV/DV not independent at each location<br><br>(violates traditional stats, see Tobler)	
GWR Geographically weighted regression	OLS is global (1 equation for each study area), GWR creates a regression equation for each feature<br><br>benefit: accounts for tobler<br>gives map output?	
GWR Considerations	- Multicollinearity<br>- because coefficients are weighted, this can be hard to interpret<br>- crit by trad stats ppl<br>- OLS: baseline, GWR tends to overfit	
AiCC	used to compare quality of a set of models, smaller is better out of the set	
OLS Additional assumptions	- multicollinearity condition number: concern if above 20&nbsp;<br>(won't say which pair, use pearson)<br>- Jarque-Bera: normality of errors: p &lt; 0.05 indicates a lack of normality<br>- Bruesch-Pagan, Koenker-Bassett: heteroskedascity: p &lt; 0.05 indicates unequal variance, may be a sign of spatial dependence in data	
SLM	lag: adjacent values are systematically related (autocorrelated), produces model bias<br><br>dependent variable in a location is affected by IV of another location, this diffusion can cause spatial lag<br><br>Lag coefficient (rho): degree of spatial dependence in data<br>- if likelihood ration test still significant, still issue of spatial dependence	
SEM	nuisance (error): problem w/ model residuals only, not variables. correct by including a spatial error term in the model<br><br>&nbsp; &nbsp; - lag coefficient lambda: degree of spatial dependence in the residuals<br>&nbsp;&nbsp;&nbsp; - still spatial dependence (likelihood ratio test)<br>