Qualitative Data
- no numbers → only categories
- absolute frequency: amount of data points fitting a description (category)
- e.g. passengers being male and in 2nd class
- base for bar charts
- relative frequency: relative amount of data points fitting a description (category)
- e.g. 70% of passengers are male
- base for pie charts
category | absolute freq | relative freq
SUM | n | 1
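A minimal Python sketch of such a frequency table. The passenger list is made up for illustration and all names are hypothetical:

```python
from collections import Counter

# Hypothetical sample: sex of 10 passengers (made-up data for illustration)
passengers = ["male", "female", "male", "male", "female",
              "male", "male", "female", "male", "male"]

n = len(passengers)
absolute_freq = Counter(passengers)  # count of data points per category
relative_freq = {cat: count / n for cat, count in absolute_freq.items()}

print("category | absolute freq | relative freq")
for cat in absolute_freq:
    print(f"{cat} | {absolute_freq[cat]} | {relative_freq[cat]:.2f}")
# The SUM row: absolute frequencies add up to n, relative frequencies to 1
print(f"SUM | {sum(absolute_freq.values())} | {sum(relative_freq.values()):.2f}")
```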
Graphical Methods for Quantitative Data
Histogram
- cut data set into multiple regions (intervals)
- e.g. height of students in 5 cm intervals
- how to choose the number of intervals?
- <25 data points → 5-6 intervals
- 25-50 data points → 7-14 intervals
- >50 data points → 14-20 intervals
- or whatever is sensible for the problem at hand
- assign each data point to one such interval
- calculate relative frequency and density
- rel. freq. = density * width
- plot relative frequency in a bar chart (without distance between the bars)
- from plot alone one cannot “zoom in” and find out about sub-interval distributions
- but one can add up multiple intervals and build larger intervals (e.g. first 5 intervals)
- used in e.g. binomial distributions (geogebra)
- used in e.g. kibana log volume (how many logs were ingested between 13 and 14 today?)
- R has built-in data sets (e.g. flow of the Nile)
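The histogram steps can be sketched in Python as follows. The heights, the left edge of the first interval and the number of intervals are assumptions chosen to match the 5 cm example; nothing here comes from a real data set:

```python
# Hypothetical heights in cm (made-up data for illustration)
heights = [158, 161, 163, 164, 166, 167, 168, 169, 171, 172,
           172, 174, 175, 176, 178, 179, 181, 183, 186, 191]

width = 5      # interval width in cm
start = 155    # left edge of the first interval (assumed)
num_bins = 8   # intervals [155, 160), [160, 165), ..., [190, 195)

n = len(heights)
counts = [0] * num_bins
for h in heights:
    counts[(h - start) // width] += 1  # assign each data point to one interval

rel_freq = [c / n for c in counts]         # relative frequency per interval
density = [rf / width for rf in rel_freq]  # rel. freq. = density * width

for i in range(num_bins):
    lo = start + i * width
    print(f"[{lo}, {lo + width}): rel. freq. = {rel_freq[i]:.2f}, "
          f"density = {density[i]:.3f}")

# Intervals can be added up into larger ones, e.g. everything below 170 cm
below_170 = sum(rel_freq[:3])
print(f"share below 170 cm: {below_170:.2f}")
```

The reverse direction is not possible: the counts alone do not reveal how the points are distributed inside a single interval.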
Empirical Distribution Function
- base is a series of relative frequencies
- function $F_n(x)$ gives the overall relative frequency of data points at or below $x$
- basically splitting the data set into 2 at the point $x$ and telling which share lies at or below it
- range is $[0, 1]$
- height of function value is $F_n(x) = \frac{\#\{x_i \le x\}}{n}$
- jump size at $x_i$ is relative frequency of points between current and previous value
- if not a cumulative distribution, then only the value $x_i$ itself lies between $x_{i-1}$ (exclusive) and $x_i$ (inclusive), therefore the jump is the relative frequency of that value
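A small Python sketch of an empirical distribution function, assuming the common "less than or equal" convention. The function name `ecdf` and the data are hypothetical:

```python
def ecdf(data, x):
    """Relative frequency of data points that are <= x."""
    return sum(1 for value in data if value <= x) / len(data)

# Hypothetical measurements (made-up data for illustration)
data = [2, 3, 3, 5, 7, 7, 7, 9]

# The function only changes at observed values, so evaluate it there
for x in sorted(set(data)):
    print(x, ecdf(data, x))

# Jump size at 7: difference to the previous observed value 5.
# It equals the relative frequency of the value 7 itself (3 of 8 points).
jump_at_7 = ecdf(data, 7) - ecdf(data, 5)
print(jump_at_7)
```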
Scatter Plot
- pairing x and y values graphically (drawing a dot)
- used in Regression and Correlation Analysis
Numerical Methods for Quantitative Data
- data set … sample or population
- numerical values later on used for inference
- central tendency … clustering, centering around certain values
- variability … spread of data
- relative standing … relationship of a measurement to rest of the data
Central Tendency
- mean … average of all values
- sum up all values and divide by the number of values $n$: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- can be used for approximation of population mean
- median … value of center value (sorted)
- if $n$ is even (no center value) the median is the mean of the 2 center values
- robustness: more robust than mean against outliers
- skewness
- one tail of dataset has more extreme observations than the other
- negative skewness → left skewness (longer tail at the beginning of the sorted data)
- positive skewness → right skewness (longer tail at the end of the sorted data)
- skewness … pulling the mean away from the median
- mode
- measurement of maximum frequency within dataset
- which value appeared most often
- modes can be multiple (e.g. 4 and 12 appeared both 8 times within data set)
- when reading from histograms
- modal class … largest relative frequency (interval with highest frequency)
- center of modal class considered as mode (only approximation)
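The three measures of central tendency can be tried out with Python's `statistics` module. The data set is made up and contains one deliberate outlier to show the robustness of the median:

```python
import statistics

# Hypothetical data set with one large outlier (made-up for illustration)
data = [4, 4, 5, 6, 6, 6, 7, 8, 50]

mean = statistics.mean(data)        # sum up and divide by n
median = statistics.median(data)    # center value of the sorted data
modes = statistics.multimode(data)  # list of all most frequent values

print(mean, median, modes)

# The outlier 50 pulls the mean to the right of the median (right skew),
# while the median stays at the center of the bulk of the data.
print(mean > median)

# Even n: the median is the mean of the 2 center values
print(statistics.median([1, 2, 3, 4]))
```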
Variability
- range
- difference minimum and maximum measurement
- problem: 2 datasets can have same range but vastly different data
- sample variance
- measurement of deviation from mean across entire data set
- measurement of spread-outedness
- important: divide by $n-1$, not by $n$ like in Central Tendency: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
- unit is squared as well (e.g. dollars squared), even though it is not a useful unit
- why squared in the sum? otherwise the distances to the mean would cancel each other out
- absolute value is not a nice function → square is easier
- sample standard deviation
- unit is not squared (e.g. dollars), this is a useful unit
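- Empirical Rule (for mound-shaped, roughly symmetric data):
- area within mean $\bar{x}$ ± 1, 2, 3 standard deviations $s$ is roughly 68%, 95%, 99.7%
- difficult to get a measurement outside of the $\bar{x} \pm 3s$ range (0.3%)
- Chebyshev’s Rule:
- at least $1 - \frac{1}{k^2}$ of the measurements lie within $\bar{x} \pm ks$ (for any $k > 1$, any shape of distribution)
- can be used as a sanity check or crude outlier detection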
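A hedged Python sketch of the variability measures, computed by hand and cross-checked against the `statistics` module. The prices are made up; `k = 2` is an arbitrary choice for the Chebyshev check:

```python
import statistics

# Hypothetical prices in dollars (made-up data for illustration)
data = [12, 15, 11, 14, 18, 13, 15, 30]

n = len(data)
mean = sum(data) / n
data_range = max(data) - min(data)

# Sample variance: squared deviations from the mean, divided by n - 1
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = variance ** 0.5  # back in the original unit (dollars)

# Cross-check against the standard library (both use n - 1)
print(variance, statistics.variance(data))
print(std_dev, statistics.stdev(data))

# Chebyshev: at least 1 - 1/k^2 of the data lies within mean ± k * std_dev
k = 2
within = sum(1 for x in data if abs(x - mean) <= k * std_dev) / n
chebyshev_bound = 1 - 1 / k ** 2
print(data_range, within, chebyshev_bound)
```

Chebyshev's bound is deliberately conservative, so the observed share `within` is usually well above it.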
Relative Standing
- percentiles (quantiles)
- roughly: $p$th percentile
- $p$ percent of measurements fall within the range up to that value
- quartiles
- just a special percentile (4 regions, split at the 25th, 50th and 75th percentile)
- square is a special rectangle
- z-score
- distance between current value and mean, expressed in standard deviations
- e.g. 2.5 standard deviations above the mean → z = +2.5
- above mean → positive | below mean → negative
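The measures of relative standing can be sketched in Python as below. The scores and the helper names `z_score` and `percent_below` are hypothetical. Note that several conventions for computing quantiles exist; `statistics.quantiles` uses its default ("exclusive") method here, so other tools may give slightly different quartiles:

```python
import statistics

# Hypothetical exam scores (made-up data for illustration)
scores = [52, 58, 61, 64, 67, 70, 73, 75, 78, 81, 85, 94]

mean = statistics.mean(scores)
std_dev = statistics.stdev(scores)  # sample standard deviation (n - 1)

def z_score(x):
    """Distance of x from the mean, expressed in standard deviations."""
    return (x - mean) / std_dev

def percent_below(x):
    """Share of measurements (in percent) that fall below x."""
    return 100 * sum(1 for s in scores if s < x) / len(scores)

# Quartiles are the cut points at the 25th, 50th and 75th percentile
q_lower, q_median, q_upper = statistics.quantiles(scores, n=4)

print(z_score(94))   # above the mean → positive
print(z_score(52))   # below the mean → negative
print(percent_below(75))
print(q_lower, q_median, q_upper)
```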
Outlier Detection
- reasons:
- Fat finger: value entered incorrectly (one extra zero, whoops)
- Confusion: value comes from different population
- Irrelevant: value is too high to be relevant (too rare event)
- Is this some peasant joke I am too rich to understand?
- methods:
- z-scores (numerical)
- generally if $|z| > 3$ → outlier
- box plots (graphical)
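A minimal Python sketch of the numerical method: flag every value whose z-score exceeds 3 in absolute value. The data is made up, with `1000` standing in for a mistyped `100`:

```python
import statistics

# Hypothetical measurements; 1000 is a mistyped 100 (made-up data)
data = [98, 102, 101, 99, 100, 97, 103, 100, 101, 99,
        100, 102, 98, 101, 1000]

mean = statistics.mean(data)
std_dev = statistics.stdev(data)  # sample standard deviation (n - 1)

# Rule of thumb from the notes: |z| > 3 → outlier
outliers = [x for x in data if abs((x - mean) / std_dev) > 3]
print(outliers)
```

The outlier itself inflates both the mean and the standard deviation, which is one reason to treat this as a rule of thumb and to look at a box plot as well.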
Midterm Cutoff
Box Plots
- “Five Point Summary”
- lower / upper extreme
- $Q_L$, $M$ and $Q_U$ (L … lower, U … upper quartile, $M$ … median)
- $IQR = Q_U - Q_L$ … spread of center 50% region
- drawing
- default in R is vertical, horizontal preferred by humans
- left … down, right … up
- draw a rectangle from $Q_L$ to $Q_U$ (width is $IQR$)
- draw “whisker” until $1.5 \cdot IQR$ beyond the rectangle in both directions (inner fence)
- draw end points until $3 \cdot IQR$ beyond the rectangle in both directions (outer fence)
- any points which are not on the whiskers or in the rectangle are drawn separately
- most likely outliers
- interpretation
- Median, Lower and Upper Quartile
- left/right skewness
- number / values of outliers
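The numbers behind a box plot can be computed without drawing anything. The sketch below assumes made-up data and uses the default quantile method of Python's `statistics` module, so the quartiles may differ slightly from what R's `boxplot` computes:

```python
import statistics

# Hypothetical measurements (made-up data for illustration)
data = [3, 5, 6, 7, 8, 8, 9, 10, 11, 12, 14, 25, 40]

q_lower, median, q_upper = statistics.quantiles(data, n=4)
iqr = q_upper - q_lower  # spread of the center 50% region

# Fences are measured from the edges of the rectangle
inner_fence = (q_lower - 1.5 * iqr, q_upper + 1.5 * iqr)
outer_fence = (q_lower - 3 * iqr, q_upper + 3 * iqr)

# Points outside the inner fence are drawn separately (suspected outliers)
suspected = [x for x in data if x < inner_fence[0] or x > inner_fence[1]]
# Points outside the outer fence are the most extreme ones
extreme = [x for x in data if x < outer_fence[0] or x > outer_fence[1]]

# Five point summary: lower extreme, Q_L, median, Q_U, upper extreme
print(min(data), q_lower, median, q_upper, max(data))
print(inner_fence, outer_fence)
print(suspected, extreme)
```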