Qualitative Data

no numbers → only categories
absolute frequency: number of data points fitting a description (category)
- e.g. passengers being male and in 2nd class
- base for bar charts
relative frequency: share of data points fitting a description (category), i.e. absolute frequency divided by $n$
- e.g. 70% of passengers are male
- base for pie charts
- category | absolute freq | relative freq
- SUM | n | 1
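
A minimal Python sketch of the frequency table outlined above (category | absolute freq | relative freq, with a SUM row). The passenger list and the helper name `frequency_table` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: sex of 10 passengers (made-up data).
passengers = ["male", "female", "male", "male", "female",
              "male", "male", "female", "male", "male"]

def frequency_table(values):
    """Return {category: (absolute frequency, relative frequency)}."""
    n = len(values)
    counts = Counter(values)
    return {category: (count, count / n) for category, count in counts.items()}

table = frequency_table(passengers)
for category, (absolute, relative) in table.items():
    print(f"{category:8} | {absolute:3} | {relative:.2f}")

# SUM row: absolute frequencies add up to n, relative frequencies to 1.
total_abs = sum(absolute for absolute, _ in table.values())
total_rel = sum(relative for _, relative in table.values())
print(f"{'SUM':8} | {total_abs:3} | {total_rel:.2f}")
```

The absolute column is what a bar chart would plot, the relative column what a pie chart would plot.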

Graphical Methods for Quantitative Data

cut data set into multiple regions (intervals)
- e.g. height of students in 5 cm intervals
- how to choose amount of intervals?
  - <25 data points → 5-6 intervals
  - 25-50 data points → 7-14 intervals
  - >50 data points → 14-20 intervals
  - or whatever is sensible for the problem at hand
assign each data point to one such interval
calculate relative frequency and density
- rel. freq. = density * width
plot relative frequency in a bar chart (without distance between the bars)
from plot alone one cannot “zoom in” and find out about sub-interval distributions
but one can add up multiple intervals and build larger intervals (e.g. first 5 intervals)
used in e.g. binomial distributions (geogebra)
used in e.g. kibana log volume (how many logs were ingested between 13 and 14 today?)
R has built-in data sets (e.g. Nile, the flow of the river Nile)
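
The binning steps above (cut into intervals, assign points, compute relative frequency and density) could be sketched in Python as follows. The heights, the start value 155 and the helper name `histogram` are assumptions for illustration, not part of the notes.

```python
def histogram(values, start, width, num_intervals):
    """Bin values into equal-width intervals [lower, upper).

    Returns a list of (lower, upper, relative frequency, density) tuples.
    Values outside the covered range are ignored, so choose start and
    num_intervals such that every data point is covered.
    """
    n = len(values)
    counts = [0] * num_intervals
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < num_intervals:
            counts[index] += 1
    rows = []
    for i, count in enumerate(counts):
        lower = start + i * width
        rel_freq = count / n
        density = rel_freq / width  # so that rel. freq. = density * width
        rows.append((lower, lower + width, rel_freq, density))
    return rows

# Hypothetical heights of 10 students in cm, binned in 5 cm intervals.
heights = [158, 162, 163, 167, 168, 169, 171, 172, 174, 181]
rows = histogram(heights, start=155, width=5, num_intervals=6)
for lower, upper, rel_freq, density in rows:
    print(f"[{lower}, {upper}) | {rel_freq:.2f} | {density:.3f}")

# Adding up intervals works (share of students below 170 cm) ...
below_170 = sum(rel_freq for _, upper, rel_freq, _ in rows if upper <= 170)
# ... but the table alone cannot tell how many are below e.g. 167 cm.
```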

base is a series of relative frequencies
function finds the overall relative frequency of data points below $x$
- basically splitting the data set into 2 at the point $x$ and telling how large the left part is (as a share of all data points)
range is $[0, 1]$
height of function value is the cumulative relative frequency up to $x$
jump size at $x_{2}$ is the relative frequency of points between the previous $x_{1}$ and the current $x_{2}$
- for ungrouped data (no intervals) the function jumps at every observed value, therefore the jump size is the relative frequency of that value
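
A minimal Python sketch of this function for raw (ungrouped) data. It assumes the common convention of counting points at or below $x$ (the notes only say "below"); the name `ecdf` and the sample values are made up.

```python
def ecdf(values, x):
    """Relative frequency of data points that are <= x (always in [0, 1])."""
    return sum(1 for v in values if v <= x) / len(values)

# Hypothetical data set with one repeated value.
data = [2, 4, 4, 7, 9]

print(ecdf(data, 1))   # left of all data points
print(ecdf(data, 4))   # three of five points are <= 4
print(ecdf(data, 9))   # at the maximum

# Jump size at x = 4: function value at 4 minus the value just before 4.
# For ungrouped data this is the relative frequency of the value 4 (2 of 5).
jump_at_4 = ecdf(data, 4) - ecdf(data, 3.999)
```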

mean … average of all values
- $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$
- sum up and divide
- can be used for approximation of population mean $μ$
median … center value of the sorted data
- if $n$ is even (no center value) the median is the mean of the 2 center values
  - $M = \frac{x_{prev} + x_{after}}{2}$
- robustness: more robust than mean against outliers
skewness
- one tail of the dataset has more extreme observations than the other
- negative skewness → left skewness (long tail at the beginning, mean < median)
- positive skewness → right skewness (long tail at the end, mean > median)
- skewness … the extreme tail pulls the mean away from the median
mode
- value with the highest frequency within the dataset
- which value appears most often
- modes can be multiple (e.g. 4 and 12 appeared both 8 times within data set)
- when reading from histograms
  - modal class … interval with the largest relative frequency
  - center of modal class considered as mode (only approximation)
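
The three measures above can be tried out with Python's standard `statistics` module (Python 3.8+ for `multimode`). The data set is made up; it contains one large value to show how the mean gets pulled while the median stays put.

```python
import statistics

# Hypothetical measurements with one extreme value at the end (right tail).
data = [2, 3, 3, 4, 5, 5, 6, 40]

mean = statistics.mean(data)        # sum up and divide by n
median = statistics.median(data)    # n is even: mean of the 2 center values
modes = statistics.multimode(data)  # all values sharing the highest frequency

print(mean, median, modes)

# Mean above median hints at right (positive) skewness.
right_skewed = mean > median
```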

range
- difference between maximum and minimum measurement
- problem: 2 datasets can have same range but vastly different data
sample variance
- measurement of deviation from mean across entire data set
  - measurement of spread-outedness
- important: divide by $n - 1$, not by $n$ as for the mean
- $s^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$
- $s^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} \left(x_{i} - \frac{1}{n} \sum_{j = 1}^{n} x_{j}\right)^{2}$
- unit is squared as well (e.g. dollars squared), even though it is not a useful unit
- why squared in the sum? otherwise the distances to the mean would cancel each other out
  - absolute value is not a nice function → square is easier
sample standard deviation
- $s = \sqrt{s^{2}}$
- unit is not squared (e.g. dollars), this is a useful unit
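- Empirical Rule (for mound-shaped, roughly symmetric distributions):
  - share of measurements within $k$ standard deviations of the mean is approximately fixed
  - $\bar{x} \pm 1 s$ (or $μ \pm 1 σ$) → approx. 68%
  - $\bar{x} \pm 2 s$ (or $μ \pm 2 σ$) → approx. 95%
  - $\bar{x} \pm 3 s$ (or $μ \pm 3 σ$) → approx. 99.7%
  - a measurement outside of the $3 σ$ range is rare (approx. 0.3%)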
- Chebyshev’s Rule (holds for any distribution):
  - for all $k > 1$: at least $1 - \frac{1}{k^{2}}$ of the measurements lie within $k$ standard deviations of the mean
    - $k = 2 \to 75\%$
    - $k = 3 \to 88.9\%$
    - $k = 4 \to 93.75\%$
  - can be used as a sanity check or crude outlier detection
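
A short Python sketch of sample variance, sample standard deviation and the Chebyshev sanity check described above. The data set and the helper names (`sample_variance`, `share_within`, `chebyshev_bound`) are assumptions for illustration.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared distances to the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def share_within(values, k):
    """Share of measurements within k sample standard deviations of the mean."""
    mean = sum(values) / len(values)
    s = math.sqrt(sample_variance(values))
    inside = [x for x in values if abs(x - mean) <= k * s]
    return len(inside) / len(values)

def chebyshev_bound(k):
    """Guaranteed minimum share within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

# Hypothetical measurements (mean is 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)   # squared unit
s = math.sqrt(s2)            # same unit as the data
print(s2, s)
print(statistics.variance(data))  # stdlib version, also divides by n - 1

# Sanity check: the observed share must not fall below the bound.
for k in (2, 3, 4):
    print(k, chebyshev_bound(k), share_within(data, k))
```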

percentiles (quantiles)
- roughly: $p$ th percentile
  - $p$ percent of measurements fall below this value
quartiles
- just special percentiles (4 regions $\Rightarrow p = 25, 50, 75$ )
- square is a special rectangle
z-score
- distance between current value and mean, expressed in standard deviations
- $z = \frac{x - \bar{x}}{s}$ (sample) or $z = \frac{x - μ}{σ}$ (population)
- e.g. 2.5 standard deviations above the mean → z = +2.5
  - above mean → positive | below mean → negative
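
A small Python sketch of these measures of relative standing. The scores and the helper names `share_below` and `z_score` are made up. Note that tools use slightly different conventions for quartiles, so the lower and upper quartile from `statistics.quantiles` (Python 3.8+) may differ a little from other software.

```python
import statistics

# Hypothetical exam scores (mean is 70).
scores = [52, 58, 61, 64, 67, 70, 73, 76, 82, 97]

def share_below(values, x):
    """Share of measurements below x (percentile rank as a fraction)."""
    return sum(1 for v in values if v < x) / len(values)

def z_score(values, x):
    """Distance between x and the sample mean, in sample standard deviations."""
    return (x - statistics.mean(values)) / statistics.stdev(values)

# Quartiles are the cut points between 4 equally filled regions.
q_lower, median, q_upper = statistics.quantiles(scores, n=4)
print(q_lower, median, q_upper)

print(share_below(scores, 73))  # 6 of 10 scores are below 73
print(z_score(scores, 97))      # above the mean → positive
print(z_score(scores, 52))      # below the mean → negative
```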

reasons:
- Fat finger: value entered incorrectly (one extra zero, whoops)
- Confusion: value comes from different population
- Irrelevant: value is too high to be relevant (too rare event)
  - Is this some peasant joke I am too rich to understand?
methods:
- z-scores (numerical)
  - generally if $|z| > 3$ → outlier
- box plots (graphical)
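
The z-score method could be sketched in Python like this. The data is made up (a 10 entered as 100, the "one extra zero" case) and `find_outliers` is a hypothetical helper. One caveat: in very small samples no value can reach $|z| > 3$, so the rule needs a reasonable number of data points.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return all values whose z-score is beyond the threshold."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / s) > threshold]

# Hypothetical measurements around 10; the last one has an extra zero.
values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
          11, 9, 10, 12, 10, 9, 11, 10, 10, 100]

print(find_outliers(values))
```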

“Five Point Summary”
- lower / upper extreme
- $M$ , $Q_{L}$ and $Q_{U}$ (L … lower, U … upper)
$IQR = Q_{U} - Q_{L}$ … spread of center 50% region
drawing
- default in R is vertical, horizontal preferred by humans
  - left … down, right … up
- draw a rectangle from $Q_{L}$ to $Q_{U}$ (width is $IQR$ )
- draw “whiskers” to the most extreme points within 1.5 $IQR$ of the rectangle, both directions (inner fences)
- mark end points at 3 $IQR$ beyond the rectangle, both directions (outer fences)
- any points which are not on the whiskers or in the rectangle are drawn separately
  - most likely outliers
interpretation
- Median, Lower and Upper Quartile
- left/right skewness
- amount / values of outliers
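
The numbers behind such a plot could be computed with a Python sketch like the one below. The data and the helper name `box_plot_summary` are made up, and the "inclusive" quartile method is just one of several conventions, so other tools (including R) may give slightly different quartiles.

```python
import statistics

def box_plot_summary(values):
    """Five-number summary, IQR, fences and suspected outliers."""
    data = sorted(values)
    # Quartile conventions differ between tools; "inclusive" is one choice.
    q_lower, median, q_upper = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q_upper - q_lower
    inner = (q_lower - 1.5 * iqr, q_upper + 1.5 * iqr)
    outer = (q_lower - 3 * iqr, q_upper + 3 * iqr)
    inside = [x for x in data if inner[0] <= x <= inner[1]]
    return {
        "min": data[0], "q_lower": q_lower, "median": median,
        "q_upper": q_upper, "max": data[-1], "iqr": iqr,
        "inner_fences": inner, "outer_fences": outer,
        # Whiskers end at the most extreme points inside the inner fences.
        "whiskers": (inside[0], inside[-1]),
        # Points beyond the inner fences are drawn separately.
        "suspects": [x for x in data if x < inner[0] or x > inner[1]],
    }

# Hypothetical measurements with one suspiciously large value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the value 30 lies even beyond the upper outer fence, so it would be drawn as a separate point.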