Qualitative Data

  • no numbers, only categories
  • absolute frequency: number of data points fitting a description (category)
    • e.g. passengers being male and in 2nd class
    • base for bar charts
  • relative frequency: share of data points fitting a description (category), i.e. absolute frequency divided by $n$
    • e.g. 70% of passengers are male
    • base for pie charts
  • frequency table
    • columns: category | absolute freq | relative freq
    • last row: SUM | n | 1
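
The frequency table described above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the ten-passenger sample are assumptions, not data from the notes.

```python
from collections import Counter

def frequency_table(categories):
    """Return rows (category, absolute freq, relative freq) plus a SUM row."""
    counts = Counter(categories)          # absolute frequency per category
    n = len(categories)                   # total number of data points
    rows = [(cat, count, count / n) for cat, count in counts.items()]
    # The SUM row: absolute frequencies add up to n, relative ones to 1.
    rows.append(("SUM", n, sum(rel for _, _, rel in rows)))
    return rows

# Hypothetical sample: sex of ten passengers (illustrative values only).
passengers = ["male"] * 7 + ["female"] * 3
for row in frequency_table(passengers):
    print(row)
```

The absolute column is what a bar chart would show, the relative column what a pie chart would show.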

Graphical Methods for Quantitative Data

Histogram

  • cut data set into multiple regions (intervals)
    • e.g. height of students in 5 cm intervals
    • how to choose the number of intervals?
      • < 25 data points → 5-6 intervals
      • 25-50 data points → 7-14 intervals
      • > 50 data points → 14-20 intervals
      • or whatever is sensible for the problem at hand
  • assign each data point to one such interval
  • calculate relative frequency and density
    • rel. freq. = density * width
  • plot relative frequency in a bar chart without gaps between the bars (plot the density instead if the interval widths differ)
  • from plot alone one cannot “zoom in” and find out about sub-interval distributions
  • but one can add up multiple intervals and build larger intervals (e.g. first 5 intervals)
  • used in e.g. binomial distributions (geogebra)
  • used in e.g. kibana log volume (how many logs were ingested between 13 and 14 today?)
  • R has built-in data sets (e.g. Nile, the flow of the river Nile)
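
A minimal sketch of the histogram steps above (cut into intervals, assign points, compute relative frequency and density). The function name `histogram`, its parameters, and the student heights are assumptions chosen for illustration; equal-width, half-open intervals are assumed as well.

```python
def histogram(data, start, width, bins):
    """Assign each point to one of `bins` equal-width intervals [lo, hi)
    and return (lo, hi, relative frequency, density) per interval."""
    n = len(data)
    counts = [0] * bins
    for x in data:
        i = int((x - start) // width)     # index of the interval containing x
        if 0 <= i < bins:                 # points outside all intervals are ignored
            counts[i] += 1
    rows = []
    for i, count in enumerate(counts):
        rel = count / n
        # rel. freq. = density * width, so density = rel. freq. / width
        rows.append((start + i * width, start + (i + 1) * width, rel, rel / width))
    return rows

# Hypothetical heights of ten students in cm, grouped into 5 cm intervals.
heights = [152, 158, 161, 163, 164, 167, 168, 171, 174, 179]
rows = histogram(heights, start=150, width=5, bins=6)
for lo, hi, rel, density in rows:
    print(f"[{lo}, {hi}): rel. freq. {rel:.2f}, density {density:.3f}")

# Adding up intervals works (here the first three), zooming in does not.
print(sum(rel for _, _, rel, _ in rows[:3]))
```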

Empirical Distribution Function

  • base is a series of relative frequencies
  • function $F(x)$ gives the overall relative frequency of data points that are at or below $x$
    • basically splitting the data set in two at the point $x$ and reporting the share on the lower side
  • range is $[0, 1]$
  • height of function value is $F(x) = \#\{x_i \le x\} / n$
  • jump size at $x_i$ is the relative frequency of points between the current value $x_i$ and the previous value $x_{i-1}$
    • for raw (ungrouped) data no observed value lies strictly between $x_{i-1}$ and $x_i$, therefore the jump is the relative frequency of the value $x_i$ itself
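
As a sketch of the idea, the empirical distribution function can be written as "count the points at or below x, divide by n". The helper name `make_edf` and the four sample values are assumptions for illustration.

```python
from bisect import bisect_right

def make_edf(data):
    """Return a function F with F(x) = (number of data points <= x) / n."""
    ordered = sorted(data)
    n = len(ordered)

    def edf(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return edf

F = make_edf([2, 3, 3, 5])               # hypothetical data set
print(F(1), F(2), F(3), F(4), F(5))      # prints: 0.0 0.25 0.75 0.75 1.0
```

The jump at 3 is 0.75 - 0.25 = 0.5, which is the relative frequency of the value 3 (two out of four points); between observed values the function stays flat.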

Scatter Plot

Numerical Methods for Quantitative Data

  • data set … sample or population
  • numerical values are later used for inference
  • central tendency … clustering, centering around certain values
  • variability … spread of data
  • relative standing … relationship of a measurement to rest of the data

Central Tendency

  • mean … average of all values
    • sum up all values and divide by $n$: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    • sample mean $\bar{x}$ can be used to approximate the population mean $\mu$
  • median … center value of the sorted data
    • if $n$ is even (no center value) the median is the mean of the 2 center values
    • robustness: more robust than mean against outliers
  • skewness
    • one tail of dataset has more extreme observations than the other
    • negative skewness → left skewness (long tail at the beginning)
    • positive skewness → right skewness (long tail at the end)
    • skewness … pulls the mean away from the median, toward the long tail
  • mode
    • measurement with the maximum frequency within the dataset
    • i.e. the value that appears most often
    • there can be multiple modes (e.g. 4 and 12 both appear 8 times within the data set)
    • when reading from histograms
      • modal class … interval with the largest relative frequency
      • center of the modal class is taken as the mode (only an approximation)
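
The three measures of central tendency can be tried out with Python's standard `statistics` module. The data set below is made up for illustration; it contains one large value (30) to show how a long right tail pulls the mean above the median.

```python
import statistics

data = [4, 12, 4, 7, 12, 9, 30]           # hypothetical measurements

mean = statistics.mean(data)              # sum of all values divided by n
median = statistics.median(data)          # center value of the sorted data
modes = statistics.multimode(data)        # all values with maximum frequency

print("mean:", mean)
print("median:", median)
print("modes:", modes)

# For even n the median is the mean of the two center values.
print(statistics.median([1, 2, 3, 4]))

# Mean above median hints at positive (right) skewness.
print("right skewed:", mean > median)
```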

Variability

  • range
    • difference between maximum and minimum measurement
    • problem: 2 datasets can have same range but vastly different data
  • sample variance
    • measurement of deviation from mean across entire data set
      • measurement of spread-outedness
    • $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    • important: divide by $n - 1$, not by $n$ as for the mean in Central Tendency
    • unit is squared as well (e.g. dollars squared), even though it is not a useful unit
    • why squared in the sum? otherwise the distances to the mean would cancel each other out
      • absolute value is not a nice function (not differentiable at 0), the square is easier to handle
  • sample standard deviation
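    • square root of the sample variance: $s = \sqrt{s^2}$
    • unit is not squared (e.g. dollars), this is a useful unit
    • Empirical Rule (mound-shaped, roughly symmetric data):
      • share of measurements within $\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$ is approximately 68%, 95%, 99.7%
      • difficult to get a measurement outside of $\bar{x} \pm 3s$ (about 0.3%)
    • Chebyshev’s Rule (any data set):
      • at least $1 - 1/k^2$ of the measurements lie within $k$ standard deviations of the mean ($k > 1$)
      • can be used as a sanity check or crude outlier detection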
Relative Standing

  • percentiles (quantiles)
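    • roughly: $p$th percentile
      • $p$ percent of measurements fall below it, $(100 - p)$ percent above it
  • quartiles
    • just special percentiles (4 regions of 25% each: $Q_L$ = 25th, $M$ = 50th, $Q_U$ = 75th percentile)
    • analogy: a square is a special rectangle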
  • z-score
    • distance between current value and mean, expressed in standard deviations: $z = \frac{x - \bar{x}}{s}$
    • e.g. 2.5 standard deviations above the mean → $z = +2.5$
      • above mean → positive | below mean → negative
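
The measures of relative standing can be sketched with the standard `statistics` module. The helper names and the data are assumptions for illustration. Note that several conventions for computing quartiles exist; `statistics.quantiles` uses its default "exclusive" method here, so other tools may report slightly different cut points.

```python
import statistics

def z_score(x, data):
    """Distance between x and the sample mean, in sample standard deviations."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

def percent_at_or_below(x, data):
    """Percentage of measurements that are <= x."""
    return 100 * sum(v <= x for v in data) / len(data)

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical measurements

print(z_score(9, data))                   # above the mean, so positive
print(z_score(2, data))                   # below the mean, so negative
print(percent_at_or_below(5, data))

# Quartiles as special percentiles: three cut points, four regions.
q_l, m, q_u = statistics.quantiles(data, n=4)
print(q_l, m, q_u)
```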

Outlier Detection

  • reasons:
    • Fat finger: value entered incorrectly (one extra zero, whoops)
    • Confusion: value comes from different population
    • Irrelevant: value is too high to be relevant (too rare event)
      • Is this some peasant joke I am too rich to understand?
  • methods:
    • z-scores (numerical)
      • generally if $|z| > 3$ → outlier
    • box plots (graphical)
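
A minimal sketch of the z-score method, assuming the usual threshold of 3. The function name, the threshold parameter, and the data (twenty plausible values plus one entered with an extra zero) are made up for illustration. Very small data sets can never reach |z| > 3, so the sketch assumes a reasonably large sample with non-zero spread.

```python
import statistics

def z_score_outliers(data, threshold=3.0):
    """Return all values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)            # assumes the data is not constant
    return [x for x in data if abs(x - mean) / s > threshold]

# Hypothetical measurements around 50, plus one fat-finger entry (500).
data = [48, 52, 50, 49, 51] * 4 + [500]
print(z_score_outliers(data))
```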

Midterm Cutoff

Box Plots

  • “Five Point Summary”
    • lower / upper extreme
    • $Q_L$, $M$ and $Q_U$ (L … lower, U … upper)
  • $IQR = Q_U - Q_L$ … spread of center 50% region
  • drawing
    • default in R is vertical, horizontal preferred by humans
      • left … down, right … up
    • draw a rectangle from $Q_L$ to $Q_U$ (width is $IQR$)
    • draw “whiskers” up to $1.5 \cdot IQR$ beyond the box in both directions (inner fences)
    • draw end points at $3 \cdot IQR$ beyond the box in both directions (outer fences)
    • any points which are not on the whiskers or in the rectangle are drawn separately
      • most likely outliers
  • interpretation
    • Median, Lower and Upper Quartile
    • left/right skewness
    • number / values of outliers
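
The numbers behind a box plot can be computed without drawing anything. The sketch below assumes fences at 1.5 and 3 interquartile ranges beyond the quartiles; the function name, the dictionary keys, and the data are illustrative assumptions. Quartiles come from `statistics.quantiles` with its default method, which may differ slightly from what R reports.

```python
import statistics

def box_plot_summary(data):
    """Five-number summary, IQR, fences and suspected outliers."""
    ordered = sorted(data)
    q_l, m, q_u = statistics.quantiles(ordered, n=4)   # lower quartile, median, upper quartile
    iqr = q_u - q_l                                    # spread of the center 50%
    inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)         # inner fences
    outer = (q_l - 3 * iqr, q_u + 3 * iqr)             # outer fences
    return {
        "min": ordered[0], "q_l": q_l, "median": m, "q_u": q_u, "max": ordered[-1],
        "iqr": iqr, "inner_fences": inner, "outer_fences": outer,
        # points beyond the inner fences are drawn separately (likely outliers)
        "suspects": [x for x in ordered if x < inner[0] or x > inner[1]],
    }

data = [2, 4, 4, 4, 5, 5, 7, 9, 30]       # hypothetical measurements
for key, value in box_plot_summary(data).items():
    print(key, value)
```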