Science of data

Descriptive Statistics

  • quantitative vs qualitative data

Inferential Statistics

  • predictions and estimates about a population based on sample data → e.g. stock market forecasts
    • confidence intervals
    • hypothesis testing
    • regression analysis
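
A confidence interval can be sketched in a few lines. This is a minimal illustration, assuming a normal approximation (z = 1.96 for roughly 95%); the function name mean_confidence_interval and the return values in the list are made up for this example and are not real market data.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the population mean.

    Assumes a normal approximation (z = 1.96); for small samples a
    t-quantile would be more appropriate.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * std_err, mean + z * std_err

# Hypothetical daily returns in percent, invented for illustration
returns = [0.4, -0.2, 0.1, 0.3, -0.1, 0.2, 0.0, 0.5, -0.3, 0.1]
low, high = mean_confidence_interval(returns)
print(f"mean = {statistics.mean(returns):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The interval is centered on the sample mean and gets narrower as the sample grows, because the standard error shrinks with the square root of n.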

Definitions

  • experimental unit → what we collect data from
  • population → the complete group of experimental units we want to study
    • e.g. all working adults in Austria
  • variable → characteristic or property in a population
  • sample → a subset of the population (ideally unbiased, not hand-picked)
    • e.g. the working adults in Austria asked by survey X
  • statistical inference → estimate, prediction, etc. about a population based on a sample
    • e.g. cola brand preference estimated from a survey of 1,000 people
  • measure of reliability → statement about the uncertainty of an inference
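
The definitions above fit together as in the following sketch. The population and its 55% share for brand A are assumptions invented for this illustration; the margin of error uses the usual normal approximation for a proportion.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 people, 1 = prefers brand A, 0 = does not.
# The 55% share is an assumption made up for this illustration.
population = [1] * 55_000 + [0] * 45_000

# Sample: 1,000 experimental units drawn at random without replacement
sample = random.sample(population, 1_000)

# Statistical inference: estimate the population share from the sample
p_hat = sum(sample) / len(sample)

# Measure of reliability: approximate 95% margin of error for a proportion
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))

print(f"estimated share: {p_hat:.3f} +/- {margin:.3f}")
```

Here the variable is the brand preference, the estimate p_hat is the inference, and the margin of error is the accompanying measure of reliability.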

Data

Which data?

  • whether data is qualitative or quantitative depends on the experiment
  • clothing size can be qualitative → (S, M, L) or quantitative → (120 cm, 150 cm)

Qualitative Variables

  • cannot be measured in numbers
  • nominal variable
    • classification → e.g. member of religion
    • binary data (yes/no) is always nominal
  • ordinal variable
    • classification like nominal variables
    • have an order
    • can be sorted, ranked
      • e.g. course grades
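
The practical difference between nominal and ordinal variables is which operations are meaningful. A small sketch, assuming placeholder category labels and an A to F grading scale (both are hypothetical choices for this example):

```python
from collections import Counter

# Nominal: categories without order; counting and the mode are meaningful
religions = ["A", "B", "A", "C", "A", "B"]  # hypothetical category labels
counts = Counter(religions)
mode = counts.most_common(1)[0][0]

# Ordinal: categories with an order; sorting and the median are meaningful too
GRADE_ORDER = ["A", "B", "C", "D", "F"]  # best to worst, assumed grading scale
grades = ["C", "A", "B", "F", "B"]
ranked = sorted(grades, key=GRADE_ORDER.index)
median_grade = ranked[len(ranked) // 2]

print(mode)          # A
print(ranked)        # ['A', 'B', 'B', 'C', 'F']
print(median_grade)  # B
```

An average grade is deliberately not computed: the distances between ordinal categories are not defined.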

Quantitative Variables

  • metric variable, measured on a numerical scale (discrete or continuous)
  • interval variable
    • linear scale of values → differences between values are meaningful
      • e.g. temperature in Celsius, calendar time
  • ratio variable
    • ratio between 2 values is meaningful
    • there must be a natural 0
      • e.g. temperature in Celsius has no natural 0, temperature in Kelvin does
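
The interval vs ratio distinction can be checked numerically. In this sketch the helper celsius_to_kelvin and the two example temperatures are chosen for illustration: differences agree on both scales, but the ratio only has a physical meaning on the scale with a natural 0.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15

a_c, b_c = 10.0, 20.0  # two example temperatures in Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on both scales and agree (interval property)
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios depend on where 0 lies, so only the Kelvin ratio is meaningful
ratio_c = b_c / a_c  # 2.0, but 20 °C is not "twice as hot" as 10 °C
ratio_k = b_k / a_k  # about 1.035

print(f"difference: {diff_c:.1f} °C = {diff_k:.1f} K")
print(f"ratio in Celsius: {ratio_c:.3f}, ratio in Kelvin: {ratio_k:.3f}")
```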

Sources

  • published source
  • designed experiment → full control over all variables, artificial environment
  • observational study → data collection within natural environment

Population vs Sample

  • representative sample
    • mostly obtained through random sampling
    • analogy with a pot of soup → a spoonful is only representative if the soup is well mixed, since it could taste different in different parts of the pot
  • ensuring reliability of results
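
A simple random sample gives every experimental unit the same chance of being selected. A minimal sketch, assuming a synthetic population of ages (the values are generated, not real survey data):

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 10,000 working adults (synthetic values)
population = [random.randint(18, 65) for _ in range(10_000)]

# Simple random sample: every unit has the same chance of being selected
sample = random.sample(population, 500)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean: {pop_mean:.1f}, sample mean: {sample_mean:.1f}")
```

With a random sample of this size the sample mean typically lands close to the population mean, which is what makes the sample representative.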

Sampling Error

  • selection bias → certain subsets of the population are under- or over-represented in the sample
  • non-response bias → data is missing from experimental units in the sample that did not respond
  • measurement error → inaccurate values were recorded when the data was collected
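
The effect of selection bias can be simulated. This sketch assumes a synthetic population of weekly working hours with two subgroups (the group sizes, means and spreads are invented); the biased sample reaches only one subgroup.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly working hours, two subgroups (synthetic)
part_time = [random.gauss(20, 3) for _ in range(3_000)]
full_time = [random.gauss(40, 3) for _ in range(7_000)]
population = part_time + full_time

# Random sample: both subgroups appear roughly in their true proportions
random_sample = random.sample(population, 500)

# Biased sample: only full-time workers are reached (selection bias)
biased_sample = random.sample(full_time, 500)

true_mean = statistics.mean(population)
random_error = abs(statistics.mean(random_sample) - true_mean)
biased_error = abs(statistics.mean(biased_sample) - true_mean)

print(f"error of random sample: {random_error:.2f} hours")
print(f"error of biased sample: {biased_error:.2f} hours")
```

Both samples have the same size, so the larger error of the biased sample comes from who was selected, not from how many were asked.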