Science of data
Descriptive Statistics
- quantitative vs qualitative data
Inferential Statistics
- predictions and estimates about a population based on sample data → e.g. stock market
- confidence intervals
- hypothesis testing
- regression analysis
Definitions
- experimental unit → the object we collect data from
- population → group of experimental units we are interested in → a subgroup of all global data
  - e.g. all working adults in Austria
- variable → characteristic or property of the experimental units in a population
- sample → a subset of the population (ideally unbiased, not hand-picked)
  - e.g. the working adults in Austria asked by survey X
- statistical inference → estimate, prediction, etc. about the population based on a sample
  - e.g. the preferred cola brand, inferred from a survey of 1,000 people
- measure of reliability → statement about the uncertainty of an inference
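A minimal sketch of how an inference and its measure of reliability fit together, using the cola survey idea. The counts (540 of 1,000) are made up for illustration, and the interval uses the normal approximation, which is an assumption and only one of several ways to build a confidence interval.

```python
import math

def estimate_proportion(successes: int, n: int, z: float = 1.96):
    """Return the sample proportion and an approximate 95% confidence interval.

    Uses the normal (Wald) approximation; z = 1.96 corresponds to ~95%.
    """
    p_hat = successes / n                      # the inference: point estimate
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of the estimate
    margin = z * se                            # the measure of reliability
    return p_hat, (p_hat - margin, p_hat + margin)

# Hypothetical survey result: 540 of 1,000 respondents prefer brand A.
p_hat, (low, high) = estimate_proportion(540, 1000)
print(f"estimate: {p_hat:.3f}, approx. 95% CI: [{low:.3f}, {high:.3f}]")
```

The point estimate alone says nothing about uncertainty; the interval around it is what makes the statement about the population usable.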
Data
Which data?
- whether data is qualitative or quantitative depends on the experiment
  - e.g. clothing size can be qualitative → (S, M, L) or quantitative → (120 cm, 150 cm)
Qualitative Variables
- cannot be measured in numbers
- nominal variable
  - classification into categories without an order → e.g. religious affiliation
  - binary data points (yes/no) are always nominal
- ordinal variable
  - like nominal variables, but the categories have an order
  - can be sorted, ranked
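The difference between nominal and ordinal variables can be sketched as follows. The category labels and the `SIZE_ORDER` list are hypothetical, chosen only to illustrate which operations make sense for each type.

```python
from collections import Counter

# Nominal: categories without an order; only counting (and the mode) is meaningful.
affiliations = ["A", "B", "A", "C", "A"]          # hypothetical labels
most_common = Counter(affiliations).most_common(1)[0][0]

# Ordinal: categories with an order; sorting and ranking are meaningful too.
SIZE_ORDER = ["S", "M", "L"]                      # assumed ranking of the categories
sizes = ["L", "S", "M", "S"]
sorted_sizes = sorted(sizes, key=SIZE_ORDER.index)

print(most_common)     # most frequent category
print(sorted_sizes)    # sizes from smallest to largest
```

Note that the order is not contained in the labels themselves; it has to be supplied explicitly, and distances between ranks (S to M vs. M to L) are still undefined.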
Quantitative Variables
- measured on a numeric (metric) scale, often continuous
- interval variable
  - linear scale of values → differences between values are meaningful
- ratio variable
  - ratio between 2 values is meaningful
  - there must be a natural 0
  - e.g. temperature in Celsius has no natural 0 (interval), Kelvin does (ratio)
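A small sketch of the temperature example. The two readings (10 and 20 degrees) are arbitrary; the point is that differences survive the change of scale while ratios do not.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to Kelvin."""
    return c + 273.15

a_c, b_c = 10.0, 20.0                             # arbitrary example readings
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on both scales (interval property).
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios are only meaningful with a natural zero (ratio property).
ratio_c = b_c / a_c    # 2.0, but 20 °C is not "twice as hot" as 10 °C
ratio_k = b_k / a_k    # about 1.035, the physically meaningful ratio

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```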
Sources
- published source
- designed experiment → full control over all variables, artificial environment
- observational study → data collection within natural environment
Population vs Sample
- representative sample
  - mostly obtained by random sampling
  - analogy with a pot of soup → it could taste different in different parts of the pot
  - ensures reliability of results
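The soup analogy can be simulated roughly like this. All numbers (salt levels, pot size, spoon size) are invented for illustration; the sketch only contrasts a convenience sample from one part of the pot with a simple random sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical unstirred pot: the bottom half is saltier than the top half.
top = [random.gauss(1.0, 0.1) for _ in range(5000)]
bottom = [random.gauss(3.0, 0.1) for _ in range(5000)]
pot = top + bottom

true_mean = statistics.mean(pot)

# Convenience sample: one spoonful taken only from the top of the pot.
spoon_from_top = statistics.mean(pot[:100])

# Simple random sample: every part of the pot has the same chance of being tasted.
random_spoon = statistics.mean(random.sample(pot, 100))

print(f"pot: {true_mean:.2f}, top spoon: {spoon_from_top:.2f}, random spoon: {random_spoon:.2f}")
```

The random spoonful lands close to the pot's average saltiness, while the spoonful from the top systematically misses it, no matter how large that spoon is.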
Sampling Error
- selection bias → certain subsets of the population are under- or over-represented in the sample
- non-response bias → data could not be collected from all experimental units in the sample
- measurement error → recorded values are inaccurate at the time of analysis
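As an illustration of one of these errors, here is a rough simulation of non-response bias. The population (weekly working hours) and the response model in `responds` are assumptions made up for this sketch: people who work more are assumed to answer the survey less often.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly working hours of 10,000 adults.
population = [random.gauss(40, 8) for _ in range(10_000)]

# A simple random sample of the population is invited to the survey.
sample = random.sample(population, 2000)

def responds(hours: float) -> bool:
    """Assumed response model: the busier someone is, the less likely they answer."""
    p = min(0.9, max(0.1, 0.9 - (hours - 30) * 0.04))
    return random.random() < p

# Only the respondents end up in the data that is analyzed.
respondents = [h for h in sample if responds(h)]

sample_mean = statistics.mean(sample)
respondent_mean = statistics.mean(respondents)
print(f"invited: {sample_mean:.1f} h, respondents only: {respondent_mean:.1f} h")
```

Even though the invited sample is random, the mean of the respondents underestimates the working hours, because the missing answers are not missing at random.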