Basics of Data Analysis
Mean, weighted mean
Median, weighted median
Percentile
Trimmed mean
Drop a fixed number of sorted values from each end (the p smallest and p largest values are omitted), then take the mean of the remaining values
Robust: not sensitive to extreme values
Outlier
Mode
most frequently occurring value
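The location estimates listed above can be sketched with the Python standard library. This is a minimal illustration, not a reference implementation: the helper names (weighted_mean, weighted_median, trimmed_mean) and the sample data are assumptions made for this example, and the weighted median follows one common convention (the smallest value at which the cumulative weight reaches half of the total weight).

```python
from statistics import mean, median, mode


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight.

    Other conventions exist (for example, averaging at an exact tie).
    """
    half = sum(weights) / 2
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value


def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values."""
    ordered = sorted(values)
    if 2 * p >= len(ordered):
        raise ValueError("p trims away every value")
    return mean(ordered[p:len(ordered) - p])


# Hypothetical sample; 60 plays the role of an extreme value.
data = [1, 3, 3, 3, 4, 4, 5, 6, 60]

print("mean:", mean(data))                    # pulled upward by 60
print("median:", median(data))                # robust to 60
print("trimmed mean:", trimmed_mean(data, 1)) # drops 1 and 60 first
print("mode:", mode(data))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
print("weighted median:", weighted_median([1, 2, 3], [1, 1, 2]))
```

On this sample the mean is pulled far above the bulk of the data by the single large value, while the median and the trimmed mean stay near 4, which is what "robust" means in practice.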
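Mean absolute deviation
mean of the absolute deviations from the mean
MAD (Median absolute deviation from the median)
median of the absolute deviations from the median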
Variance and standard deviation can be strongly affected by outliers, because they are based on squared deviations.
The median absolute deviation from the median is a robust alternative.
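A small sketch of why the median absolute deviation is preferred when outliers are present. The function names and the two sample lists are assumptions for illustration; pstdev is the population standard deviation from Python's statistics module.

```python
from statistics import mean, median, pstdev


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


clean = [1, 3, 3, 3, 4, 4, 4, 6, 6]
with_outlier = clean[:-1] + [60]  # replace the last 6 with an extreme value

for name, sample in (("clean", clean), ("with outlier", with_outlier)):
    print(
        name,
        "std:", round(pstdev(sample), 2),
        "mean abs dev:", round(mean_absolute_deviation(sample), 2),
        "median abs dev:", median_absolute_deviation(sample),
    )
```

In this example the standard deviation grows by roughly an order of magnitude once the outlier is added, while the median absolute deviation stays at 1 for both samples.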
Likewise, the range (maximum value minus minimum value) is sensitive to outliers.
Use the interquartile range (IQR) instead.
The interquartile range is the amount of spread in the middle 50% of a dataset.
In other words, it is the distance between the first quartile (Q1) and the third quartile (Q3).
Here's how to find the IQR:
Step 1: Put the data in order from least to greatest.
Step 2: Find the median. If the number of data points is odd, the median is the middle data point. If the number of data points is even, the median is the average of the middle two data points.
Step 3: Find the first quartile (Q1). The first quartile is the median of the data points to the left of the median in the ordered list.
Step 4: Find the third quartile (Q3). The third quartile is the median of the data points to the right of the median in the ordered list.
Step 5: Calculate the IQR by subtracting Q1 from Q3: IQR = Q3 - Q1
Example: find the IQR of these scores: 1, 3, 3, 3, 4, 4, 4, 6, 6
There are 9 data points, so the median is the 5th value: 4
25th percentile (Q1): median of (1, 3, 3, 3) = (3+3)/2 = 3
75th percentile (Q3): median of (4, 4, 6, 6) = (4+6)/2 = 5
IQR = Q3 - Q1 = 5 - 3 = 2
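The five steps can be sketched in Python as follows. The function name iqr_median_of_halves is an assumption for this example. It follows the "median of each half" rule described in the steps, which excludes the median itself when the number of points is odd; library percentile functions often interpolate differently and may return slightly different quartiles.

```python
from statistics import median


def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(values)                      # Step 1: sort
    n = len(ordered)
    half = n // 2                                 # Step 2: locate the median
    lower = ordered[:half]                        # points left of the median
    # For odd n, skip the median itself; for even n, take the upper half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = median(lower)                            # Step 3
    q3 = median(upper)                            # Step 4
    return q1, q3, q3 - q1                        # Step 5


scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
q1, q3, iqr = iqr_median_of_halves(scores)
print(q1, q3, iqr)  # 3.0 5.0 2.0
```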
Box: Shows median, 75th and 25th percentiles.
Whiskers: extend from the box to the most extreme data points that are still inliers (no more than 1.5*IQR beyond the box).
Outliers: shown outside the whisker range.
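The quantities a box plot draws can be computed directly. This is a sketch under stated assumptions: the function name box_plot_stats and the sample data are hypothetical, quartiles use the median-of-halves rule, and the 1.5*IQR cutoff is the usual convention rather than a fixed law (plotting libraries typically let you change it).

```python
from statistics import median


def box_plot_stats(values):
    """Median, quartiles, whisker ends and outliers for a box plot."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    # Fences: points beyond these limits are drawn as outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inliers = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "median": median(ordered),
        "q1": q1,
        "q3": q3,
        # Whiskers stop at the most extreme inliers, not at the fences.
        "whiskers": (inliers[0], inliers[-1]),
        "outliers": outliers,
    }


stats = box_plot_stats([1, 3, 3, 3, 4, 4, 4, 6, 6, 15])
print(stats)
```

Here the value 15 lies above Q3 + 1.5*IQR = 10.5, so it is reported as an outlier and the upper whisker ends at 6.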
A density plot is similar to a histogram, but shows the distribution of the data as a continuous line.
It can be thought of as a smoothed histogram, usually computed with a kernel density estimate.
Parametric density estimation
assume a distribution family (e.g. a normal distribution) and estimate its parameters from the data (mean, standard deviation, etc.)
Nonparametric density estimation
Kernel Density Estimation: a nonparametric method that uses a dataset to estimate the probability density at new points.
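A minimal sketch of kernel density estimation with a Gaussian kernel, written with the standard library only. The function name gaussian_kde_sketch, the sample, and the bandwidth of 1.0 are assumptions for illustration; in practice the bandwidth controls the amount of smoothing and is usually chosen by a rule of thumb or cross-validation.

```python
import math


def gaussian_kde_sketch(sample, bandwidth):
    """Return a function that estimates the density at a point x.

    Each data point contributes a Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    """
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
density = gaussian_kde_sketch(scores, bandwidth=1.0)

# The density is higher where the data is concentrated.
for x in (0, 2, 4, 6, 10):
    print(x, round(density(x), 4))
```

Evaluating density on a fine grid and connecting the points gives the continuous line of a density plot.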
bar chart: x-axis represents the different categories of a factor variable
for categorical data
histogram: x-axis represents values of a single variable
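The difference between the two charts shows up in how the bar heights are computed. In this sketch (function names, bin width and sample values are assumptions), a bar chart counts occurrences of each category, while a histogram first groups numeric values into equal-width bins.

```python
from collections import Counter


def bar_counts(categories):
    """Bar chart heights: one count per distinct category."""
    return dict(Counter(categories))


def histogram_counts(values, bin_width, start=0):
    """Histogram heights: counts per numeric bin, keyed by left bin edge."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)   # which bin v falls into
        counts[start + index * bin_width] += 1
    return dict(sorted(counts.items()))


print(bar_counts(["red", "blue", "red", "green"]))
print(histogram_counts([1, 3, 3, 3, 4, 4, 4, 6, 6], bin_width=2))
```

In the histogram the x-axis order is meaningful (bins are numeric intervals), whereas the categories in a bar chart can be shown in any order.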
Positively correlated: when X goes up, then Y goes up
Negatively correlated: when X goes up, then Y goes down
Correlation coefficient
A metric that measures the extent to which numeric variables are associated with one another (range from -1 to 1)
One metric is Pearson's correlation coefficient
For two numeric variables, a scatter plot can be used to check the correlation visually
other options include heat maps and contour plots
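Pearson's correlation coefficient can be computed as the sum of products of deviations from the means, divided by the product of the square roots of the summed squared deviations. The sketch below uses only the standard library; the function name pearson_r and the sample lists are assumptions for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how x and y vary together.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much each varies on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfectly positive relationship
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfectly negative relationship
print(pearson_r(x, [2, 1, 4, 3, 5]))    # positive but imperfect
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear association. Like the mean and standard deviation, this coefficient is sensitive to outliers.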