Basics of Data Analysis

Common terms

  • Mean, weighted mean

  • Median, weighted median

  • Percentile

  • Trimmed mean

    • sort the values, drop a fixed number from each end (the p smallest and p largest), then average the rest

  • Robust: not sensitive to extreme values

  • Outlier

  • Mode

    • most frequently occurring value
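The terms above can be tried out on a small sample. This is a minimal sketch using only Python's standard `statistics` module; the sample values, the weights, and the helper names `weighted_mean` and `trimmed_mean` are made up for illustration.

```python
import statistics

# Made-up sample; 100 plays the role of an extreme value.
values = [2, 3, 3, 5, 7, 8, 100]
# Hypothetical weights that down-weight the suspicious last point.
weights = [1, 1, 1, 1, 1, 1, 0.1]

def weighted_mean(xs, ws):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, p):
    """Mean after dropping the p smallest and p largest values."""
    if p < 0 or 2 * p >= len(xs):
        raise ValueError("p must leave at least one value")
    ordered = sorted(xs)
    return statistics.mean(ordered[p:len(ordered) - p])

print("mean:", statistics.mean(values))        # pulled up by the value 100
print("median:", statistics.median(values))    # barely affected by it
print("mode:", statistics.mode(values))        # most frequent value
print("weighted mean:", weighted_mean(values, weights))
print("trimmed mean (p=1):", trimmed_mean(values, 1))
```

The mean lands far above the median here, while the median and the trimmed mean stay close to the bulk of the data, which is what "robust" refers to.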

MAD

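MAD (mean absolute deviation): the average absolute distance of the values from the mean.

Median absolute deviation from the median: the median of the absolute distances of the values from the median.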

Variance and standard deviation can be strongly affected by outliers.

  • the median absolute deviation is a more robust alternative

The range (max value minus min value) is also sensitive to outliers.

  • Use the IQR instead
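To see the difference in robustness, the sketch below compares the standard deviation, the range, and both kinds of MAD on a clean sample and on the same sample with one outlier. The data and the function names are hypothetical, chosen only for illustration.

```python
import statistics

def mean_abs_deviation(xs):
    """Average absolute distance from the mean."""
    mu = statistics.mean(xs)
    return statistics.mean([abs(x - mu) for x in xs])

def median_abs_deviation(xs):
    """Median absolute distance from the median."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs])

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]  # same data, last value replaced by an outlier

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name)
    print("  std deviation:      ", statistics.stdev(data))
    print("  range:              ", max(data) - min(data))
    print("  mean abs deviation: ", mean_abs_deviation(data))
    print("  median abs deviation:", median_abs_deviation(data))
```

On this sample the median absolute deviation is the same with and without the outlier, while the standard deviation, the range, and the mean absolute deviation all grow substantially.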

IQR: Interquartile range

The interquartile range is the spread of the middle 50% of a dataset.

In other words, it is the distance between the first quartile ($Q_1$) and the third quartile ($Q_3$):

$\text{IQR} = Q_3 - Q_1$

Here's how to find the IQR:

Step 1: Put the data in order from least to greatest.

Step 2: Find the median. If the number of data points is odd, the median is the middle data point. If the number of data points is even, the median is the average of the middle two data points.

Step 3: Find the first quartile ($Q_1$). The first quartile is the median of the data points to the left of the median in the ordered list.

Step 4: Find the third quartile ($Q_3$). The third quartile is the median of the data points to the right of the median in the ordered list.

Step 5: Calculate the IQR by subtracting the first quartile from the third: $\text{IQR} = Q_3 - Q_1$

Example

Find the IQR of these scores: 1, 3, 3, 3, 4, 4, 4, 6, 6

  • 9 data points in total, so the median is the 5th value: 4

  • 75th percentile ($Q_3$): median of (4, 4, 6, 6) = (4+6)/2 = 5

  • 25th percentile ($Q_1$): median of (1, 3, 3, 3) = (3+3)/2 = 3

  • IQR = $Q_3 - Q_1$ = 5 - 3 = 2
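The five steps can be sketched in code as follows. This follows the "median of each half" convention described above; other quartile conventions exist and can give slightly different values. The function name `quartiles` is an assumption for illustration.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    if len(data) < 2:
        raise ValueError("need at least two data points")
    ordered = sorted(data)                      # Step 1: order the data
    n = len(ordered)
    half = n // 2
    median = statistics.median(ordered)         # Step 2: median
    lower = ordered[:half]                      # points left of the median
    # For odd n the middle point itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = statistics.median(lower)               # Step 3: first quartile
    q3 = statistics.median(upper)               # Step 4: third quartile
    return q1, median, q3

scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
q1, med, q3 = quartiles(scores)
iqr = q3 - q1                                   # Step 5: IQR = Q3 - Q1
print("Q1:", q1, "median:", med, "Q3:", q3, "IQR:", iqr)
```

For the scores in the example this gives a first quartile of 3, a median of 4, a third quartile of 5, and an IQR of 2.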

Boxplot

Box: shows the median and the 25th and 75th percentiles.

Whiskers: extend from the box to the most extreme inlier data points (those within 1.5*IQR of the box edges).

Outliers: points beyond the whisker range, plotted individually.
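The numbers behind a boxplot can be computed by hand. This is a rough sketch assuming the common 1.5*IQR rule and the "inclusive" quartile method of `statistics.quantiles` (which interpolates linearly, so it can differ slightly from the median-of-halves method). The data and the name `boxplot_stats` are hypothetical.

```python
import statistics

def boxplot_stats(data, whisker_factor=1.5):
    """Compute box edges, whisker ends, and outliers for one variable."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inliers = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme inlier points, not at the fences.
        "whisker_low": inliers[0],
        "whisker_high": inliers[-1],
        "outliers": outliers,
    }

scores = [1, 3, 3, 3, 4, 4, 4, 6, 6, 15]  # made-up data with one large value
stats = boxplot_stats(scores)
print(stats)
```

With this sample the value 15 falls beyond the upper fence and is reported as an outlier, while the whiskers end at 1 and 6.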

Density Plots and Estimates

Similar to a histogram, a density plot shows the distribution of data as a continuous line.

A smoothed histogram can be plotted using a kernel density estimate.

  • Parametric density estimation

    • assume a distribution family (e.g. a normal distribution) and estimate its parameters (mean, standard deviation, etc.) from the data

  • Nonparametric density estimation

    • Kernel Density Estimation: a nonparametric method that uses a dataset to estimate the probability density at new points.
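The two approaches can be compared on a small sample. The sketch below fits a normal distribution with `statistics.NormalDist` (parametric) and builds a simple Gaussian kernel density estimate by hand (nonparametric). The sample, the bandwidth of 0.5, and the helper name `gaussian_kde` are assumptions for illustration; in practice the bandwidth has to be chosen with some care.

```python
import math
import statistics

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # One Gaussian kernel centered on each sample, summed and normalized.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [1.0, 1.5, 2.0, 2.2, 5.0]  # made-up data

# Parametric: assume a normal distribution, estimate mean and std deviation.
normal_fit = statistics.NormalDist.from_samples(samples)

# Nonparametric: no distribution family assumed.
kde = gaussian_kde(samples, bandwidth=0.5)

for x in [1.0, 2.0, 3.5, 5.0]:
    print(f"x={x}: normal pdf={normal_fit.pdf(x):.3f}, kde={kde(x):.3f}")
```

The kernel estimate can follow the separate cluster around 5, whereas the single fitted normal curve cannot represent two peaks.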

Bar charts, Histograms

bar chart: x-axis represents the different categories of a factor variable

  • for categorical data

histogram: x-axis represents the values of a single numeric variable, grouped into bins
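The difference shows up in how the bar heights are computed: a bar chart counts occurrences per category, while a histogram counts values per numeric bin. A minimal sketch with made-up data and a hypothetical helper `histogram_counts` using equal-width bins:

```python
from collections import Counter

# Bar chart: one bar per category, height = number of occurrences.
colors = ["red", "blue", "red", "green", "blue", "red"]
bar_heights = Counter(colors)
print(bar_heights)

def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # The right edge `high` is placed in the last bin.
            idx = min(int((v - low) / width), bins - 1)
            counts[idx] += 1
    return counts

# Histogram: x-axis is a numeric range split into bins.
heights_cm = [150, 152, 161, 165, 168, 171, 174, 180]
hist = histogram_counts(heights_cm, low=150, high=180, bins=3)
print(hist)  # counts for [150, 160), [160, 170), [170, 180]
```

The order of bars in a bar chart is arbitrary, whereas histogram bins are ordered along the numeric axis.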

Correlation

Positively correlated: when X goes up, Y tends to go up

Negatively correlated: when X goes up, Y tends to go down

Correlation coefficient

  • A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to 1)

  • One such metric is Pearson's correlation coefficient, which measures linear association
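Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch in plain Python, with made-up data and the function name `pearson` chosen for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of summed squared deviations (proportional to the std devs).
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]      # tends to rise with x
y_down = [10, 8, 6, 4, 2]   # falls exactly linearly with x

print("positive:", pearson(x, y_up))
print("negative:", pearson(x, y_down))
```

A perfectly linear decreasing relationship gives -1, a perfectly linear increasing one gives 1, and noisier relationships fall in between.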

For two variables, a scatter plot can be used to check the correlation visually

  • other options include heat maps and contour plots
