# Basics of Data Analysis

## Basic Data Analysis

### Common terms

* Mean, weighted mean
* Median, weighted median
* Percentile
* Trimmed mean
  * mean computed after dropping a fixed number of sorted values at each end (the p smallest and p largest values are omitted)
* Robust: not sensitive to extreme values
* Outlier
* Mode
  * most frequently occurring value
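
The terms above can be sketched with Python's standard library. The helper names `weighted_mean` and `trimmed_mean` and the sample values are assumptions made for illustration only; the last value, 60, is a deliberate outlier.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, p):
    """Drop the p smallest and p largest values, then average the rest."""
    ordered = sorted(values)
    if 2 * p >= len(ordered):
        raise ValueError("p is too large for this dataset")
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)

# Illustrative data; 60 is an outlier.
data = [1, 3, 3, 3, 4, 4, 6, 6, 60]

mean_value = statistics.mean(data)        # pulled upward by the outlier
median_value = statistics.median(data)    # robust: middle of the sorted values
trimmed_value = trimmed_mean(data, p=1)   # drops 1 and 60 before averaging
mode_value = statistics.mode(data)        # most frequently occurring value
weighted_value = weighted_mean([1, 2, 3], [3, 1, 1])

print(mean_value, median_value, trimmed_value, mode_value, weighted_value)
```

With this data the mean is 10 while the median is 4, which shows why the median and the trimmed mean are described as robust.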

### MAD

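MAD can stand for two different measures of variability:

* Mean absolute deviation: the mean of the absolute deviations from the mean
* Median absolute deviation from the median: the median of the absolute deviations from the median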

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-34f6c74312338f8f565f9fee99a8f1ade33c0c06%2Fimage.png?alt=media)

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-df68132375ed2bb2e8402e4e27664d067588738b%2Fimage.png?alt=media)

Variance and standard deviation can be strongly affected by outliers.

* The median absolute deviation is a robust alternative

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-6b7bcc42db7952ecb9dfebdd892a76d3549923d7%2Fimage.png?alt=media)
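
As a rough illustration of this robustness, the sketch below compares the standard deviation, the mean absolute deviation, and the median absolute deviation on a dataset with and without one outlier. The function names and the sample data are assumptions chosen for the example.

```python
import statistics

def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    m = statistics.mean(values)
    return sum(abs(x - m) for x in values) / len(values)

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

clean = [1, 3, 3, 3, 4, 4, 4, 6, 6]
with_outlier = [1, 3, 3, 3, 4, 4, 4, 6, 60]  # last value replaced by an outlier

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(
        name,
        "std:", round(statistics.pstdev(data), 2),
        "mean abs dev:", round(mean_absolute_deviation(data), 2),
        "median abs dev:", median_absolute_deviation(data),
    )
```

On this data the median absolute deviation is the same for both lists, while the standard deviation grows by roughly an order of magnitude once the outlier is present.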

The range (the difference between the maximum and minimum values) is also sensitive to outliers.

* Use the IQR instead

### IQR: Interquartile range

The interquartile range is the spread of the middle 50% of a dataset.

In other words, it is the distance between the first quartile (Q1) and the third quartile (Q3):

$$
IQR = Q_3 - Q_1
$$

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-9284b783dbd74e9786ec9153dd2b3effd475d3b8%2Fimage.png?alt=media)

Here's how to find the IQR:

**Step 1:** Put the data in order from least to greatest.

**Step 2:** Find the median. If the number of data points is odd, the median is the middle data point. If the number of data points is even, the median is the average of the middle two data points.

**Step 3:** Find the first quartile (Q1). The first quartile is the median of the data points to the left of the median in the ordered list.

**Step 4:** Find the third quartile (Q3). The third quartile is the median of the data points to the right of the median in the ordered list.

**Step 5:** Calculate the IQR by subtracting the first quartile from the third quartile: IQR = Q3 - Q1

#### Example

Find the IQR of these scores: 1, 3, 3, 3, 4, 4, 4, 6, 6

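* 9 data points in total, so the median is the 5th value: 4
* Q1 (25th percentile): median of the lower half (1, 3, 3, 3) = (3+3)/2 = 3
* Q3 (75th percentile): median of the upper half (4, 4, 6, 6) = (4+6)/2 = 5
* IQR = Q3 - Q1 = 5 - 3 = 2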
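
The five steps can be sketched in Python as follows. The function name `quartiles_by_halves` is an assumption for this example. It implements the "median of each half" rule described above, excluding the overall median from both halves when the number of data points is odd.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)                 # Step 1: order the data
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    median = statistics.median(ordered)      # Step 2: overall median
    lower = ordered[:half]                   # points left of the median
    # For odd n, skip the middle point itself; for even n, take the top half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = statistics.median(lower)            # Step 3: first quartile
    q3 = statistics.median(upper)            # Step 4: third quartile
    return q1, median, q3

scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
q1, q2, q3 = quartiles_by_halves(scores)
iqr = q3 - q1                                # Step 5: IQR = Q3 - Q1
print(q1, q2, q3, iqr)
```

Library routines that interpolate between data points use a different quartile convention and may return slightly different values for the same data.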

### Boxplot

Box: shows the median and the 25th and 75th percentiles.

Whiskers: extend to the most extreme inlier data points (at most 1.5\*IQR beyond the box edges).

Outliers: points beyond the whisker range, plotted individually.

![](https://kr.mathworks.com/help/examples/stats/win64/CreateBoxPlotsForGroupedDataExample_01.png)
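
A minimal sketch of the numbers behind a boxplot, assuming the common 1.5\*IQR whisker rule and the median-of-halves quartile convention. The function name `boxplot_stats`, the parameter `k`, and the sample data are assumptions for illustration; plotting libraries may compute quartiles slightly differently.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Box edges, whisker ends, and outliers under the k * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1, q3 = statistics.median(lower), statistics.median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inliers = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": statistics.median(ordered),
        "q3": q3,
        # Whiskers stop at the most extreme data points inside the fences.
        "whiskers": (min(inliers), max(inliers)),
        "outliers": outliers,
    }

stats = boxplot_stats([1, 3, 3, 3, 4, 4, 4, 6, 6, 15])
print(stats)
```

Here the value 15 lies above Q3 + 1.5\*IQR, so it is reported as an outlier and the upper whisker ends at 6.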

### Density Plots and Estimates

Similar to a histogram, a density plot shows the distribution of data as a **continuous line.**

A smooth histogram can be plotted using a kernel density estimate.

* Parametric density estimation
  * assumes a distribution family, e.g. a normal distribution, and estimates its parameters (mean, standard deviation, etc.)
* Nonparametric density estimation
  * **Kernel Density Estimation**: a nonparametric method that uses a dataset to estimate the probability density at new points.

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-3ab8ac1ecdfb26e6e59dcd9c91b07e158b00837e%2Fimage.png?alt=media)
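
To make kernel density estimation concrete, here is a minimal sketch that places a Gaussian kernel on every sample and averages them. The function name `make_gaussian_kde`, the bandwidth of 0.8, and the sample data are assumptions for this example; in practice the bandwidth is usually chosen by a rule of thumb or by cross-validation.

```python
import math

def make_gaussian_kde(samples, bandwidth):
    """Return a function estimating the density at x from the samples."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [1, 3, 3, 3, 4, 4, 4, 6, 6]
density = make_gaussian_kde(samples, bandwidth=0.8)

# Evaluate on a grid from -5 to 15 to obtain the smooth curve.
step = 0.1
grid = [i * step for i in range(-50, 151)]
curve = [density(x) for x in grid]
area = sum(curve) * step  # should be close to 1 for a valid density

print(round(density(3.5), 3), round(density(10.0), 6), round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve that may hide structure.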

### **Bar charts, Histograms**

Bar chart: the x-axis represents the different categories of a factor variable

* for categorical data

Histogram: the x-axis represents the values of a single numeric variable, grouped into bins
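
The difference can be sketched by computing the bar heights for each kind of chart. The sample data and the bin width of 2 are assumptions for illustration.

```python
from collections import Counter

# Bar chart: one bar per category of a categorical variable.
colors = ["red", "blue", "red", "green", "blue", "red"]
bar_heights = Counter(colors)

# Histogram: one bar per numeric interval (bin) of a single variable.
scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
bin_width = 2
# Each key is the left edge of a bin [edge, edge + bin_width).
hist_counts = Counter((s // bin_width) * bin_width for s in scores)

print(dict(bar_heights))
print(dict(sorted(hist_counts.items())))
```

The bar chart categories have no inherent order, whereas the histogram bins are ordered along the numeric axis.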

### **Correlation**

Positively correlated: when X goes up, Y tends to go up

Negatively correlated: when X goes up, Y tends to go down

Correlation coefficient

* A metric that measures the extent to which numeric variables are associated with one another (range from -1 to 1)
* A common choice is Pearson's correlation coefficient

For two variables, a scatter plot can be used to check the correlation

* other options include heat maps and contour plots

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-deda9f6f071c7285449f7c9c2d143eb15f5c3ef0%2Fimage.png?alt=media)
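
Pearson's correlation coefficient can be sketched directly from its definition: the sum of the products of the deviations from the means, divided by the product of the square roots of the summed squared deviations. The function name `pearson_r` and the sample data are assumptions for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises with x: positively correlated
y_down = [10, 8, 6, 4, 2]   # falls as x rises: negatively correlated
y_noisy = [2, 1, 4, 3, 5]   # rises with x, but not perfectly

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_noisy = pearson_r(x, y_noisy)
print(r_up, r_down, r_noisy)
```

The result is close to 1 or -1 for the two perfectly linear cases and 0.8 for the noisy case. Pearson's coefficient only measures linear association, so a scatter plot remains a useful check.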

