Basics of Data Analysis


Last updated 3 years ago


Basic Data Analysis

Common terms

  • Mean, weighted mean

  • Median, weighted median

  • Percentile

  • Trimmed mean

    • the mean after dropping a fixed number of sorted values from each end (the p smallest and p largest values are omitted)

  • Robust: not sensitive to extreme values

  • Outlier

  • Mode

    • most frequently occurring value
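The terms above can be sketched in a few lines of Python. This is only an illustration using the standard library; the helper names `weighted_mean`, `trimmed_mean` and `mode`, and the sample data with its extreme value of 60, are assumptions made for this example.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values."""
    ordered = sorted(values)
    if 2 * p >= len(ordered):
        raise ValueError("p is too large for this dataset")
    return mean(ordered[p:len(ordered) - p])


def mode(values):
    """Most frequently occurring value (the first one seen if there is a tie)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical sample: 60 is an extreme value (outlier).
data = [1, 3, 3, 3, 4, 4, 4, 6, 60]

print("mean:", mean(data))                      # pulled up by the outlier
print("median:", median(data))                  # robust to the outlier
print("trimmed mean:", trimmed_mean(data, 1))   # drops 1 and 60 first
print("mode:", mode(data))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
```

The mean is dragged toward 60, while the median and the trimmed mean stay near the bulk of the data, which is what "robust" means here.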

MAD

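MAD can stand for mean absolute deviation or median absolute deviation. The robust one is:

Median absolute deviation from the median: the median of the absolute differences between each value and the median of the data.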

Variance and standard deviation can be strongly affected by outliers.

  • Use the median absolute deviation instead

The range (max minus min) is also sensitive to outliers.

  • Use the IQR instead
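A minimal sketch of the median absolute deviation, compared with the standard deviation on data with and without an outlier. The function name `median_abs_deviation` and both sample lists are assumptions for illustration only.

```python
from statistics import median, pstdev


def median_abs_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


# Hypothetical samples: identical except for the last value.
clean = [1, 3, 3, 3, 4, 4, 4, 6, 6]
with_outlier = [1, 3, 3, 3, 4, 4, 4, 6, 60]

for name, sample in (("clean", clean), ("with outlier", with_outlier)):
    print(name,
          "| std:", round(pstdev(sample), 2),
          "| MAD:", median_abs_deviation(sample))
```

On these samples the standard deviation grows sharply once the outlier is added, while the median absolute deviation does not change.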

IQR: Interquartile range

The interquartile range is the amount of spread in the middle 50% of a dataset.

In other words, it is the distance between the first quartile ($Q_1$) and the third quartile ($Q_3$).

$$IQR = Q_3 - Q_1$$

Here's how to find the IQR:

Step 1: Put the data in order from least to greatest.

Step 2: Find the median. If the number of data points is odd, the median is the middle data point. If the number of data points is even, the median is the average of the middle two data points.

Step 3: Find the first quartile ($Q_1$). The first quartile is the median of the data points to the left of the median in the ordered list.

Step 4: Find the third quartile ($Q_3$). The third quartile is the median of the data points to the right of the median in the ordered list.

Step 5: Calculate the IQR by subtracting the first quartile from the third: $IQR = Q_3 - Q_1$

Example

Find the IQR of these scores: 1, 3, 3, 3, 4, 4, 4, 6, 6

  • 9 data points in total; median: 4 (the 5th value)

  • 75th percentile ($Q_3$): median of (4, 4, 6, 6) = (4+6)/2 = 5

  • 25th percentile ($Q_1$): median of (1, 3, 3, 3) = (3+3)/2 = 3

  • IQR = 5 - 3 = 2
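The five steps can be sketched in Python as follows. The function names `quartiles` and `iqr` are assumptions for this example, and the code follows the median-of-halves rule described in the steps. Library routines that interpolate between data points can return slightly different quartiles for the same data.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) using the median-of-halves rule."""
    ordered = sorted(values)                 # Step 1: sort
    n = len(ordered)
    half = n // 2                            # Step 2: locate the median
    lower = ordered[:half]                   # points left of the median
    # For odd n the median itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(upper)      # Steps 3 and 4


def iqr(values):
    """Step 5: IQR = Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1


scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
q1, q3 = quartiles(scores)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr(scores))
```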

Boxplot

Box: shows the median and the 25th and 75th percentiles.

Whiskers: extend from the box to the most extreme inlier data points (those within 1.5 × IQR of the box edges).

Outliers: points beyond the whiskers, drawn individually.
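As a rough sketch, the numbers behind a boxplot can be computed without any plotting library. The function `boxplot_stats`, its `whisker` parameter, and the sample with the value 15 are assumptions for illustration. Quartiles use the median-of-halves rule, so a plotting library with a different quartile method may draw slightly different boxes.

```python
from statistics import median


def boxplot_stats(values, whisker=1.5):
    """Compute box, whisker ends and outliers for one variable."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    spread = q3 - q1                          # the IQR
    low_fence = q1 - whisker * spread
    high_fence = q3 + whisker * spread
    inliers = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "median": median(ordered),
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inliers),          # whiskers stop at real data points
        "whisker_high": max(inliers),
        "outliers": outliers,
    }


# Hypothetical sample: 15 lies more than 1.5 * IQR above Q3.
sample = [1, 3, 3, 3, 4, 4, 4, 6, 6, 15]
stats = boxplot_stats(sample)
print(stats)
```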

Density Plots and Estimates

Similar to a histogram, a density plot shows the distribution of data as a continuous line.

A smoothed histogram can be plotted using a kernel density estimate.

  • Parametric density estimation

    • e.g., assume a normal distribution and estimate its mean and standard deviation

  • Nonparametric density estimation

    • Kernel Density Estimation: a nonparametric method that uses a dataset to estimate the probability density at new points.
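A minimal sketch of kernel density estimation with a Gaussian kernel, written in plain Python. The function name `make_gaussian_kde` and the bandwidth of 1.0 are assumptions chosen for illustration. In practice the bandwidth is a tuning choice that controls how smooth the curve is.

```python
import math


def make_gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at a point x.

    Each data point contributes a Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    """
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
density = make_gaussian_kde(scores, bandwidth=1.0)  # bandwidth is an assumed value

# Evaluate on a grid, e.g. to draw the smooth curve of a density plot.
for x in range(0, 9, 2):
    print(x, round(density(x), 4))
```

The estimate is higher near 3 and 4, where the data are concentrated, and close to zero far from any data point.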

Barcharts, Histogram

Bar chart: the x-axis represents the different categories of a factor variable

  • for categorical data

Histogram: the x-axis represents the values of a single numeric variable, grouped into bins
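The difference shows up in how the counts are produced. A rough sketch, where the function names, the category labels, and the bin width of 2 are all assumptions for illustration: a bar chart counts occurrences per category, while a histogram first groups numeric values into bins.

```python
from collections import Counter


def bar_counts(categories):
    """Bar chart data: one count per distinct category."""
    return dict(Counter(categories))


def histogram_counts(values, bin_width, start=0):
    """Histogram data: counts per bin, keyed by the bin's left edge."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)
        counts[start + index * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical categorical data.
labels = ["A", "B", "A", "C", "A", "B"]
print(bar_counts(labels))

# Numeric data grouped into bins [0, 2), [2, 4), [4, 6), [6, 8).
scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
print(histogram_counts(scores, bin_width=2))
```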

Correlation

Positively correlated: when X goes up, Y tends to go up

Negatively correlated: when X goes up, Y tends to go down

Correlation coefficient

  • A metric that measures the extent to which numeric variables are associated with one another (range from -1 to 1)

  • One metric is Pearson's correlation coefficient

For two data variables, a scatter plot can be used to check the correlation

  • other options include heat maps and contour plots
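Pearson's correlation coefficient can be sketched directly from its definition: the covariance of X and Y divided by the product of their standard deviations. The function name `pearson_r` and the sample values are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient, in the range -1 to 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # Y rises with X
print(pearson_r(x, [10, 8, 6, 4, 2]))   # Y falls as X rises
print(pearson_r(x, [2, 1, 4, 3, 5]))    # positive but weaker association
```

Values near 1 or -1 indicate a strong linear association. A value near 0 indicates little linear association, which is why a scatter plot is still worth checking for nonlinear patterns.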