Chapter 1 Introduction to Statistics and R
1.1 Statistics
1.1.1 What is Statistics?
The term statistics has three different meanings.
- Plural of the term statistic, which refers to any function of sample values, for example, the sample mean \(\bar x = \frac{\sum_{i=1}^n x_i}{n}\)
- Table of values
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
- Technique of dealing with data
  - Collection
  - Organization
  - Analysis (such as regression analysis)
  - Interpretation
  - Presentation
1.1.2 Statistics vs Data Science
Statistics deals mainly with analyzing and interpreting data, while data science places more emphasis on predictive analytics.
1.1.3 Statistics vs Mathematics
Consider the following equation:
\(Y = X + 0.2 \times X\)
Say \(X\) is the basic salary of an employee in a company and \(Y\) is the gross salary, computed by adding 20% of \(X\) to \(X\). In this scenario, given the value of \(X\), we can always tell exactly what the gross salary will be.
What if we have an equation such as the following?
\(Y = X + 0.1 \times R\)
Here \(R\) is the revenue the company earns through the employee. In this case, the salary of the employee varies from month to month.
Because the revenue cannot be known in advance, the salary varies unpredictably, which is where statistics comes in. Statistics deals with randomness: situations where we cannot tell exactly which outcome we will get. We may (or may not) know the possible outcomes. When tossing a coin, for example, we know the possible outcomes but not which one will occur.
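The contrast between the two equations can be sketched in R. The figures used here (a basic salary of 50000 and monthly revenue drawn from a normal distribution) are made-up assumptions for illustration only; they do not come from the text.

```r
# Deterministic case: Y = X + 0.2 * X
basic_salary <- 50000                        # hypothetical value of X
gross_fixed  <- basic_salary + 0.2 * basic_salary
gross_fixed                                  # the same result every time

# Random case: Y = X + 0.1 * R, where R changes from month to month
set.seed(1)                                  # makes the simulated draws repeatable
revenue <- rnorm(6, mean = 200000, sd = 30000)  # hypothetical revenue for 6 months
gross_random <- basic_salary + 0.1 * revenue
round(gross_random)                          # a different salary each month
```

The first result can be computed exactly from \(X\) alone, while the second depends on a quantity that is only known after the month is over.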
1.1.4 Other Statistical Concepts
They will be briefly explained later when the relevant R code is introduced. They include various charts, measures of central tendency and dispersion, correlation, regression, and hypothesis testing, among others.
1.2 Introduction to R
1.2.1 Why R?
R is one of the most popular programming languages for statistical analysis and is also widely used for machine learning.
Reasons at a glance
- Free and Open Source Software (FOSS)
- Big Community
- Made by statisticians for statisticians
- Easy-to-use code
- Stunning graphics, esp. with ggplot2
- Reproducibility
- Runs on a wide array of platforms, including but not limited to Windows, Linux, and macOS.
1.2.2 Who Uses R?
R is used in both academia and industry.
- Analyses for theses are now often carried out in R.
- Industries heavily rely on R for statistical analysis, predictive analytics, and machine learning.
Some of the renowned companies using R are:
- Google (effective advertising and forecasting)
- Facebook (behavioral analysis)
- Twitter (sensitivity analysis)
- Microsoft
- Uber
- Airbnb
- IBM
- ANZ
- HP
- Ford
1.2.3 Who Developed R?
R was developed by Ross Ihaka and Robert Gentleman (statisticians from New Zealand and Canada, respectively).
1.2.4 Other Languages and Packages
Some other languages for data analysis are:
- Python
- Julia
- Java
- Scala
Statistical software packages
- SPSS
- Stata
- EViews
1.2.5 Installing R and RStudio
1.2.6 Start Writing R Code (Windows, Linux, and Command Line)
- Using the R console directly: write code and press Enter (NOT a good method)
- Using the RStudio console: equivalent to using the R console
- Using an R script from RStudio: to run a line, press Ctrl + Enter
It is best to use RStudio.
1.2.7 Effectively Using RStudio
- Keep things organized
- Make a project. Put all code, data, and output inside that project directory.
- Use the View function to view data tables, for example View(iris), which displays the iris data set.
1.2.8 R Script
An R script is a convenient way to organize work. A project may consist of several or many such scripts, and they can be easily shared with others. An R script has the extension .R or .r.
Tips
- Use blank lines often to separate code segments
- Add comments explaining the code, so others (including future you) can understand what it means.
1.2.8.1 Sourcing R Code from Another R File
source('r_file.R')
Thus, you can use functions, data, variables etc. defined or saved elsewhere.
1.2.9 R Documentation (Help)
To get help, type ?keyword or help(keyword). For example, ?mean shows the arguments and examples for the mean function.
1.2.10 Handling Errors
- If the code does not run and the console shows a + sign, the code is not complete yet; complete it or press Esc to start over.
- If the error message shows could not find function ..., correct the function name or load the package that provides it.
- If you do not understand the error message, copy and paste it into your browser search bar and see what help the community has to offer.
1.2.11 R Packages
R packages are extensions of base R that provide additional functionality. Many packages have made R more popular and useful, such as ggplot2, caret, and rmarkdown.
To install a package, run install.packages("package_name"); for example, install.packages("tidyverse") installs the package tidyverse. When installing, the package name must be enclosed within quotation marks (" ").
Before using a package, you must load it by running library(package_name); for example, to load ggplot2, run library(ggplot2). This time the quotation marks (" ") are not required.
1.2.12 R Mathematical Operations
1.2.12.1 Arithmetic Operators
Purpose | Operator | Example | Output |
---|---|---|---|
Addition | + | 2+3 | 5 |
Subtraction | - | 10-9 | 1 |
Multiplication | * | 10*8 | 80 |
Division | / | 10/5 | 2 |
Exponent | ^ or ** | 10^2 | 100 |
Modulus (Remainder) | %% | 10%%4 | 2 |
Integer Division | %/% | 12%/%5 | 2 |
1.2.12.2 Relational Operators
Purpose | Operator | Example | Output |
---|---|---|---|
Less than | < | 2<3 | TRUE |
Greater than | > | 10>11 | FALSE |
Less than or equal to | <= | 10<=8 | FALSE |
Greater than or equal to | >= | 10>=5 | TRUE |
Equal to | == | 10^2==100 | TRUE |
Not equal to | != | 100!=99 | TRUE |
1.2.12.3 Logical Operators
1.2.12.4 Mathematical Functions
1.2.13 Assigning Values
Variables make it easy to store values and use them later.
- To assign values to variables, you can use either = or <-, but in R, <- is preferred. In RStudio, pressing Alt + - is a handy shortcut for typing <-.
- Comments start with a hash (#).
Example
## [1] 7
## [1] 12
## [1] 9
## [1] 81
## [1] 81
## [1] 1.098612
## [1] 1.099
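The code that produced this output is not shown. One sequence of statements consistent with the printed values is sketched below; the variable names and the choice of numbers are assumptions, not the original code.

```r
x <- 7              # assign 7 to x
x                   # [1] 7
y <- 12
y                   # [1] 12
z <- 9
z                   # [1] 9
z^2                 # [1] 81
z**2                # [1] 81  (** is an alternative to ^)
log(3)              # [1] 1.098612  (natural logarithm)
round(log(3), 3)    # [1] 1.099
```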
1.2.13.1 Round, Floor, and Ceiling
Suppose we have the number 3.9856.
round rounds the number to a given number of decimal places (three here);
## [1] 3.986
ceiling moves the number up to the next integer;
## [1] 4
floor gives the previous integer.
## [1] 3
- ceiling and floor always return whole numbers.
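Calls along the following lines reproduce the three outputs above. The variable name num is an assumption made for this sketch.

```r
num <- 3.9856        # hypothetical variable name
round(num, 3)        # [1] 3.986  (second argument: number of decimal places)
ceiling(num)         # [1] 4
floor(num)           # [1] 3
```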
1.2.14 Generating Multiple Numbers
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 3 5 7 9 11 13 15 17 19
## [1] 1.00 13.25 25.50 37.75 50.00
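The statements behind these three outputs are not shown. They can be reproduced with the colon operator and seq, as in this sketch (the variable names are assumptions):

```r
one_to_ten <- 1:10                        # successive integers from 1 to 10
one_to_ten
odd_numbers <- seq(1, 19, by = 2)         # from 1 to 19 in steps of 2
odd_numbers
five_values <- seq(1, 50, length.out = 5) # five equally spaced values from 1 to 50
five_values
```

Use by when the step size is known, and length.out when the number of values is known.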
1.2.15 Data Types
- Logical
- Numeric
  - Integer
  - Double
- Character
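The type of a value can be checked with class or typeof. A short sketch, using arbitrary example values:

```r
class(TRUE)        # "logical"
class(5L)          # "integer"  (the L suffix marks an integer)
class(5)           # "numeric"
typeof(5)          # "double"   (numbers are stored as doubles by default)
class("setosa")    # "character"
is.numeric(5L)     # TRUE: integers count as numeric
```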
1.2.16 Learn More
- Stat Mania articles and link to contents
- Books
- Coursera, Edx, and other MOOCs.
1.2.17 Vector
A vector is a set of similar items. In linear algebra, it is defined as a matrix with only one column or one row. It can contain numbers, strings, or logical values.
A vector makes it easy to operate on multiple items simultaneously.
- We make a vector when we are dealing with only one variable.
- A vector can contain only one type of value, such as numeric, logical, etc.
A vector in R is usually made using c, which stands for concatenate. A vector can also be made using the seq command shown earlier, or by using a colon (:) if the values are successive integers.
x <- c(4, 5, 7)
a <- 10:12
y <- c("red", "green", "blue", "black", "orange")
z <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
1.2.17.1 Adding Vectors
If a scalar (a single value) is added to a vector, it is added to every value of the vector.
If two (or more) vectors of equal length are added together, corresponding values are added; the same goes for almost any other arithmetic operation (such as subtraction or division).
If, however, the lengths are unequal, the values of the shorter vector are repeated from the beginning, and R warns when the longer length is not a multiple of the shorter one.
## [1] 7 8 10
## [1] 14 16 19
## Warning in x + b: longer object length is not a multiple of shorter object
## length
## [1] 10 12 13
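The three results above can be reproduced as follows. The vectors x and a are the ones defined earlier; the definition of b is not shown in the text, so b <- c(6, 7) is an assumption chosen to match the printed output.

```r
x <- c(4, 5, 7)
a <- 10:12
b <- c(6, 7)    # assumed values; the original definition of b is not shown

x + 3           # scalar added to every element:  7  8 10
x + a           # element-wise addition:         14 16 19
x + b           # b is recycled as 6, 7, 6:      10 12 13 (with a warning)
```

In the last line, the third element of x is paired with the first element of b again, which is why R warns about the mismatched lengths.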
1.2.17.2 Indexing Vectors
Using []:
## [1] 4 5 7
## [1] 5
## [1] 5 7
## [1] 4 7
## [1] 5 7
## [1] 5
Using logical values:
## [1] 4 5
## [1] "red" "green" "blue" "black" "orange"
## [1] TRUE FALSE TRUE TRUE FALSE
## [1] "red" "blue" "black"
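The indexing statements themselves are not shown. The sketch below uses the vectors x, y, and z defined earlier and gives one set of calls that produces the same outputs; the exact index expressions are assumptions.

```r
x <- c(4, 5, 7)
y <- c("red", "green", "blue", "black", "orange")
z <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Indexing by position
x               # 4 5 7
x[2]            # second element: 5
x[2:3]          # second and third elements: 5 7
x[c(1, 3)]      # first and third elements: 4 7
x[-1]           # everything except the first element: 5 7
x[-c(1, 3)]     # everything except the first and third elements: 5

# Indexing by logical values
x[x < 6]        # elements for which the condition is TRUE: 4 5
y
z
y[z]            # elements of y where z is TRUE: "red" "blue" "black"
```

Note that positions in R start at 1, and a negative index drops elements instead of selecting them.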
1.2.17.3 Changing Value(s) of A Vector
1.2.17.4 Sorting
1.2.18 Matrix
A matrix is a rectangular array of similar items. Although it has both rows and columns, it can only contain items of a single type.
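A minimal sketch of creating and indexing a matrix, using made-up values and the hypothetical names m and mixed:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # values are filled column by column
m
dim(m)          # 2 3  (rows, columns)
m[2, 3]         # element in row 2, column 3: 6
m[, 1]          # the whole first column: 1 2

# Mixing types forces everything to a single type (here, character)
mixed <- matrix(c(1, "a", TRUE, 2), nrow = 2)
class(mixed[1, 1])   # "character"
```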
Contents from Jafar Sir
1.2.19 Data Frame
A data frame contains many variables, and each variable can be of a different type. Distinct variables are placed in columns, and values/observations are in rows.
Example
 | mpg | cyl | disp | hp | drat | wt |
---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 |
1.2.19.1 Making A New Data Frame
The data.frame command is used to produce a data frame.
- The length of each variable must be equal.
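As a minimal sketch, the following builds a small data frame from three vectors of equal length. The data and the names (students, name, age, passed) are invented for illustration.

```r
students <- data.frame(
  name   = c("Asha", "Bilal", "Chen"),   # character column
  age    = c(21, 23, 22),                # numeric column
  passed = c(TRUE, FALSE, TRUE)          # logical column
)

students            # print the whole table
str(students)       # show the type of each column
nrow(students)      # number of observations: 3
ncol(students)      # number of variables: 3
students$age        # extract one column as a vector
```

If the vectors had different lengths that cannot be recycled evenly, data.frame would stop with an error.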
1.2.20 List
A list can contain scalars, vectors, matrices, data frames, as well as other lists!
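A short sketch of a list holding items of different kinds; all names and values here are assumptions chosen for illustration.

```r
my_list <- list(
  number = 3.14,                       # a scalar
  vec    = c(4, 5, 7),                 # a vector
  mat    = matrix(1:4, nrow = 2),      # a matrix
  df     = head(iris, 3),              # a data frame
  inner  = list(a = 1, b = "two")      # another list
)

length(my_list)          # 5 items
my_list$vec              # access an item by name
my_list[["inner"]]$b     # access an item inside the nested list: "two"
```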
1.2.21 Functions
A function is used to
- avoid repetitive tasks and mistakes therefrom
- find values from a complicated formula
A function to compute Harmonic Mean (HM)
Formula: Reciprocal of Mean of \(\frac{1}{x_i}\)
Reciprocal of \(\frac{\frac{1}{x_1}+\frac{1}{x_2}+...+\frac{1}{x_n}}{n}\)
Thus, \(HM = \frac{n}{\sum \frac{1}{x_i}} =\frac 1 {\text{Mean of 1/x}}\)
We have x = 4, 5, 7, so \(n = 3\).
Therefore, \(HM = \frac{3}{\frac{1}{4}+\frac{1}{5}+\frac{1}{7}} = \frac{420}{83}\):
## [1] 5.060241
Since this function is actually a one-liner, we can write it as
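The function definitions are not reproduced in the text. A minimal sketch of both forms, assuming the hypothetical function names hm and hm_short, might look like this:

```r
# Step-by-step version: HM = n / sum(1 / x_i)
hm <- function(x) {
  n <- length(x)          # number of values
  n / sum(1 / x)          # reciprocal of the mean of the reciprocals
}

x <- c(4, 5, 7)
hm(x)                     # [1] 5.060241

# One-line version: reciprocal of the mean of 1/x
hm_short <- function(x) 1 / mean(1 / x)
hm_short(x)               # same value as hm(x)
```

Both versions follow the formula above; the one-liner simply lets mean take care of dividing by n.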
1.2.22 Loops (Alternatives and Comparison with Other Languages)
In R, explicit loops are used less often than in many other languages, because most operations work on whole vectors at once.
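As an illustration, the sketch below squares each element of a vector in three ways: with a for loop, with a vectorized expression, and with sapply. The variable names are assumptions made for this example.

```r
x <- c(4, 5, 7)

# 1. Explicit loop, as one would write it in many other languages
squares_loop <- numeric(length(x))       # pre-allocate the result
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# 2. Vectorized alternative: the operation applies to every element
squares_vec <- x^2

# 3. Apply-family alternative: call a function on each element
squares_apply <- sapply(x, function(v) v^2)

squares_loop     # 16 25 49
squares_vec      # 16 25 49
squares_apply    # 16 25 49
```

The vectorized form is the shortest and is usually the one preferred in R code.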