Chapter 1 Introduction to Statistics and R
1.1 Statistics
1.1.1 What is Statistics?
The term statistics has three different meanings.
 Plural of the term statistic, which refers to any function of sample values, for example, the sample mean \(\bar x = \frac{\sum_{i=1}^n x_i}{n}\)
 Table of values
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species 

5.1  3.5  1.4  0.2  setosa 
4.9  3.0  1.4  0.2  setosa 
4.7  3.2  1.3  0.2  setosa 
4.6  3.1  1.5  0.2  setosa 
5.0  3.6  1.4  0.2  setosa 
5.4  3.9  1.7  0.4  setosa 
 Technique of dealing with data
 Collection
 Organization
 Analysis (such as regression analysis)
 Interpretation
 Presentation
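The first meaning above, a statistic as a function of sample values, can be illustrated with the sample mean. The sketch below assumes that the six Sepal.Length values in the table are the first rows of R's built-in iris data set, and computes \(\bar x\) both from the formula and with the built-in mean function. The variable names are only illustrative.

```r
# First six Sepal.Length values of the built-in iris data set
x <- head(iris$Sepal.Length)   # 5.1 4.9 4.7 4.6 5.0 5.4

# Sample mean from the formula: sum of the values divided by n
n <- length(x)
x_bar <- sum(x) / n

# The built-in function computes the same statistic
x_bar
mean(x)
```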
1.1.2 Statistics vs Data Science
Statistics deals mainly with analyzing and interpreting data, while data science places more emphasis on predictive analytics.
1.1.3 Statistics vs Mathematics
Consider the following equation:
 \(Y = X + 0.2 \times X\)
Say \(X\) is the basic salary of an employee in a company, and the gross salary \(Y\) is computed by adding 20% of \(X\) to \(X\). In this scenario, given the value of \(X\), we can always tell exactly what the gross salary will be.
What if we have an equation such as the following?
\(Y = X + 0.1 \times R\)
Here, \(R\) is the revenue the company earns through the employee. In this case, the salary of the employee would vary from month to month.
Because the revenue cannot be known in advance, the salary varies unpredictably, and this is where statistics comes in. Statistics deals with randomness: situations where we cannot tell exactly which outcome we will get. We may (or may not) know the possible outcomes; when tossing a coin, for example, we know the possible outcomes but not which one will occur.
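The contrast between the two equations can be sketched in R. The basic salary of 30000 and the revenue range are hypothetical figures chosen only for illustration, and drawing the revenue from a uniform distribution is an assumption, not something the text specifies.

```r
set.seed(1)                          # arbitrary seed, only for repeatability

X <- 30000                           # hypothetical basic salary

# Deterministic case: Y is fully known once X is known
Y_fixed <- X + 0.2 * X
Y_fixed

# Random case: revenue R differs each month (assumed uniform on 0 to 50000)
R <- runif(6, min = 0, max = 50000)  # six hypothetical months
Y_random <- X + 0.1 * R
Y_random                             # a different salary every month
```

Running the first part always gives the same gross salary, while the second part gives values that change whenever the revenue changes.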
1.1.4 Other Statistical Concepts
These will be briefly explained later, when the relevant R code is introduced. They include various charts, measures of central tendency and dispersion, correlation, regression, and tests of hypothesis, among others.
1.2 Introduction to R
1.2.1 Why R?
R is one of the most popular programming languages for statistical analysis, and it is also widely used for machine learning.
Reasons at a glance
 Free and Open Source Software (FOSS)
 Big Community
 Made by statisticians for statisticians
 Easy-to-use code
 Stunning graphics, especially with ggplot2
 Reproducibility
 Runs on a wide array of platforms, including but not limited to Windows, Linux, and macOS.
1.2.2 Who Uses R?
R is used in both academia and industry.
 Many analyses for theses are now carried out in R.
 Industries heavily rely on R for statistical analysis, predictive analytics, and machine learning.
Some of the renowned companies using R are:
 Google (effective advertising and forecasting)
 Facebook (behavioral analysis)
 Twitter (sensitivity analysis)
 Microsoft
 Uber
 Airbnb
 IBM
 ANZ
 HP
 Ford
1.2.3 Who Developed R?
R was developed by Ross Ihaka and Robert Gentleman (statisticians from New Zealand and Canada, respectively).
1.2.4 Other Languages and Packages
Some other languages for data analysis are:
 Python
 Julia
 Java
 Scala
Statistical software packages:
 SPSS
 Stata
 EViews
1.2.5 Installing R and RStudio
1.2.6 Start Writing R Code (Windows, Linux, and Command Line)
 Using the R console directly: write code and press Enter (NOT a good method)
 Using the RStudio console: equivalent to using the R console
 Using an R script from RStudio: to run the current line or selection, press Ctrl + Enter
It is best to use RStudio.
1.2.7 Effectively Using RStudio
 Keep things organized
 Make a project. Put all code, data, and output inside that project directory.
 Use the View function to view data tables, for example View(iris), which displays the iris data set.
1.2.8 R Script
An R script is a convenient tool for organizing work. A project may consist of several or many such scripts, and they can be easily shared with others. An R script has the extension .R or .r.
Tips
 Use blank lines often to separate code segments
 Add comments explaining the code, so others (including future you) can understand what it means.
1.2.8.1 Using R Code from Another R File
source('r_file.R')
Thus, you can use functions, data, variables, etc. defined or saved elsewhere.
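As a minimal sketch of how source works, the code below first writes a small script to a temporary folder and then sources it. The function add_two and the use of tempdir() are hypothetical choices made so that the example is self-contained; in practice the script would already exist in the project directory.

```r
# Create a small script file (normally this file would already exist)
script_path <- file.path(tempdir(), "r_file.R")
writeLines("add_two <- function(v) v + 2", script_path)

# Run the script; everything it defines becomes available here
source(script_path)

# The function defined in the other file can now be used
add_two(5)
```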
1.2.9 R Documentation (Help)
To get help, type ?keyword or help(keyword). For example, ?mean shows the options and examples for the mean function.
1.2.10 Handling Errors
 If the code does not run and the console shows a + sign, the code is not complete yet; complete it or press Esc to start over.
 If the error message shows could not find function ..., correct the function name, or load the package that provides the function.
 If you do not understand the error message, copy and paste it into your browser search bar, and see what help the community has to offer.
1.2.11 R Packages
R packages are extensions of base R that provide additional functionality. Many packages, such as ggplot2, caret, and rmarkdown, have made R more popular and useful.
To install a package, run install.packages("package_name"); for example, install.packages("tidyverse") installs the package tidyverse. When installing, the package name must be enclosed within quotation marks (" ").
Before making use of a package, one must load it by running library(package_name); for example, to load ggplot2, run library(ggplot2). This time, quotation marks (" ") are not needed.
1.2.12 R Mathematical Operations
1.2.12.1 Arithmetic Operators
Purpose  Operator  Example  Output

Addition  +  2+3  5
Subtraction  -  10-9  1
Multiplication  *  10*8  80
Division  /  10/5  2
Exponent  ^ or **  10^2  100
Modulus (Remainder)  %%  10%%4  2
Integer Division  %/%  12%/%5  2
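The arithmetic operators can be tried directly in the console. The lines below are a small sketch using the same examples as the table, with the expected result noted in each comment.

```r
2 + 3      # addition: 5
10 - 9     # subtraction: 1
10 * 8     # multiplication: 80
10 / 5     # division: 2
10^2       # exponent: 100
10 ** 2    # alternative exponent operator: 100
10 %% 4    # modulus, the remainder of 10 divided by 4: 2
12 %/% 5   # integer division, the whole part of 12 divided by 5: 2
```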
1.2.12.2 Relational Operators
Purpose  Operator  Example  Output

Less than  <  2<3  TRUE
Greater than  >  10>11  FALSE
Less than or equal to  <=  10<=8  FALSE
Greater than or equal to  >=  10>=5  TRUE
Equal to  ==  10^2==100  TRUE
Not equal to  !=  100!=99  TRUE
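Relational operators always return logical values (TRUE or FALSE). The sketch below runs the same comparisons as the table; the last line is an extra illustration, not taken from the table, showing that a comparison is applied to each element of a vector.

```r
2 < 3          # TRUE
10 > 11        # FALSE
10 <= 8        # FALSE
10 >= 5        # TRUE
10^2 == 100    # TRUE
100 != 99      # TRUE

# A comparison against a vector is done element by element
c(1, 5, 10) > 4   # FALSE TRUE TRUE
```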
1.2.12.3 Logical Operators
1.2.12.4 Mathematical Functions
1.2.13 Assigning Values
Variables make it easy to assign values and use them later.
 To assign values to variables, you can use either = or <-, but in R, <- is preferred. In RStudio, pressing Alt + - (Alt and the minus key) is a very good shortcut for correctly typing <-.
 Comments start with a hash (#).
Example
## [1] 7
## [1] 12
## [1] 9
## [1] 81
## [1] 81
## [1] 1.098612
## [1] 1.099
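The code that produced the output above is not reproduced in this text. The following is a sketch of one sequence of assignments and expressions that gives the same seven values; the variable names x, y, and z and the starting values 3 and 4 are assumptions.

```r
# Assign values to variables (assumed values)
x <- 3
y <- 4

x + y              # 7
x * y              # 12

z <- x^2           # store the result in a new variable
z                  # 9
z^2                # 81
z ** 2             # 81, same as z^2

log(x)             # natural logarithm of 3: 1.098612
round(log(x), 3)   # rounded to three decimal places: 1.099
```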
1.2.13.1 Round, Floor, and Ceiling
Suppose we have the number 3.9856.
round rounds the number to a given number of decimal places (three in this case);
## [1] 3.986
ceiling gives the next integer;
## [1] 4
floor gives the previous integer.
## [1] 3

ceiling and floor always give whole-number output.
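The three functions can be called as follows. This is a minimal sketch; the variable name num is an assumption, and round is given 3 as its second argument because the output shown has three decimal places.

```r
num <- 3.9856

round(num, 3)   # round to three decimal places: 3.986
ceiling(num)    # next whole number: 4
floor(num)      # previous whole number: 3
```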
1.2.14 Generating Multiple Numbers
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 3 5 7 9 11 13 15 17 19
## [1] 1.00 13.25 25.50 37.75 50.00
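The code for this section is not shown in the text. The three outputs above can be produced with the colon operator and the seq function, as in the sketch below; the exact arguments are assumptions chosen to match the printed values.

```r
# Successive integers from 1 to 10
1:10

# From 1 to 19 in steps of 2
seq(from = 1, to = 19, by = 2)

# Five equally spaced numbers from 1 to 50
seq(from = 1, to = 50, length.out = 5)
```

Use by when the step size is known and length.out when the number of values is known.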
1.2.15 Data Types
 Logical
 Numeric
 Integer
 Double
 Character
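The type of a value can be checked with typeof. The values below are arbitrary examples; note that a plain number such as 5 is stored as a double, and the suffix L is needed to get an integer.

```r
typeof(TRUE)     # "logical"
typeof(5L)       # "integer"
typeof(5)        # "double"
typeof("five")   # "character"

# Integer and double are both numeric
is.numeric(5L)   # TRUE
is.numeric(5)    # TRUE
```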
1.2.16 Learn More
 Stat Mania articles and links to contents
 Books
 Coursera, edX, and other MOOCs.
1.2.17 Vector
A vector is a set of similar items. In linear algebra, it is defined as a matrix with only one column or one row. In R, a vector can hold numbers, strings, or logical values.
A vector makes it easy to operate on multiple items simultaneously.
 We make a vector when we are dealing with only one variable.
 A vector can contain only one type of value, such as numeric or logical.
A vector in R is usually made using c, which stands for concatenate. A vector can also be made using the seq command shown earlier, or by using the colon (:) operator if the values are successive integers.
x <- c(4, 5, 7)
a <- 10:12
y <- c("red", "green", "blue", "black", "orange")
z <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
1.2.17.1 Adding Vectors
If a scalar (a single value) is added to a vector, it is added to every value of the vector.
If two (or more) vectors with equal lengths are added together, corresponding values are added; the same goes for almost any other mathematical operation (such as subtraction or division).
If, however, the lengths are unequal, the values of the shorter vector are repeated from the beginning, and R gives a warning when the longer length is not a multiple of the shorter one.
## [1] 7 8 10
## [1] 14 16 19
## Warning in x + b: longer object length is not a multiple of shorter object
## length
## [1] 10 12 13
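The three cases can be reproduced with the sketch below. The vectors x and a are the ones defined earlier; the scalar 3 and the vector b are assumptions, chosen because they give the same values as the output above.

```r
x <- c(4, 5, 7)
a <- 10:12
b <- c(6, 7)   # assumed shorter vector of length 2

# Scalar added to every element
x + 3          # 7 8 10

# Equal lengths: element-by-element addition
x + a          # 14 16 19

# Unequal lengths: b is reused from the start (6, 7, 6), with a warning
x + b          # 10 12 13
```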
1.2.17.2 Indexing Vectors
Using []:
## [1] 4 5 7
## [1] 5
## [1] 5 7
## [1] 4 7
## [1] 5 7
## [1] 5
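The indexing code itself is not shown in the text. The sketch below is one set of expressions on the vector x that returns the same six outputs; positive numbers select positions, and negative numbers drop them.

```r
x <- c(4, 5, 7)

x             # the whole vector: 4 5 7
x[2]          # second element: 5
x[2:3]        # second to third elements: 5 7
x[c(1, 3)]    # first and third elements: 4 7
x[-1]         # everything except the first element: 5 7
x[-c(1, 3)]   # everything except the first and third elements: 5
```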
Using logical values:
## [1] 4 5
## [1] "red" "green" "blue" "black" "orange"
## [1] TRUE FALSE TRUE TRUE FALSE
## [1] "red" "blue" "black"
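Logical indexing keeps the elements where the index is TRUE. The sketch below uses the vectors x, y, and z defined earlier; the condition x < 7 is an assumption that matches the first output shown.

```r
x <- c(4, 5, 7)
y <- c("red", "green", "blue", "black", "orange")
z <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Keep only the values of x that satisfy a condition
x[x < 7]   # 4 5

y
z

# Keep the elements of y at the positions where z is TRUE
y[z]       # "red" "blue" "black"
```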
1.2.17.3 Changing Value(s) of A Vector
1.2.17.4 Sorting
1.2.18 Matrix
A matrix is a rectangular array of similar items. Although it has two dimensions (rows and columns), it can only contain items of a single type.
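A minimal sketch of a matrix in R, using the matrix function. The values and the names m and mixed are arbitrary; the last two lines illustrate the single-type rule, since mixing types converts everything to character.

```r
# A 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m

dim(m)      # number of rows and columns: 2 3
m[2, 3]     # element in row 2, column 3: 6

# Mixing types forces all items into a single type
mixed <- matrix(c(1, "a", TRUE, 2), nrow = 2)
typeof(mixed)   # "character"
```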
Contents from Jafar Sir
1.2.19 Data Frame
A data frame contains many variables, and each variable can be of a different type. Distinct variables are placed in columns, and values (observations) are in rows.
Example
mpg  cyl  disp  hp  drat  wt  

Mazda RX4  21.0  6  160.0  110  3.90  2.620 
Mazda RX4 Wag  21.0  6  160.0  110  3.90  2.875 
Datsun 710  22.8  4  108.0  93  3.85  2.320 
Hornet 4 Drive  21.4  6  258.0  110  3.08  3.215 
Hornet Sportabout  18.7  8  360.0  175  3.15  3.440 
Valiant  18.1  6  225.0  105  2.76  3.460 
Duster 360  14.3  8  360.0  245  3.21  3.570 
Merc 240D  24.4  4  146.7  62  3.69  3.190 
Merc 230  22.8  4  140.8  95  3.92  3.150 
Merc 280  19.2  6  167.6  123  3.92  3.440 
1.2.19.1 Making A New Data Frame
The data.frame command is used to produce a data frame.
 The length of each variable must be equal.
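A short sketch of data.frame in use. The data set students and its values are hypothetical. The last lines show what happens when the variables have different lengths, with tryCatch used only to capture the error instead of stopping the script.

```r
# Three variables of different types, each of length 3 (hypothetical data)
students <- data.frame(
  name   = c("Asha", "Ben", "Chen"),
  age    = c(21, 23, 22),
  passed = c(TRUE, FALSE, TRUE)
)

students
str(students)     # structure: type of each column
nrow(students)    # 3 observations
ncol(students)    # 3 variables

# Variables of lengths 3 and 2 cannot be combined
bad <- tryCatch(data.frame(a = 1:3, b = 1:2), error = function(e) e)
inherits(bad, "error")   # TRUE
```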
1.2.20 List
A list can contain scalars, vectors, matrices, data frames, as well as other lists!
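A minimal sketch of a list holding items of different kinds. The name my_list and its element names are hypothetical; mtcars is a data set built into R.

```r
my_list <- list(
  number = 3.14,                      # a scalar
  vec    = c(4, 5, 7),                # a vector
  mat    = matrix(1:4, nrow = 2),     # a matrix
  df     = head(mtcars, 2),           # a data frame
  nested = list(a = 1, b = "two")     # another list
)

names(my_list)

# Elements can be accessed by name with $ or [[ ]]
my_list$vec
my_list[["nested"]]$b
```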
1.2.21 Functions
A function is used to
 avoid repetitive tasks and the mistakes that come with them
 find values from a complicated formula
A function to compute the harmonic mean (HM)
Formula: the reciprocal of the mean of \(\frac{1}{x_i}\), that is,
the reciprocal of \(\frac{\frac{1}{x_1}+\frac{1}{x_2}+\dots+\frac{1}{x_n}}{n}\)
Thus, \(HM = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}} = \frac{1}{\text{Mean of } 1/x}\)
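We have x = 4, 5, 7, so \(n = 3\) and \(\sum \frac{1}{x_i} = \frac{1}{4}+\frac{1}{5}+\frac{1}{7} = \frac{83}{140}\).
Therefore, \(HM = \frac{3}{83/140} = \frac{420}{83}\):
## [1] 5.060241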
Since this function is actually a one-liner, it can also be written on a single line.
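The function definitions are not reproduced in this text. The following is a minimal sketch of how the harmonic mean formula could be written as an R function, in a step-by-step form and in a one-line form; the names hm and hm_short are assumptions.

```r
# Step-by-step version: n divided by the sum of reciprocals
hm <- function(x) {
  n <- length(x)
  reciprocal_sum <- sum(1 / x)
  n / reciprocal_sum
}

# One-line version: reciprocal of the mean of 1/x
hm_short <- function(x) 1 / mean(1 / x)

x <- c(4, 5, 7)
hm(x)         # equals 420/83
hm_short(x)   # same value
```

Both versions implement the same formula, so they return the same result for any vector of non-zero numbers.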
1.2.22 Loops (Alternatives and Comparison with Other Languages)
In R, explicit loops are rarely needed, because most operations work on a whole vector at once.
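As a small illustration, the sketch below squares each element of a vector in three ways: with a for loop, with a vectorized operation, and with sapply. The vector and variable names are arbitrary; all three give the same result, but the vectorized form is the shortest.

```r
x <- c(4, 5, 7)

# With a for loop, as in many other languages
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized: the operation is applied to every element at once
squares_vec <- x^2

# sapply applies a function to each element
squares_apply <- sapply(x, function(v) v^2)

squares_loop
identical(squares_loop, squares_vec)   # TRUE
```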