Quantitative Data Analysis
Author: Nikhil Raj • May 1, 2018 • Case Study
Department of Computer Science
CS5606 Quantitative Data Analysis
Coursework 2017/18
Submitted by: Nikhil Raj
Student id: 1725485
Contents
1.0 Task -1 Description:
2.0 Scope:
3.0 Missing data:
4.0 Variables:
5.0 Relationships between variables and dependencies
6.0 Task 2: The Chi Squared Test
7.0 Task 3: Descriptive Analysis:
8.0 References:
1.0 Task -1 Description:
The data set used for this exercise is called Orange Juice (OJ) and contains 1070 purchases in which the customer bought either Citrus Hill or Minute Maid orange juice. Several characteristics of the customer and product are recorded for each purchase.
2.0 Scope:
The ISLR package was installed and loaded to make the OJ data set available in R, using the following commands:
- install.packages("ISLR")
- library("ISLR")
As only 10 variables are to be included in this report, a subset of the OJ data, called oj, was created using the following command:
- oj <- OJ[, 1:10]
Only 10 of the 18 variables provided in the data set are used for this exercise. The data set contains 1070 observations.
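The figures quoted above can be confirmed with a quick check on the subset. The following is a minimal sketch rather than part of the original analysis; it assumes the ISLR package is already installed and that the first 10 columns of OJ are the ones wanted.

```r
# Sketch: confirm the size of the full data set and of the 10-variable subset.
# Assumes the ISLR package has already been installed.
library(ISLR)

oj <- OJ[, 1:10]   # keep all rows and the first 10 columns

dim(OJ)            # rows and columns in the full data set
dim(oj)            # rows and columns in the subset
names(oj)          # names of the 10 retained variables
```

The comma in OJ[, 1:10] makes it explicit that columns, not rows, are being selected.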
3.0 Missing data:
Before starting the assigned tasks, the OJ data set was checked for missing data and other anomalies. This is an important step in any data analysis, since missing data can lead to biased outputs or estimates. Agresti (2007) explains that analyses involving missing data should be carried out with due caution. As a bare minimum, the results obtained using all available clusters should be compared with those obtained using only the clusters with no missing observations. If the results differ substantially, only tentative conclusions should be drawn until the reasons for the missing data have been identified and studied.
The missing data check was run using the command below, which writes the distinct rows of the missing-value indicator matrix to a file. If the only row in the output contains FALSE for every variable, no value is missing. The output is shown in Table 3.1.
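The comparison Agresti recommends can be sketched as follows. This is an illustration only: compare_estimates is a hypothetical helper name, each row is treated as the unit of analysis (a simplification of the cluster setting), and the built-in airquality data set is used purely because it is known to contain missing values.

```r
# Sketch: compare an estimate based on all available values with the same
# estimate based on complete rows only.
# `compare_estimates` is a hypothetical helper, not part of any package.
compare_estimates <- function(df, column) {
  stopifnot(is.data.frame(df), column %in% names(df))
  complete <- df[complete.cases(df), , drop = FALSE]  # rows with no NA at all
  c(
    available_mean = mean(df[[column]], na.rm = TRUE),
    complete_mean  = mean(complete[[column]]),
    available_n    = sum(!is.na(df[[column]])),
    complete_n     = nrow(complete)
  )
}

# airquality ships with base R and has gaps in some columns.
compare_estimates(airquality, "Ozone")
```

If the two means differ substantially, conclusions based on either should be treated as tentative.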
- write.csv(unique(is.na(oj)), file = "missingdata.csv")
[pic 1]
Table 3.1
Since the check found no missing values, the data set can be treated as complete and used for statistical analysis without missing data affecting the results.
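A per-variable count of missing values is another way to reach the same conclusion without writing a file. The sketch below is an alternative to the command used in this report, not a record of what was run; count_missing is a hypothetical helper name and toy is a made-up data frame included only to show the shape of the output.

```r
# Sketch: count missing values per variable in a data frame.
# `count_missing` is a hypothetical helper, not part of any package.
count_missing <- function(df) {
  stopifnot(is.data.frame(df))
  colSums(is.na(df))   # one count per column
}

# Made-up example with one gap in each column.
toy <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
count_missing(toy)
anyNA(toy)             # TRUE when at least one value is missing

# For the subset used in this report, the equivalent calls would be:
# count_missing(oj)    # all zeros if nothing is missing
# anyNA(oj)            # FALSE if nothing is missing
```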
4.0 Variables:
The subset includes the 10 variables to be used for this exercise, each with its own data type. To get a high-level overview of all the variables and to summarise their values, the summary command was used to produce the table below.
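The overview described above could be produced along the following lines. This is a minimal sketch that assumes the ISLR package is installed and that oj is the 10-variable subset created earlier; the exact call used for the report's table is not shown in the source.

```r
# Sketch: inspect the data type and a summary of each variable.
# Assumes the ISLR package has already been installed.
library(ISLR)
oj <- OJ[, 1:10]

sapply(oj, class)  # data type of each variable
str(oj)            # compact view of the structure and first few values
summary(oj)        # level counts for factors; quartiles and mean for numerics
```

Factor variables are summarised by the number of observations at each level, while numeric variables are summarised by their minimum, quartiles, mean and maximum.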
...