2 Warm-up exercise
Loftus, S. C. (2021). Basic Statistics with R: Reaching Decisions with Data. Retrieved from https://books.google.hu/books?id=vTASEAAAQBAJ
2.1 Data structures
2.1.1 Problems
Consider the following set of attributes about the American Film Institute’s top five movies from its 2007 list.
- What code would you use to create a vector named `Movie` with the values Citizen Kane, The Godfather, Casablanca, Raging Bull, and Singing in the Rain? (Hints: `object <- c()`, working with character data in R)
- What code would you use to create a vector named `Year`, giving the years in which the movies in Problem 1 were made, with the values 1941, 1972, 1942, 1980, and 1952?
- What code would you use to create a vector named `RunTime`, giving the run times in minutes of the movies in Problem 1, with the values 119, 177, 102, 129, and 103?
- What code would you use to find the run times of the movies in hours and save them in a vector called `RunTimeHours`? (Hints: numeric transformation)
- What code would you use to create a data frame named `MovieInfo` containing the vectors created in Problem 1, Problem 2, and Problem 3? (Hints: `data.frame()`)
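One possible way to approach these five problems is sketched below. It uses only the values given in the problem statements; the call to `str()` at the end is an addition for inspecting the result and is not part of the exercise.

```r
# Character vector of movie titles (quoted strings, combined with c())
Movie <- c("Citizen Kane", "The Godfather", "Casablanca",
           "Raging Bull", "Singing in the Rain")

# Numeric vectors: year made and run time in minutes
Year <- c(1941, 1972, 1942, 1980, 1952)
RunTime <- c(119, 177, 102, 129, 103)

# Arithmetic on a vector is element-wise, so dividing by 60
# converts every run time from minutes to hours
RunTimeHours <- RunTime / 60

# Combine the three equal-length vectors into a data frame,
# one column per vector, named after the vectors
MovieInfo <- data.frame(Movie, Year, RunTime)

# Inspect the structure of the result (added for illustration)
str(MovieInfo)
```

Because `data.frame()` takes its column names from the vector names, no extra naming step is needed here.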
2.2 Manipulation
2.2.1 Problems
Suppose we have the following data frame named `Colleges` (download here):
College | Employees | TopSalary | MedianSalary |
---|---|---|---|
William and Mary | 2104 | 425000 | 56496 |
Christopher Newport | 922 | 381486 | 47895 |
George Mason | 4043 | 536714 | 63029 |
James Madison | 2833 | 428400 | 53080 |
Longwood | 746 | 328268 | 52000 |
Norfolk State | 919 | 295000 | 49605 |
Old Dominion | 2369 | 448272 | 54416 |
Radford | 1273 | 312080 | 51000 |
Mary Washington | 721 | 449865 | 53045 |
Virginia | 7431 | 561099 | 60048 |
Virginia Commonwealth | 5825 | 503154 | 55000 |
Virginia Military Institute | 550 | 364269 | 44999 |
Virginia Tech | 7303 | 500000 | 51656 |
Virginia State | 761 | 356524 | 55925 |
- What code would you use to select the first, third, tenth, and twelfth entries in the `TopSalary` vector from the `Colleges` data frame? (Hints: indexing with the `[]` operator)
- What code would you use to select the elements of the `MedianSalary` vector where the `TopSalary` is greater than $400,000? (Hints: `d$MedianSalary[d$TopSalary > 400000]`)
- What code would you use to select the rows of the data frame for colleges with 1,000 or fewer employees? (Hints: `d[condition, ]`)
- What code would you use to select a sample of 5 colleges from this data frame (there are 14 rows)? (Hints: `d[sample(x = 1:14, size = 5, replace = FALSE), ]`)
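The hints above can be tried out with the sketch below. Since the download location is not given here, the data frame is rebuilt by hand from the table; the object names `top_four`, `median_high`, `small`, and `picked` are illustrative choices, and `set.seed()` is added only so that the random sample is reproducible.

```r
# Rebuild the table as a data frame (values copied from the table above)
Colleges <- data.frame(
  College = c("William and Mary", "Christopher Newport", "George Mason",
              "James Madison", "Longwood", "Norfolk State", "Old Dominion",
              "Radford", "Mary Washington", "Virginia",
              "Virginia Commonwealth", "Virginia Military Institute",
              "Virginia Tech", "Virginia State"),
  Employees = c(2104, 922, 4043, 2833, 746, 919, 2369,
                1273, 721, 7431, 5825, 550, 7303, 761),
  TopSalary = c(425000, 381486, 536714, 428400, 328268, 295000, 448272,
                312080, 449865, 561099, 503154, 364269, 500000, 356524),
  MedianSalary = c(56496, 47895, 63029, 53080, 52000, 49605, 54416,
                   51000, 53045, 60048, 55000, 44999, 51656, 55925),
  stringsAsFactors = FALSE
)

# Positional indexing: entries 1, 3, 10, and 12 of TopSalary
top_four <- Colleges$TopSalary[c(1, 3, 10, 12)]

# Logical indexing: keep MedianSalary where TopSalary exceeds 400000
median_high <- Colleges$MedianSalary[Colleges$TopSalary > 400000]

# Row selection: the condition goes before the comma, all columns are kept
small <- Colleges[Colleges$Employees <= 1000, ]

# Random sample of 5 rows without replacement
set.seed(1)  # illustrative seed, only for reproducibility
picked <- Colleges[sample(x = 1:nrow(Colleges), size = 5, replace = FALSE), ]
```

Using `nrow(Colleges)` instead of the literal 14 keeps the sampling line valid if rows are added or removed.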
Suppose we have the following data frame named Countries (download here):
Nation | Region | Population | PctIncrease | GDPcapita |
---|---|---|---|---|
China | Asia | 1409517397 | 0.4 | 8582 |
India | Asia | 1339180127 | 1.1 | 1852 |
United States | North America | 324459463 | 0.7 | 57467 |
Indonesia | Asia | 263991379 | 1.1 | 3895 |
Brazil | South America | 209288278 | 0.8 | 10309 |
Pakistan | Asia | 197015955 | 2.0 | 1629 |
Nigeria | Africa | 190886311 | 2.6 | 2640 |
Bangladesh | Asia | 164669751 | 1.1 | 1524 |
Russia | Europe | 143989754 | 0.0 | 10248 |
Mexico | North America | 129163276 | 1.3 | 8562 |
- What code would you use to select the rows of the data frame that have GDP per capita less than 10000 and are not in the Asia region?
- What code would you use to select a sample of three nations from this data frame (there are 10 rows)?
- What code would you use to select which nations saw a population percent increase greater than 1.5%?
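A minimal sketch of how these selections might be written, assuming the data frame is rebuilt by hand from the table above. The result names `low_gdp_non_asia`, `three_nations`, and `fast_growing` are illustrative, not part of the exercise.

```r
# Rebuild the table as a data frame (values copied from the table above)
Countries <- data.frame(
  Nation = c("China", "India", "United States", "Indonesia", "Brazil",
             "Pakistan", "Nigeria", "Bangladesh", "Russia", "Mexico"),
  Region = c("Asia", "Asia", "North America", "Asia", "South America",
             "Asia", "Africa", "Asia", "Europe", "North America"),
  Population = c(1409517397, 1339180127, 324459463, 263991379, 209288278,
                 197015955, 190886311, 164669751, 143989754, 129163276),
  PctIncrease = c(0.4, 1.1, 0.7, 1.1, 0.8, 2.0, 2.6, 1.1, 0.0, 1.3),
  GDPcapita = c(8582, 1852, 57467, 3895, 10309, 1629, 2640, 1524, 10248, 8562),
  stringsAsFactors = FALSE
)

# Two conditions combined with & (element-wise "and"); != means "not equal"
low_gdp_non_asia <- Countries[Countries$GDPcapita < 10000 &
                                Countries$Region != "Asia", ]

# Random sample of three rows without replacement
three_nations <- Countries[sample(x = 1:nrow(Countries), size = 3), ]

# Names of the nations whose percent increase exceeds 1.5
fast_growing <- Countries$Nation[Countries$PctIncrease > 1.5]
```

Note that `&` compares the two conditions row by row, whereas `&&` would only produce a single `TRUE` or `FALSE` and is not suitable for filtering rows.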
Suppose we have the following data frame named Olympics (download here):
Year | Type | Host | Competitors | Events | Nations | Leader |
---|---|---|---|---|---|---|
1992 | Summer | Spain | 9356 | 257 | 169 | Unified Team |
1992 | Winter | France | 1801 | 57 | 64 | Germany |
1994 | Winter | Norway | 1737 | 61 | 67 | Russia |
1996 | Summer | United States | 10318 | 271 | 197 | United States |
1998 | Winter | Japan | 2176 | 68 | 72 | Germany |
2000 | Summer | Australia | 10651 | 300 | 199 | United States |
2002 | Winter | United States | 2399 | 78 | 78 | Norway |
2004 | Summer | Greece | 10625 | 301 | 201 | United States |
2006 | Winter | Italy | 2508 | 84 | 80 | Germany |
2008 | Summer | China | 10942 | 302 | 204 | China |
2010 | Winter | Canada | 2566 | 86 | 82 | Canada |
2012 | Summer | United Kingdom | 10768 | 302 | 204 | United States |
2014 | Winter | Russia | 2873 | 98 | 88 | Russia |
2016 | Summer | Brazil | 11238 | 306 | 207 | United States |
2018 | Winter | South Korea | 2922 | 102 | 92 | Norway |
- What code would you use to select the rows of the data frame where the host nation was also the medal leader?
- What code would you use to select the rows of the data frame where the number of competitors per event is greater than 35?
- What code would you use to select the rows of the data frame where the number of competing nations in the Winter Olympics is at least 80?
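These three selections compare columns with each other or compute a value from two columns before filtering. The sketch below is one way to write them, assuming the data frame is rebuilt by hand from the table above; the result names are illustrative.

```r
# Rebuild the table as a data frame (values copied from the table above)
Olympics <- data.frame(
  Year = c(1992, 1992, 1994, 1996, 1998, 2000, 2002, 2004,
           2006, 2008, 2010, 2012, 2014, 2016, 2018),
  Type = c("Summer", "Winter", "Winter", "Summer", "Winter", "Summer",
           "Winter", "Summer", "Winter", "Summer", "Winter", "Summer",
           "Winter", "Summer", "Winter"),
  Host = c("Spain", "France", "Norway", "United States", "Japan",
           "Australia", "United States", "Greece", "Italy", "China",
           "Canada", "United Kingdom", "Russia", "Brazil", "South Korea"),
  Competitors = c(9356, 1801, 1737, 10318, 2176, 10651, 2399, 10625,
                  2508, 10942, 2566, 10768, 2873, 11238, 2922),
  Events = c(257, 57, 61, 271, 68, 300, 78, 301,
             84, 302, 86, 302, 98, 306, 102),
  Nations = c(169, 64, 67, 197, 72, 199, 78, 201,
              80, 204, 82, 204, 88, 207, 92),
  Leader = c("Unified Team", "Germany", "Russia", "United States",
             "Germany", "United States", "Norway", "United States",
             "Germany", "China", "Canada", "United States", "Russia",
             "United States", "Norway"),
  stringsAsFactors = FALSE  # keep Host and Leader as plain character columns
)

# Compare two columns row by row
host_led <- Olympics[Olympics$Host == Olympics$Leader, ]

# Filter on a computed ratio of two columns
crowded <- Olympics[Olympics$Competitors / Olympics$Events > 35, ]

# Combine a category condition with a numeric one
big_winter <- Olympics[Olympics$Type == "Winter" & Olympics$Nations >= 80, ]
```

Setting `stringsAsFactors = FALSE` matters for the first selection: comparing two factor columns with different level sets using `==` raises an error, while comparing character columns does not.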
2.3 Packages
2.3.1 Problems
- Install the Ecdat package. (Hints: `install.packages()`)
- Say that we previously installed the Ecdat package and want to load it to access its datasets. What code would we use to load the package? (Hints: `library()`)
- Say that we then want to use the dataset `Diamond` from the Ecdat package. What code would we use to load this dataset into R? (Hints: `data()`)
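The three steps can be sketched as follows. The `requireNamespace()` guard is an addition that avoids reinstalling the package on every run; installation needs network access and a configured CRAN mirror.

```r
# Step 1: install once per R installation (skipped if already present)
if (!requireNamespace("Ecdat", quietly = TRUE)) {
  install.packages("Ecdat")
}

# Step 2: attach the package in the current session
library(Ecdat)

# Step 3: load the Diamond dataset into the workspace
data(Diamond)

# Look at the first rows to confirm it loaded (added for illustration)
head(Diamond)
```

Installation is done once, but `library()` has to be called again in every new R session.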
2.4 Frequency and numerical exploratory analyses
2.4.1 Problems
Load the `leuk` dataset from the MASS package. This dataset records the survival times (`time`), white blood cell count (`wbc`), and the presence of a morphologic characteristic of white blood cells (`ag`).
- Generate the frequency table for the presence of the morphologic characteristic.
- Find the median and mean for survival time.
- Find the range, IQR, variance, and standard deviation for white blood cell count.
- Find the correlation between white blood cell count and survival time.
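A minimal sketch of these summaries using base R functions. The names on the left-hand side of each assignment are illustrative; no output values are shown, so run the code to see the results.

```r
library(MASS)  # ships with standard R installations
data(leuk)

# Frequency table of the morphologic characteristic
ag_freq <- table(leuk$ag)

# Centre of the survival times
time_median <- median(leuk$time)
time_mean <- mean(leuk$time)

# Spread of the white blood cell count
wbc_range <- range(leuk$wbc)   # returns c(minimum, maximum)
wbc_width <- diff(wbc_range)   # maximum minus minimum, as a single number
wbc_iqr <- IQR(leuk$wbc)
wbc_var <- var(leuk$wbc)
wbc_sd <- sd(leuk$wbc)

# Pearson correlation between white blood cell count and survival time
wbc_time_cor <- cor(leuk$wbc, leuk$time)

ag_freq
c(median = time_median, mean = time_mean)
c(width = wbc_width, IQR = wbc_iqr, variance = wbc_var, sd = wbc_sd)
wbc_time_cor
```

`range()` returns the two endpoints rather than their difference, which is why `diff()` is applied when a single number is wanted.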
Load the `survey` dataset from the MASS package. This dataset contains the survey responses of a class of college students.
- Create the contingency table of whether or not the student smoked (`Smoke`) and the student’s exercise regimen (`Exer`). (Hints: `table()`, `DescTools::Desc()`)
- Find the mean and median of the student’s heart rate (`Pulse`). (Hints: `summary()`, `DescTools::Desc()`, `psych::describe()`)
- Find the range, IQR, variance, and standard deviation for student age (`Age`).
- Find the correlation between the span of the student’s writing hand (`Wr.Hnd`) and nonwriting hand (`NW.Hnd`). (Hints: `cor()`, `DescTools::Desc()`)
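The `survey` data contain missing values, so the base R functions need to be told how to handle them. The sketch below is one way to do that with `na.rm = TRUE` and `use = "complete.obs"`; it does not use the DescTools or psych helpers from the hints, and the result names are illustrative.

```r
library(MASS)  # ships with standard R installations
data(survey)

# Two-way contingency table; rows with a missing value in either
# variable are left out of the counts by default
smoke_exer <- table(survey$Smoke, survey$Exer)

# na.rm = TRUE drops missing values before computing the statistic
pulse_mean <- mean(survey$Pulse, na.rm = TRUE)
pulse_median <- median(survey$Pulse, na.rm = TRUE)

# Spread of student age
age_range <- range(survey$Age, na.rm = TRUE)
age_iqr <- IQR(survey$Age, na.rm = TRUE)
age_var <- var(survey$Age, na.rm = TRUE)
age_sd <- sd(survey$Age, na.rm = TRUE)

# cor() uses the "use" argument instead of na.rm;
# "complete.obs" keeps only rows where both spans are present
hand_cor <- cor(survey$Wr.Hnd, survey$NW.Hnd, use = "complete.obs")

smoke_exer
c(mean = pulse_mean, median = pulse_median)
hand_cor
```

Without `na.rm = TRUE`, `mean()` and `median()` return `NA` as soon as a single value is missing.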
Load the `Housing` dataset from the Ecdat package. This dataset looks at the variables that affect the sales price of houses.
- Create the contingency table of whether or not the house has a recreation room (`recroom`) and whether or not the house has a full basement (`fullbase`).
- Find the mean and median of the house’s lot size (`lotsize`).
- Find the range, IQR, variance, and standard deviation for the sales price (`price`).
- Find the correlation between the sales price of the house (`price`) and the number of bedrooms (`bedrooms`).
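As a variation on the same techniques, the sketch below uses `with()` to avoid repeating the data frame name and applies several spread functions in one pass with `sapply()`. It assumes the Ecdat package is installed; the result names are illustrative.

```r
library(Ecdat)  # assumed to be installed already
data(Housing)

# with() evaluates the expression inside the data frame,
# so columns can be referred to by name alone
room_base <- with(Housing, table(recroom, fullbase))

# summary() reports the mean and median alongside the quartiles
lot_summary <- summary(Housing$lotsize)

# Apply a named list of functions to the same column in one pass
price_spread <- sapply(list(IQR = IQR, variance = var, sd = sd),
                       function(f) f(Housing$price))
price_range <- range(Housing$price)

# Correlation between price and number of bedrooms
price_bed_cor <- with(Housing, cor(price, bedrooms))

room_base
lot_summary
price_range
price_spread
price_bed_cor
```

The `sapply()` call returns a named numeric vector, which keeps the three spread measures together in one object.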
2.5 Graphical exploratory analyses
Load the `Star` dataset from the Ecdat package. This dataset looks at the effect of class sizes on student learning.
- Generate the scatterplot of the student’s math score `tmathssk` and reading score `treadssk`. (Hints: `plot()`, `ggplot() + geom_point()`)
- Generate the histogram of the years of teaching experience `totexpk`. (Hints: `hist()`, `ggplot() + geom_histogram()`)
- Create a new variable in the `Star` dataset called `totalscore` that is the sum of the student’s math score `tmathssk` and reading score `treadssk`. (Hints: transformation)
- Generate a boxplot of the student’s total score `totalscore` split out by the class size type `classk`. (Hints: `boxplot()`, `ggplot() + geom_boxplot()`)
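A minimal sketch of the four steps using the base graphics functions from the hints. It assumes the Ecdat package is installed; the axis labels and titles are illustrative choices.

```r
library(Ecdat)  # assumed to be installed already
data(Star)

# Scatterplot: math score on the x-axis, reading score on the y-axis
plot(Star$tmathssk, Star$treadssk,
     xlab = "Math score", ylab = "Reading score")

# Histogram of years of teaching experience
hist(Star$totexpk,
     xlab = "Years of teaching experience", main = "Teaching experience")

# New column: element-wise sum of the two score columns
Star$totalscore <- Star$tmathssk + Star$treadssk

# Boxplot with the formula interface: outcome ~ grouping variable
boxplot(totalscore ~ classk, data = Star,
        xlab = "Class size type", ylab = "Total score")
```

The new column has to be created before the boxplot call, because the formula looks up `totalscore` inside `Star`.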
Load the `survey` dataset from the MASS package. This dataset contains the survey responses of a class of college students.
- Generate the scatterplot of the student’s height `Height` and writing hand span `Wr.Hnd`.
- Generate the histogram of student age `Age`.
- Generate a boxplot of the student’s heart rate `Pulse` split out by the student’s exercise regimen `Exer`.
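These plots can also be drawn with ggplot2, the alternative named in the hints of the previous problem set. The sketch below assumes ggplot2 is installed; the object names and the choice of 20 histogram bins are illustrative.

```r
library(MASS)     # provides the survey data
library(ggplot2)  # assumed to be installed already
data(survey)

# Scatterplot: map the two variables to x and y, then add points
p_scatter <- ggplot(survey, aes(x = Height, y = Wr.Hnd)) +
  geom_point()

# Histogram: only x is mapped; bins = 20 is an arbitrary choice
p_hist <- ggplot(survey, aes(x = Age)) +
  geom_histogram(bins = 20)

# Boxplot: grouping variable on x, numeric variable on y
p_box <- ggplot(survey, aes(x = Exer, y = Pulse)) +
  geom_boxplot()

# Printing a ggplot object draws it
print(p_scatter)
print(p_hist)
print(p_box)
```

Rows with missing values are dropped from each plot with a warning rather than an error.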