Statistics

Open Data

Life is full of statistical data, so why don’t we take advantage of it? I try to collect data whenever I notice an opportunity. Below are the datasets I have collected so far, which may be useful for practicing statistical analysis or for teaching an applied statistics course.

Weight-loss data
On and off, I have tried to lose weight and get into shape. This dataset comes from one of my weight-loss streaks in 2020-2021. There are six variables in the dataset.

  1. Date = The dates of data points
  2. kg = Actual weight in kilograms
  3. bmi = Body mass index, calculated as kg/m^2, where m is my height in meters (1.72 m)
  4. Day = Days since the first day
  5. Colombia = Whether the date falls before or after my trip to Colombia, which took place between 2020 and 2021
  6. Month = The month of the date
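
The two derived variables, bmi and Day, can be recomputed from Date and kg. The following is a minimal sketch on made-up rows, not the real data. The data frame name `toy`, its values, and the choice to count the first day as Day 1 are assumptions for illustration.

```r
# Hypothetical rows, made up for illustration; not taken from the real file
height_m <- 1.72                         # height given in the variable list
toy <- data.frame(
  Date = as.Date(c("2020-11-01", "2020-11-02", "2020-11-05")),
  kg   = c(88.0, 87.6, 87.1)
)

# BMI = weight in kilograms divided by squared height in meters
toy$bmi <- toy$kg / height_m^2

# Days since the first measurement; counting the first day as 1 is an assumption
toy$Day <- as.integer(toy$Date - min(toy$Date)) + 1L

print(toy)
```

Because the height is constant, bmi is just kg rescaled, so the two variables carry the same information in a regression.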

Sample Analysis

Here is a linear regression model of the weight-loss data. The model regresses kg on Day and Month and includes their interaction (Day*Month), so the daily rate of weight change can differ from month to month. The Colombia variable is not part of the model formula in the code below.

rm(list=ls()) # Clear R session
library(readxl) # Read Excel files 
library(sjPlot) # Plotting regression model 
library(ggplot2) # Plotting in general 

d <- read_excel("Weight2020.xlsx") # Load the dataset 
m <- lm(kg ~ Day*Month, d) # Linear model 
summary(m) # See the result 

Variable        Estimate   Std. Error
Intercept         85.518        1.000
Day               -0.101       -0.101
November           2.723        1.010
December           6.404        1.359
February          -0.704        1.146
March             -1.098        1.228
May                3.998        6.154
Day:November      -0.005        0.011
Day:December      -0.144        0.024
Day:February      -0.000        0.012
Day:March          0.005        0.011
Day:May           -0.026        0.043
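
In a model with a Day-by-Month interaction, the Day coefficient is the daily change in the reference month, and the daily change in any other month is that coefficient plus the month's interaction term. The sketch below does this arithmetic with the estimates from the table. The variable names are my own. The reference month is probably April: R orders character levels alphabetically, and April is the only month in the plotting code without its own coefficient. That is an inference, not something the output states.

```r
# Estimates copied from the regression table (kg per day)
day_slope <- -0.101                      # Day slope in the reference month

# Day-by-Month interaction estimates from the same table
interaction <- c(November = -0.005, December = -0.144, February = 0.000,
                 March = 0.005, May = -0.026)

# Implied daily change in weight within each non-reference month
month_slope <- day_slope + interaction
print(round(month_slope, 3))
```

For example, the implied December slope is -0.101 + (-0.144) = -0.245 kg per day, the steepest decline among the months listed.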


You can also visualize the data using the following code:

d$Month <- factor(d$Month, levels = c("November", "December", "February", "March", "April", "May")) # Order months chronologically
ggplot(d, aes(x = Day, y = kg, color = Month)) +
  geom_point() +
  geom_smooth(method = "lm") + # One fitted line per month
  xlab("Day (1-145)") + # x-axis shows Day
  ylab("Weight in kilograms") + # y-axis shows kg
  scale_y_continuous(breaks = seq(70, 90, 2.5), limits = c(70, 90)) +
  scale_x_continuous(breaks = seq(0, 145, 10), limits = c(0, 145)) +
  theme_bw()
 

Sleep data
For 128 days, I collected my sleep data using a Fitbit Charge 5, and I have just made the data publicly available. The file contains the following seven variables. I also have data on which sleep stage I was in, (a) light sleep, (b) deep sleep, or (c) REM sleep, which will be added soon.

  1. Date = The dates of data points
  2. Time = How long I slept in minutes
  3. Day = Day of the week
  4. Holiday = Whether it was a holiday in Japan
  5. Weekend = Whether it was on the weekend (Saturday or Sunday)
  6. Score = The sleep score that Fitbit Charge 5 gives for each sleep event
  7. Teaching = Whether I taught on that day
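
As a starting point for practice, Weekend can be derived from Date and used to compare sleep duration. This is a minimal sketch on made-up rows, not the real data. The data frame name `sleep`, its dates and values, and the logical coding of Weekend are assumptions; the published file may code these variables differently.

```r
# Hypothetical rows, made up for illustration; not taken from the real file
sleep <- data.frame(
  Date = as.Date(c("2023-01-06", "2023-01-07", "2023-01-08", "2023-01-09")),
  Time = c(390, 455, 480, 372)           # minutes slept
)

# Day-of-week index that does not depend on locale: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(sleep$Date)$wday
sleep$Weekend <- wday %in% c(0, 6)       # TRUE on Saturday or Sunday

# Convert minutes to hours for easier reading
sleep$Hours <- sleep$Time / 60

# Mean hours slept on weekdays versus weekends
weekend_means <- aggregate(Hours ~ Weekend, data = sleep, FUN = mean)
print(weekend_means)
```

Using the numeric weekday from `as.POSIXlt` avoids `weekdays()`, whose day names change with the system language.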

J-SLARF Stats SIG

The Japan Second Language Acquisition Forum (J-SLARF) has a special interest group (SIG) on applied statistics, with a focus on methods used in second language (L2) research.

History  Year       Topic                         Textbook
S1       2019-2020  Multilevel Modeling           Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York: Cambridge University Press.
S2       2020-2021  Bayesian Data Analysis        McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan (2nd ed.). Boca Raton, FL: CRC Press.
S3       2021-2022  Structural Equation Modeling  Kline, R. B. (2016). Principles and practice of structural equation modeling (4th ed.). New York: The Guilford Press.
S4       2023-2024  Applied Regression Modeling   Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. New York: Cambridge University Press.