05 May 2016
For a few years now I have given a guest lecture on time series analysis in our School’s Environmental Epidemiology course. The basic thrust of this lecture is that you should generally ignore what you read about time series modeling, either in papers or in books. The reason is that I find much of the time series literature is not particularly helpful when doing analyses in a biomedical or population health context, which is what I do almost all the time.
Prediction vs. Inference
First, most of the literature on time series models tends to assume that you are interested in prediction, that is, forecasting future values in a time series. I am almost never doing this. In my work looking at air pollution and mortality, the goal is never to find the best model that predicts mortality. In particular, if our goal were to predict mortality, we would probably never include air pollution as a predictor. This is because air pollution has an inherently weak association with mortality at the population level, whereas things like temperature and other seasonal factors tend to have a much stronger association.
What I am interested in doing is estimating an association between changes in air pollution levels and mortality and making some sort of inference about that association, either to a broader population or to other time periods. The challenges in these types of analyses include estimating weak associations in the presence of many stronger signals and appropriately adjusting for any potential confounding variables that similarly vary over time.
The reason the distinction between prediction and inference is important is that focusing on one vs. the other can lead you to very different model building strategies. Prediction modeling strategies will always want you to include in the model factors that are strongly correlated with the outcome and explain a lot of the outcome’s variation. If you’re trying to do inference and use a prediction modeling strategy, you may make at least two errors:
- You may conclude that your key predictor of interest (e.g. air pollution) is not important because the modeling strategy did not select it.
- You may omit important potential confounders because they have a weak relationship with the outcome (but maybe have a strong relationship with your key predictor). For example, one class of potential confounders in air pollution studies is other pollutants, which tend to be weakly associated with mortality but may be strongly associated with your pollutant of interest.
Random vs. Fixed
Another area where I feel much time series literature differs from my practice is on whether to focus on fixed effects or random effects. Most of what you might think of when you think of time series models (i.e. AR models, MA models, GARCH, etc.) focuses on modeling the random part of the model. Often, you end up treating time series data as random because you simply do not have any other data. But the reality is that in many biomedical and public health applications, patterns in time series data can be explained by clearly understood fixed patterns.
For example, take this time series here. It is lower at the beginning and at the end of the series, with higher levels in the middle of the period.
It’s possible that this time series could be modeled with an auto-regressive (AR) model or maybe an auto-regressive moving average (ARMA) model. Or it’s possible that the data are exhibiting a seasonal pattern. It’s impossible to tell from the data whether this is a random realization of this pattern or whether it’s something you’d expect every time. The problem is that we usually only have one observation from the time series. That is, we observe the entire series only once.
Now take a look at this time series. It exhibits some of the same properties as the first series: it’s low at the beginning and end and high in the middle.
Should we model this as a random process or as a process with a fixed pattern? That ultimately will depend on what type of data this is and what we know about it. If it’s air pollution data, we might do one thing, but if it’s stock market data, we might do a totally different thing.
If we were to see replicates of the time series, we’d be able to resolve the fixed vs. random question. For example, because I simulated the data above, I can simulate another replicate and see what happens. In the plot below I show two replications from each of the processes.
It’s clear now that the time series on the top row has a fixed “seasonal” pattern while the time series on the bottom row is random (in fact it is simulated from an AR(1) model).
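Here is a minimal sketch of how replicates like these could be simulated in R. The series length, the shape of the seasonal mean, the noise level, and the AR(1) coefficient of 0.95 are all assumptions chosen for illustration, not the values behind the plots above.

```r
set.seed(2016)
n <- 200
day <- seq_len(n)

# Fixed "seasonal" pattern: the same hump in every replicate, plus independent noise
seasonal_mean <- sin(pi * day / n)
fixed_reps <- replicate(2, seasonal_mean + rnorm(n, sd = 0.2))

# Random pattern: independent AR(1) draws with no fixed mean structure
ar1_reps <- replicate(2, as.numeric(arima.sim(model = list(ar = 0.95), n = n)))

# Correlation between the two fixed-pattern replicates, driven by the shared mean
cor(fixed_reps[, 1], fixed_reps[, 2])

# Top row: fixed pattern; bottom row: AR(1)
op <- par(mfrow = c(2, 2))
for (j in 1:2) plot(day, fixed_reps[, j], type = "l", ylab = "y",
                    main = paste("Fixed pattern, replicate", j))
for (j in 1:2) plot(day, ar1_reps[, j], type = "l", ylab = "y",
                    main = paste("AR(1), replicate", j))
par(op)
```

Because the seasonal mean is shared, the two fixed-pattern replicates rise and fall together, whereas each AR(1) replicate wanders on its own.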
The point here is that I think very often we know things about the time series we’re modeling that introduce fixed variation into the data: seasonal patterns, day-of-week effects, and long-term trends. Furthermore, there may be other time-varying covariates that can help predict whatever time series we’re modeling and can be put into the fixed part of the model (a.k.a. regression modeling). Ultimately, when many of these fixed components are accounted for, there’s relatively little of interest left in the residuals.
What to Model?
So the question remains: What should I do? The short answer is to try to incorporate everything that you know about the data into the fixed/regression part of the model. Then take a look at the residuals and see if you still care.
Here’s a quick example from my work in air pollution and mortality. The data are all-cause mortality and PM10 pollution from Detroit for the years 1987–2000. The question is whether daily mortality is associated with daily changes in ambient PM10 levels. We can try to answer that with a simple linear regression model:
Call:
lm(formula = death ~ pm10, data = ds)
Residuals:
    Min      1Q  Median      3Q     Max
-26.978  -5.559  -0.386   5.109  34.022
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.978826   0.112284 418.394   <2e-16
pm10         0.004885   0.001936   2.523   0.0117
Residual standard error: 8.03 on 5112 degrees of freedom
Multiple R-squared: 0.001244, Adjusted R-squared: 0.001049
F-statistic: 6.368 on 1 and 5112 DF, p-value: 0.01165
PM10 appears to be positively associated with mortality, but when we look at the autocorrelation function of the residuals, we see
If we see a seasonal-like pattern in the auto-correlation function, then that means there’s a seasonal pattern in the residuals as well. Not good.
But okay, we can just model the seasonal component with an indicator of the season.
Call:
lm(formula = death ~ season + pm10, data = ds)
Residuals:
    Min      1Q  Median      3Q     Max
-25.964  -5.087  -0.242   4.907  33.884
Coefficients:
             Estimate Std. Error t value    Pr(>|t|)
(Intercept) 50.830458   0.215679 235.676     < 2e-16
seasonQ2    -4.864167   0.304838 -15.957     < 2e-16
seasonQ3    -6.764404   0.304346 -22.226     < 2e-16
seasonQ4    -3.712292   0.302859 -12.258     < 2e-16
pm10         0.009478   0.001860   5.097 0.000000358
Residual standard error: 7.649 on 5109 degrees of freedom
Multiple R-squared: 0.09411, Adjusted R-squared: 0.09341
F-statistic: 132.7 on 4 and 5109 DF, p-value: < 2.2e-16
Note that the coefficient for PM10, the coefficient of real interest, roughly doubles (from about 0.0049 to about 0.0095) when we add the seasonal component.
When we look at the residuals now, we see
The seasonal pattern is gone, but we see that there’s positive autocorrelation at seemingly long distances (~100s of days). This is usually an indicator that there’s some sort of long-term trend in the data. Since we only care about the day-to-day changes in PM10 and mortality, it would make sense to remove any such long-term trend. I can do that by just including the date as a linear predictor.
Call:
lm(formula = death ~ season + date + pm10, data = ds)
Residuals:
    Min      1Q  Median      3Q     Max
-23.407  -5.073  -0.375   4.718  32.179
Coefficients:
               Estimate  Std. Error t value    Pr(>|t|)
(Intercept) 60.04317325  0.64858433  92.576     < 2e-16
seasonQ2    -4.76600268  0.29841993 -15.971     < 2e-16
seasonQ3    -6.56826695  0.29815323 -22.030     < 2e-16
seasonQ4    -3.42007191  0.29704909 -11.513     < 2e-16
date        -0.00106785  0.00007108 -15.022     < 2e-16
pm10         0.00933871  0.00182009   5.131 0.000000299
Residual standard error: 7.487 on 5108 degrees of freedom
Multiple R-squared: 0.1324, Adjusted R-squared: 0.1316
F-statistic: 156 on 5 and 5108 DF, p-value: < 2.2e-16
Now we can look at the autocorrelation function one last time.
The ACF trails to zero reasonably quickly now, but there’s still some autocorrelation at short lags up to about 15 days or so.
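The sequence of fits above can be sketched on simulated data. The data frame below is only a stand-in for `ds`, not the Detroit data: the effect sizes, the noise model, and the helper `acf_at()` are assumptions made for illustration, and `season` is built with `quarters()` only because the labels Q2 to Q4 in the output suggest calendar quarters.

```r
set.seed(1)

# Simulated stand-in for the Detroit data (all numbers here are made up)
dates <- seq(as.Date("1987-01-01"), as.Date("2000-12-31"), by = "day")
n <- length(dates)
doy <- as.numeric(format(dates, "%j"))
ds <- data.frame(
  date   = as.numeric(dates),        # days since 1970-01-01
  season = factor(quarters(dates)),  # Q1, Q2, Q3, Q4
  pm10   = rnorm(n, mean = 0, sd = 20)
)
ds$death <- 50 +
  4 * cos(2 * pi * doy / 365.25) -                       # seasonal cycle, winter peak
  0.001 * (ds$date - ds$date[1]) +                       # slow downward trend
  0.01 * ds$pm10 +                                       # weak pollution effect
  as.numeric(arima.sim(list(ar = 0.3), n = n, sd = 7))   # autocorrelated noise

fit1 <- lm(death ~ pm10, data = ds)
fit2 <- lm(death ~ season + pm10, data = ds)
fit3 <- lm(death ~ season + date + pm10, data = ds)

# Residual autocorrelation at a single lag (hypothetical helper)
acf_at <- function(fit, lag) {
  acf(resid(fit), lag.max = lag, plot = FALSE)$acf[lag + 1]
}

# Lag 365 picks up seasonality and trend that are left in the residuals
sapply(list(pm10_only = fit1, season = fit2, season_date = fit3), acf_at, lag = 365)

# Full ACF and PACF of the final model's residuals
acf(resid(fit3), lag.max = 50)
pacf(resid(fit3), lag.max = 50)
```

The lag-365 autocorrelation gives a one-number check at each step; the plots from `acf()` and `pacf()` are what you would actually look at to decide whether any short-lag structure remains.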
Now we can engage in some traditional time series modeling. We might want to model the residuals with an auto-regressive model of order p. What should p be? We can check by looking at the partial autocorrelation function (PACF).
The PACF seems to suggest we should fit an AR(6) or AR(7) model. Let’s use an AR(6) model and see how things look. We can use the arima() function for that.
Call:
arima(x = y, order = c(6, 0, 0), xreg = m, include.mean = FALSE)
Coefficients:
         ar1     ar2     ar3     ar4     ar5     ar6  (Intercept)
      0.0869  0.0933  0.0733  0.0454  0.0377  0.0489      59.8179
s.e.  0.0140  0.0140  0.0141  0.0141  0.0140  0.0140       1.0300
      seasonQ2  seasonQ3  seasonQ4     date    pm10
       -4.4635   -6.2778   -3.2878  -0.0011  0.0096
s.e.    0.4569    0.4624    0.4546   0.0001  0.0018
sigma^2 estimated as 53.69: log likelihood = -17441.84, aic = 34909.69
Note that the coefficient for PM10 hasn’t changed much from our initial models. The usual concern with not accounting for residual autocorrelation is that the variance/standard error of the coefficient of interest will be affected. In this case, there does not appear to be much of a difference between using the AR(6) to account for the residual autocorrelation and ignoring it altogether. Here’s a comparison of the standard errors for each coefficient.
                Naive  AR model
(Intercept)  0.648584  1.030007
seasonQ2     0.298420  0.456883
seasonQ3     0.298153  0.462371
seasonQ4     0.297049  0.454624
date         0.000071  0.000114
pm10         0.001820  0.001819
The standard errors for the pm10 variable are almost identical, while the standard errors for the other variables are somewhat bigger in the AR model.
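A comparison like this one can be sketched as follows. The data are simulated (every number is made up), and I am assuming that `y` is the outcome vector and `m` is the design matrix from the regression fit, which is consistent with the `(Intercept)` column appearing in the arima() output even though `include.mean = FALSE`.

```r
set.seed(2)

# Simulated stand-in for the Detroit data (all numbers here are made up)
dates <- seq(as.Date("1987-01-01"), as.Date("2000-12-31"), by = "day")
n <- length(dates)
ds <- data.frame(
  date   = as.numeric(dates),        # days since 1970-01-01
  season = factor(quarters(dates)),  # Q1, Q2, Q3, Q4
  pm10   = rnorm(n, sd = 20)
)
ds$death <- 55 - 0.001 * (ds$date - ds$date[1]) + 0.01 * ds$pm10 +
  c(0, -5, -7, -3)[as.integer(ds$season)] +
  as.numeric(arima.sim(list(ar = 0.3), n = n, sd = 7))

# Naive fit that ignores residual autocorrelation
fit_lm <- lm(death ~ season + date + pm10, data = ds)

# Same regression with AR(6) errors; the intercept is already a column of m,
# so include.mean = FALSE avoids fitting it twice
y <- ds$death
m <- model.matrix(fit_lm)
fit_ar <- arima(x = y, order = c(6, 0, 0), xreg = m, include.mean = FALSE)

# Standard errors side by side, matched by coefficient name
se_naive <- summary(fit_lm)$coefficients[, "Std. Error"]
se_ar <- sqrt(diag(fit_ar$var.coef))[names(se_naive)]
round(cbind(Naive = se_naive, "AR model" = se_ar), 6)
```

Indexing `se_ar` by `names(se_naive)` drops the six AR coefficients, so the table only compares the regression terms.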
Ultimately, I’ve found that in many biomedical and public health applications, time series modeling is very different from what I read in the textbooks. The key takeaways are:
- Make sure you know if you’re doing prediction or inference. Most often you will be doing inference, in which case your modeling strategies will be quite different.
- Focus separately on the fixed and random parts of the model. In particular, work with the fixed part of the model first, incorporating as much information as you can that will explain variability in your outcome.
- Model the random part appropriately, after incorporating as much as you can into the fixed part of the model. Classical time series models may be of use here, but also simple robust variance estimators may be sufficient.
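As one example of a simple robust variance estimator, here is a hand-rolled sketch of Newey-West (Bartlett kernel) standard errors in base R, applied to a generic simulated regression rather than the mortality data. The function name `newey_west_se()`, the choice of `max_lag = 20`, and the simulation settings are all assumptions for illustration; in real work a tested implementation, such as the one in the sandwich package, would be the safer choice.

```r
# Newey-West (HAC) standard errors for an lm fit, no small-sample adjustment
newey_west_se <- function(fit, max_lag) {
  X <- model.matrix(fit)
  e <- resid(fit)
  n <- nrow(X)
  Xe <- X * e                  # row t is x_t * e_t
  S <- crossprod(Xe)           # lag-0 term
  for (l in seq_len(max_lag)) {
    w <- 1 - l / (max_lag + 1) # Bartlett weight
    G <- crossprod(Xe[(l + 1):n, , drop = FALSE], Xe[1:(n - l), , drop = FALSE])
    S <- S + w * (G + t(G))
  }
  bread <- solve(crossprod(X))
  sqrt(diag(bread %*% S %*% bread))
}

# Toy example: both the predictor and the errors are autocorrelated
set.seed(42)
n <- 1000
x <- as.numeric(arima.sim(list(ar = 0.7), n = n))
y <- 1 + 0.5 * x + as.numeric(arima.sim(list(ar = 0.7), n = n))
fit <- lm(y ~ x)

se_naive <- summary(fit)$coefficients[, "Std. Error"]
se_hac <- newey_west_se(fit, max_lag = 20)
cbind(Naive = se_naive, HAC = se_hac)
```

Here the coefficient estimates are left alone and only their standard errors are adjusted, which is why this approach can be enough when the fixed part of the model has already absorbed most of the structure.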
27 Apr 2016
Annika Salzberg is currently an undergraduate at Haverford College majoring in biology. While in high school here in Baltimore she developed and taught an R class to her classmates at the Park School. Her interest in R grew out of a project where she and her fellow students and teachers went to the Canadian sub-Arctic to collect data on permafrost depth and polar bears. When analyzing the data she learned R (with the help of a teacher) to be able to do the analyses, some of which she did on her laptop while out in the field.
Later she worked on developing a course that she felt was friendly and approachable enough for her fellow high-schoolers to benefit from. With the help of Steven Salzberg and the folks here at the JHU DSL, she built a class she calls R for the intimidated, which just launched on DataCamp and which you can take for free!
The class is a great introduction for people who are just getting started with R. It walks through R/RStudio, package installation, data visualization, data manipulation, and a final project. We are super excited about the content that Annika created working here at Hopkins and think you should go check it out!
21 Apr 2016
Editor’s note - This is a chapter from my book How to be a modern scientist where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.
Writing - what should I do and why?
Write using collaborative software to avoid version control issues.
On almost all modern scientific papers you will have co-authors. The traditional way of handling this was to create a single working document and pass it around. Unfortunately this system always results in a long collection of versions of a manuscript, which are often hard to distinguish and definitely hard to synthesize.
An alternative approach is to use formal version control systems like Git and GitHub. However, the overhead for using these systems can be pretty heavy for paper authoring. They also require all parties participating in the writing of the paper to be familiar with version control and the command line.
Alternative paper authoring tools are now available that provide some of the advantages of version control without the major overhead involved in using base version control systems.
Make figures the focus of your writing
Once you have a set of results and are ready to start writing up the paper the first thing is not to write. The first thing you should do is create a set of 1-10 publication-quality plots with 3-4 as the central focus (see Chapter 10 here for more information on how to make plots). Show these to someone you trust to make sure they “get” your story before proceeding. Your writing should then be focused around explaining the story of those plots to your audience. Many people, when reading papers, read the title, the abstract, and then usually jump to the figures. If your figures tell the whole story you will dramatically increase your audience. It also helps you to clarify what you are writing about.
Write clearly and simply even though it may make your papers harder to publish.
Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. Referees are trained to find things to criticize, and when you simplify your argument they may assume that what you have done is easy or just like what has been done before. This can be extremely frustrating during the peer review process. But the peer review process isn’t the end goal of publishing! The point of publishing is to communicate your results to your community and beyond so they can use them. Simple, clear language leads to much higher use/reading/citation/impact of your work.
Include links to code, data, and software in your writing
Not everyone recognizes the value of re-analysis, scientific software, or data and code sharing. But it is the fundamental cornerstone of the modern scientific process to make all of your materials easily accessible, re-usable and checkable. Include links to data, code, and software prominently in your abstract, introduction and methods and you will dramatically increase the use and impact of your work.
Give credit to others
In academics the main currency we use is credit for publication. In general assigning authorship and getting credit can be a very tricky component of the publication process. It is almost always better to err on the side of offering credit. A very useful test that my advisor John Storey taught me is this: if you would be embarrassed to explain the authorship credit to anyone, whether they were on the paper or not, then you probably haven’t shared enough credit.
WYSIWYG software: Google Docs and Paperpile.
This system uses Google Docs for writing and Paperpile for reference management. If you have a Google account you can easily create documents and share them with your collaborators either privately or publicly. Paperpile allows you to search for academic articles and insert references into the text using a system that will be familiar if you have previously used Endnote and Microsoft Word.
This system has the advantage of being a what-you-see-is-what-you-get (WYSIWYG) system: anyone with basic text processing skills should be immediately able to contribute. Google Docs also automatically saves versions of your work so that you can flip back to older versions if someone makes a mistake. You can also easily see which part of the document was written by which person and add comments.
- Set up accounts with Google and with Paperpile. If you are an academic the Paperpile account will cost $2.99 a month, but there is a 30 day free trial.
- Go to Google Docs and create a new document.
- Set up the Paperpile add-on for Google Docs
- In the upper right hand corner of the document, click on the Share link and share the document with your collaborators
- Start editing
- When you want to include a reference, place the cursor where you want the reference to go, then using the Paperpile menu, choose insert citation. This should give you a search box where you can search by Pubmed ID or on the web for the reference you want.
- Once you have added some references use the Citation style option under Paperpile to pick the citation style for the journal you care about.
- Then use the Format citations option under Paperpile to create the bibliography at the end of the document
The nice thing about using this system is that everyone can easily directly edit the document simultaneously - which reduces conflict and difficulty of use. A disadvantage is getting the formatting just right for most journals is nearly impossible, so you will be sending in a version of your paper that is somewhat generic. For most journals this isn’t a problem, but a few journals are sticklers about this.
Typesetting software: Overleaf or ShareLatex
An alternative approach is to use typesetting software like LaTeX. This requires a little bit more technical expertise since you need to understand the LaTeX typesetting language. But it allows for more precise control over what the document will look like. Using LaTeX on its own you will have many of the same issues with version control that you would have for a Word document. Fortunately there are now “Google Docs like” solutions for editing LaTeX code that are readily available. Two of the most popular are Overleaf and ShareLatex.
In either system you can create a document, share it with collaborators, and edit it in a similar manner to a Google Doc, with simultaneous editing. Under both systems you can save versions of your document easily as you move along so you can quickly return to older versions if mistakes are made.
I have used both kinds of software, but now primarily use Overleaf because they have a killer feature. Once you have finished writing your paper you can directly submit it to some preprint servers like arXiv or bioRxiv and even some journals like PeerJ or F1000Research.
- Create an Overleaf account. There is a free version of the software. Paying $8/month will give you easy saving to Dropbox.
- Click on New Project to create a new document and select from the available templates
- Open your document and start editing
- Share with colleagues by clicking on the Share button within the project. You can share either a read only version or a read and edit version.
Once you have finished writing your document you can click on the Publish button to automatically submit your paper to the available preprint servers and journals. Or you can download a pdf version of your document and submit it to any other journal.
Writing - further tips and issues
When to write your first paper
As soon as possible! The purpose of graduate school is (in some order):
- Time to discover new knowledge
- Time to dive deep
- Opportunity for leadership
- Opportunity to make a name for yourself
- Get a job
The first couple of years of graduate school are typically focused on (1) teaching you all the technical skills you need and (2) data dumping as much hard-won practical experience from more experienced people into your head as fast as possible.
After that, one of your main focuses should be on establishing your own program of research and reputation. Especially for Ph.D. students, it cannot be emphasized enough that no one will care about your grades in graduate school, but everyone will care about what you produced. See, for example, Sherri’s excellent guide on CVs for academic positions.
I firmly believe that R packages and blog posts can be just as important as papers, but the primary signal to most traditional academic communities still remains published peer-reviewed papers. So you should get started on writing them as soon as you can (definitely before you feel comfortable enough to try to write one).
Even if you aren’t going to be in academics, papers are a great way to show off that you can (a) identify a useful project, (b) finish a project, and (c) write well. So the first thing you should be asking when you start a project is “what paper are we working on?”
What is an academic paper?
A scientific paper can be distilled into four parts:
- A set of methodologies
- A description of data
- A set of results
- A set of claims
When you (or anyone else) write a paper the goal is to communicate clearly items 1-3 so that they can justify the set of claims you are making. Before you can even write down 4 you have to do 1-3. So that is where you start when writing a paper.
How do you start a paper?
The first thing you do is you decide on a problem to work on. This can be a problem that your advisor thought of or it can be a problem you are interested in, or a combination of both. Ideally your first project will have the following characteristics:
- Solves a scientific problem
- Gives you an opportunity to learn something new
- Something you feel ownership of
- Something you want to work on
The last two points can’t be emphasized enough. Others can try to help you come up with a problem, but if you don’t feel like it is your problem it will make writing the first paper a total slog. You want to find an option where you are just insanely curious to know the answer at the end, to the point where you just have to figure it out and kind of don’t care what the answer is. That doesn’t always happen, but that makes the grind of writing papers go down a lot easier.
Once you have a problem the next step is to actually do the research. I’ll leave this for another guide, but the basic idea is that you want to follow the usual data analytic process:
- Define the question
- Get/tidy the data
- Explore the data
- Build/borrow a model
- Perform the analysis
- Check/critique results
- Write things up
The hardest part for the first paper is often knowing where to stop and start writing.
How do you know when to start writing?
Sometimes this is an easy question to answer. If you started with a very concrete question, then you are done once you have completed enough analysis to convince yourself that you have the answer to the question. If the answer to the question is interesting/surprising then it is time to stop and write.
If you started with a question that wasn’t so concrete then it gets a little trickier. The basic idea here is that you have convinced yourself you have a result that is worth reporting. Usually this takes the form of between 1 and 5 figures that show a coherent story that you could explain to someone in your field.
In general one thing you should be working on in graduate school is your own internal timer that tells you, “ok we have done enough, time to write this up”. I found this one of the hardest things to learn, but if you are going to stay in academics it is a critical skill. There are rarely deadlines for paper writing (unless you are submitting to CS conferences) so it will eventually be up to you when to start writing. If you don’t have a good clock, this can really slow down your ability to get things published and promoted in academics.
One good principle to keep in mind is “the perfect is the enemy of the very good”. Another one is that a published paper in a respectable journal beats a paper you just never submit because you want to get it into the “best” journal.
A note on “negative results”
If the answer to your research problem isn’t interesting/surprising but you started with a concrete question it is also time to stop and write. But things often get more tricky with this type of paper as most journals when reviewing papers filter for “interest” so sometimes a paper without a really “big” result will be harder to publish. This is ok!! Even though it may take longer to publish the paper, it is important to publish even results that aren’t surprising/novel. I would much rather that you come to an answer you are comfortable with and we go through a little pain trying to get it published than you keep pushing until you get an “interesting” result, which may or may not be justifiable.
How do you start writing?
- Once you have a set of results and are ready to start writing up the paper the first thing is not to write. The first thing you should do is create a set of 1-4 publication-quality plots (see Chapter 10 here). Show these to someone you trust to make sure they “get” your story before proceeding.
- Start a project on Overleaf or Google Docs.
- Write up a story around the four plots in the simplest language you feel you can get away with, while still reporting all of the technical details that you can.
- Go back and add references in only after you have finished the whole first draft.
- Add in additional technical detail in the supplementary material if you need it.
- Write up a reproducible version of your code that returns exactly the same numbers/figures in your paper with no input parameters needed.
What are the sections in a paper?
Keep in mind that most people will read the title of your paper only, a small fraction of those people will read the abstract, a small fraction of those people will read the introduction, and a small fraction of those people will read your whole paper. So make sure you get to the point quickly!
The sections of a paper are always some variation on the following:
- Title: Should be very short, no colons if possible, and state the main result. For example, “A new method for sequencing data that shows how to cure cancer”. Here you want to make sure people will read the paper without overselling your results; this is a delicate balance.
- Abstract: In (ideally) 4-5 sentences explain (a) what problem you are solving, (b) why people should care, (c) how you solved the problem, (d) what are the results and (e) a link to any data/resources/software you generated.
- Introduction: A more lengthy (1-3 pages) explanation of the problem you are solving, why people should care, and how you are solving it. Here you also review what other people have done in the area. The most critical thing is never underestimate how little people know or care about what you are working on. It is your job to explain to them why they should.
- Methods: You should state and explain your experimental procedures, how you collected results, your statistical model, and any strengths or weaknesses of your proposed approach.
- Comparisons (for methods papers): Compare your proposed approach to the state of the art methods. Do this with simulations (where you know the right answer) and data you haven’t simulated (where you don’t know the right answer). If you can base your simulation on data, even better. Make sure you are simulating both the easy case (where your method should be great) and harder cases where your method might be terrible.
- Your analysis: Explain what you did, what data you collected, how you processed it and how you analysed it.
- Conclusions: Summarize what you did and explain why what you did is important one more time.
- Supplementary Information: If there are a lot of technical computational, experimental or statistical details, you can include a supplement that has all of the details so folks can follow along. As far as possible, try to include the detail in the main text but explained clearly.
The length of the paper will depend a lot on which journal you are targeting. In general the shorter/more concise the better. But unless you are shooting for a really glossy journal you should try to include the details in the paper itself. This means most papers will be in the 4-15 page range, but with a huge variance.
Note: Part of this chapter appeared in the Leek group guide to writing your first paper