Puffins:

Regression models for (strictly) periodic signals in time-series data

This work is motivated by a specific workflow that I wanted to generalise.

For context, I am interested in finding pulsations (star quakes that manifest as time-series brightness variations) in stars that exist in binary systems (pairs of stars) - specifically I want to find pulsations that are modified by tides in eclipsing binary systems. In another project, I discuss the details of how we normally model the time-series signature of eclipsing binary stars, so I’ll skip that here. While that procedure is certainly necessary, it is immensely time and resource intensive. So, instead, we want to come up with an effective modelling procedure that robustly models the strictly periodic signature of the eclipsing binary and removes it from the time series. To do this, I created the python package: puffins. puffins uses a regression model with a feature embedding and weighting to flexibly and efficiently model the eclipse signal generated by binary stars.

The science:

An initial study of U Gru identified a series of low-amplitudes pulsations hidden beneath the eclipse signal - I’ll walk through the initial workflow and highlight why I wanted to find a faster way of doing things.

The typical procedure is to calculate a Fourier transform or periodogram of the data to analyse the time series’ frequency content. However, just naively doing this for U Gru would reveal a forest of harmonics (integer multiples of the orbital frequency) because the eclipse signal is so highly non-sinusoidal (as seen in the periodogram on the left).

The pulsations are at much lower amplitudes (< 500 ppm) compared to the binary signal (>100,000 ppm), so they are effectively masked by the binary signal. We could only see the pulsations by removing the binary signal. While this is normally a well constrained problem (see this page), U Gru is a complicated case where we can barely detect the light from the second star, and therefore have a poorly constrained model of the eclipse signal. So, to avoid injecting bias into our results with a poorly constrained model, we instead opted to use a rudimentary sum of sine waves at each harmonic that we iteratively removed one at a time. This allowed us to find some pulsations at low amplitudes and high frequencies (residuals periodogram plot on the right) that we never would have been able to detect without removing the binary signal.

The data:

In this case, we’re working with a time series of brightness measurements (a light curve) obtained by NASA’s TESS telescope. The telescope regularly takes data every 2 minutes, producing the light curve seen in the figure to our left, which is a lightcurve (time-series brightness measurements) of the eclipsing binary star system U Gru (great name). In addition to the rather large gap in the middle of the dataset, you’ll also likely notice the very regular signal present in the data. In total, we have 15,000 data points taken every two minutes (with exception to the gaps) for 27.9 days.

Since the two stars in a binary system are gravitationally bound to one another (like the Earth around the Sun) they orbit one another with a regular period, in this case, once every 1.885 days. As the two stars orbit around one another, they are positioned such that they partially block each other out, leading to the repeated dips in brightness (flux) in the y-axis.

The problem

The main issue arises from our choice of method for removing the binary signal. In the case of U Gru, the low-amplitude signal of interest also happens to be very closely spaced in frequency to the harmonic signal (red vertical lines in the plot above), so we have to be very careful with how we remove the eclipse signal so that we don’t influence the other signal. In our discovery paper, we used the brute force iterative modelling of the harmonics one at a time. In our follow-up analysis, we used a bespoke method where we phase-folded the time series data so that it fell onto the cyclic period of the orbit (using the formula: phase = (time - t0) / period % 1, where t0 is an arbitrary reference time). Doing this places the data between 0 and 1 on the x-axis with periodic boundaries. After folding the data, we divided the data unevenly into bins where we calculated the average per bin and then interpolated between bins to get an effective interpolation model.

Although this method worked, it was very time consuming to decide how many bins to use (too many and we pick up noisy substructure; too few and we smooth out the eclipse signal), and where to put them since the eclipse signal has both smoothly undulating and sharply varying features. Doing this poorly will result in a bad model that doesn’t fully remove the eclipse signal, and therefore makes it difficult or impossible to identify interesting signals in the residual! Thus, this workflow isn’t really useful on large scales. Furthermore, the general case of modelling binaries is highly degenerate without additional radial velocity and spectroscopic data, which we don’t have for the VAST majority of eclipsing binaries. So, the real question is, how do we build flexible, efficient, and generalisable methods for modelling the full breadth of signals produced by eclipsing binary stars?

Regression to the rescue!

Image Credit: IMSL

We often think of regression in the simplest case: fitting a line to data. However, regression can be very flexible in finding any form of relationship between some predictor (X) and some set of outputs (Y). What’s more is that we can use a transformation of our predictors (X) to create a feature embedding that allows us to model our data with any form of basis that we want. In this case, we know that we are dealing with strictly periodic signals in eclipsing binaries, so we want to use a periodic basis - the natural choice of which is a Fourier basis. We take our times t and expand them into pairs: cos(2π f t) sin(2π f t) for each harmonic that we want to consider. So, for K harmonics, our design matrix X has 2K + 1 columns - 2K for each pair of sine and cosine terms per harmonic, and 1 additional column of ones for the offset term.

While we can solve this with an ordinary least-squares, the higher order harmonics can have wildly oscillating behaviour trying to hit every data point since they’re not otherwise penalised. We could use a Ridge Regression, which applies a uniform penalty to all regression terms, but this doesn’t always behave well (as we can see in some examples that I provide in the notebooks). Instead, we will use a feature weighting to accompany the feature embedding that we’re performing! Since we are working in a Fourier basis, it makes sense to use a feature weighting that preferentially penalizes higher order harmonic terms as kj→K

In this case, our Feature Weighted regression solution is: βFW=(XTWX+Λ)-1XTWY,

Here, the diagonal components of Λ are given by [Λ-1]jj = (1+ω2j s2)2, where s is a parameter that controls the effective width of our feature, and ωj is the (angular) frequency at the kj-th harmonic: ωj=kj⋅2πf. Thus, we see that the weight of the feature decreases as the harmonic and therefore frequency increases.

Results

Now that we know what puffins does mathematically, let's apply it to our example of U Gru.

On the right, we can see an example of how puffins flexibly models (red) the highly non-sinusoidal eclipse signal (grey).

Below, we can see that when we subtract the model out, we are left with the red periodogram of the residuals, displaying the pulsation signal that we were after -- all without a full eclipse model. Importantly, we can observe that this is a really well fit model because we don't have excess power in any harmonics of the orbital frequency!

Here are some jupyter notebooks exploring the properties and use-cases of the puffins workflow.

Next
Next

Time series modelling with Markov Chain Monte Carlo