What Is the Probability of Houston Flooding Again?

You've likely heard that the flooding in Houston following Hurricane Harvey is reportedly a 500-year or even 1000-year flood event. You've perhaps also heard that this is the third 500-year flood that Houston has experienced in a three-year span, which calls into serious question the usefulness or accuracy of the "500-year flood" designation. This made me wonder: what's our actual best guess of the yearly risk of a Harvey-level flood, according to the data? That is the question I will attempt to answer here.

As a spoiler, I estimate that Harvey caused roughly 50-120 year flood events at most sites in Harris County (the Texas county where Houston is located) where we have data, but the estimates depend strongly on the specific site: estimates at some sites are considerably smaller, and the estimate at one site (Vince Bayou at Pasadena, TX) does indeed hover in the 1000-year neighborhood. In the rest of this post I will describe how I arrived at these estimates. I've posted an R Markdown HTML document which gives all the code for reproducing this analysis, plus some additional tidbits.

Disclaimer: I am not a hydrologist. In fact, I know almost nothing about hydrology. What I am is an out-of-work data scientist. I don't claim that the analysis I present below is especially savvy or sophisticated (it's possible I've overlooked some issues that would be more obvious to the hydrologically initiated), but I do think my analysis at least addresses all or most of the major methodological issues that need to be handled in such an analysis.

The N-year flood concept

As the sources I linked above explain very clearly, the N-year flood is a probabilistic notion based on the estimated chance that flood waters will reach a certain height in any given year. In a nutshell, an N-year flood is defined as the flood water height that a given area has a 1/N probability of reaching in a particular year. For example, a 500-year flood has a 1/500 or 0.2% chance of occurring in a particular year, by definition. Now, it does not automatically follow that an N-year flood is expected to occur about every N years on average, but that conclusion does follow if we add the somewhat reasonable assumption that the flood events are independent from one year to the next.[1] That assumption is probably not exactly true, due to factors like El Niño / La Niña cycles (or the Atlantic Niño in Harvey's case), but it is probably a good enough approximation that the intuitive interpretation "500-year floods are expected every 500 years" is reasonably true.

Now, it would be fantastically unlikely for Houston to experience three 500-year floods in three years if these were "truly" 500-year floods. A much more likely explanation is that the underlying risk estimate (a 0.2% chance of occurring in a year) is based on a bad statistical model. We can use the historical stream gage[2] height data from the U.S. Geological Survey (USGS; see also Harris County, TX's very nice interface) to come up with a better estimate.

Here's a very high-level outline of what the analysis looks like:

  1. Query the USGS web server to retrieve the last 20 years of gage height data from all stream gage sites in Harris County, TX.
  2. Detrend the data using a one-year rolling median.
  3. Estimate the cumulative distribution function (CDF) for each site's data using kernel methods, then locate Harvey in the CDFs to obtain the daily risk estimates.
  4. Convert the daily risk estimates to N-year flood estimates by making some assumptions about the form of temporal dependence.
  5. Bootstrap to obtain estimates of uncertainty.

Retrieving the data is not too exciting, so I won't say much about it in this summary post other than to point out the USGS Daily Values Web Service, which is a really nice system the USGS has set up for retrieving nicely formatted data via HTTP requests. (For those interested in this sort of thing, all the details and syntax are documented in the R Markdown document.) Instead we'll jump ahead to the part where we have a nice, clean dataset.
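For the curious, here's a minimal sketch in R of what such a query might look like, assuming the dataRetrieval package as a convenient wrapper around the web service; the site number and date range are purely illustrative, and the original analysis built the HTTP requests directly as described in the R Markdown document.

    # Retrieve ~20 years of daily mean gage height (USGS parameter 00065,
    # statistic 00003) for one hypothetical Harris County site.
    library(dataRetrieval)

    site_no <- "08074500"  # hypothetical site number, for illustration only
    gage <- readNWISdv(siteNumbers = site_no,
                       parameterCd = "00065",  # gage height, in feet
                       statCd      = "00003",  # daily mean
                       startDate   = "1997-09-01",
                       endDate     = "2017-09-01")
    head(gage)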

Detrending the data

The data are daily mean gage heights for 21 stream gages spread throughout Harris County, TX. There are more than 21 gage sites in the county, but these are the ones that had mean gage height data available.[3] Here's what the data look like at each site:

Figure 1: The raw data.

It generally looks pretty well-behaved, with the obvious exception being some huge, discontinuous jumps in gage height that appear at some of the sites. The top six sites show the biggest jumps; as you can see, these are all (and the only ones) from Cypress Creek. I have no idea what went down at Cypress Creek to cause these discontinuities, or why the jumps don't at least all happen at or around the same time.

In principle, what we care about in measuring these flood levels are the deviations from the typical stream heights, whatever those typical heights might be at that time, even if they have changed over the years (whether through a slow drift or a big change all at once). In other words, we are interested in the detrended data. Plus, more practically speaking, we want to be able to still use the data prior to those discontinuous jumps in stream height that we saw above. One way or another, that requires accounting for the jumps by detrending them out.

A rolling median is a good candidate for a nonparametric estimator of the trend line here because (a) it won't be affected much by big, sudden shocks to the gage height (e.g., from floods), thereby leaving that variability to go into the CDF, which we want; and (b) in theory, a rolling median can model the discontinuous jumps much better than, say, a rolling mean, because the median is not a "smooth" statistic like the mean is.

There is the question of the optimal window width for the rolling median (i.e., how many observations to take the median over in each direction). One idea would be to regard this as a hyperparameter to be tuned via cross-validation, but note that in this case minimizing the expected prediction error is not actually our goal. Rather, our goal is to remove the biggest, coarsest trends in the data, while leaving all of the variability around that trend line in the data so that it will go into the estimated CDF. The trend line that minimizes the expected prediction error would actually follow the data a bit too closely for our purposes, getting pulled upward by big flood events and thereby eating up some of the variability that we want to keep. In other words, we want the trend line to underfit the data to some extent.

With a little experimentation I found that a one-year window worked pretty well. Here's an example of what that looks like at a single representative site:

Figure 2: Black lines = data, red line = rolling median trend line.

Looks great! The trend line remains relatively flat even in the presence of large local floods, but still perfectly captures the weird discontinuous jump in the time series as well as any very slow drifts over time.
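For concreteness, here's a rough sketch of what this detrending step might look like in R for a single site, assuming a data frame gage with a height column of daily mean gage heights; the use of the zoo package and the exact window handling are my own illustrative choices.

    library(zoo)

    # One-year centered rolling median as the trend line
    # (k must be odd for rollmedian; 365 daily observations is about one year)
    k <- 365
    trend <- rollmedian(gage$height, k = k, fill = NA, align = "center")

    # The detrended series: each day's deviation from the local median
    detrended <- gage$height - trend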

Let's fit one-year moving medians to all of the sites, detrend the data around those median lines, and then examine the resulting detrended data:

Figure 3: The detrended data.

It looks pretty good! There are a handful of odd downward spikes that are probably the result of the median line not capturing the discontinuous jumps as gracefully as possible, but these seem small enough not to worry about.

Estimating the CDFs

This step is pretty straightforward, at least thanks to the sROC package, which has nice functionality for kernel estimation of CDFs. Here's an example of what the probability density function (PDF) and CDF look like for one representative site, and where the Harvey floods fall in this distribution.

Figure 4: Estimated PDF and CDF for one site.

Okay, so now we can look up the Harvey flood in the CDFs for each site, which will tell us the proportion of the distribution that is less than the Harvey flood. Let one minus this proportion be p_{day}, the probability of a flood at least as great as Harvey on a particular day. Note that it's for a particular day, not a particular year, because our data are daily gage heights. Now we simply need to convert this to p_{year}, the probability for a particular year, and then take 1/p_{year} to obtain the N-year flood estimate. So how do we do the conversion?
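Before answering that, here's a minimal sketch of the p_{day} step just described. Rather than reproduce the sROC call, it hand-rolls a Gaussian-kernel CDF estimate; detrended comes from the earlier sketch, and harvey_height stands in for the detrended Harvey peak at this site.

    # Kernel estimate of the CDF: the average of Gaussian kernel CDFs centered
    # at each observation, evaluated at the point(s) x.
    kernel_cdf <- function(x, data, bw = bw.nrd0(data)) {
      sapply(x, function(xi) mean(pnorm((xi - data) / bw)))
    }

    heights <- na.omit(detrended)
    p_day <- 1 - kernel_cdf(harvey_height, heights)  # daily exceedance probability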

Modeling the temporal dependence

If the daily flood events were independent, then we could use

p_{year} = 1 - (1 - p_{day})^{365},

which is the probability of at least one "success" from a Binomial distribution with 365 trials. But the daily flood events aren't really independent: if there's a flood event on one day, that substantially raises the probability that there will still be a flood event on the next day, because it takes a few days for the water levels to recede below flood level. For example, here's what the data look like from the large floods in Houston last year, in summer 2016, again from our favorite representative gage site:

Figure 5: Gage data from the summer 2016 flooding near Houston. There are five N-year flood days in a row.

I drew a hypothetical flood height level to represent a particular N-year flood, just for illustrative purposes (I have no idea what flood level this line would actually correspond to). We can see that in this particular case, we had a string of five N-year flood days in a row. More generally, this should make it clear that we expect the daily flood events to be dependent, in that they tend to come in clusters. So to convert p_{day} to p_{year} for Harvey, we need a model of this dependence.

A simple model is that the daily flood events show Markov or one-day-back dependence: the probability of a flood event today depends on the presence of a flood event yesterday, but is conditionally independent of flooding two days ago given the flood status yesterday. For this model we add a parameter p_1: the conditional probability of a flood event, given that there was a flood event the previous day. Then the conversion formula from above becomes[4]

p_{year} = 1 - (1 - p_{day})\big[1 - p_{day}(1-p_1)/(1-p_{day})\big]^{364},

which reduces to the previous formula when p_1 = p_{day}, meaning the events are independent.
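Here's a small sketch of the two conversion formulas in R, mainly as a sanity check that the Markov version collapses to the independent one when p_1 = p_{day}; the example numbers are made up.

    p_year_indep <- function(p_day) 1 - (1 - p_day)^365

    p_year_markov <- function(p_day, p1) {
      1 - (1 - p_day) * (1 - p_day * (1 - p1) / (1 - p_day))^364
    }

    n_year <- function(p_year) 1 / p_year

    # With clustering (p1 > p_day), the same daily probability implies a rarer
    # yearly event, i.e. a larger N-year estimate.
    p_day_example <- 5e-5
    c(independent = n_year(p_year_indep(p_day_example)),
      markov      = n_year(p_year_markov(p_day_example, p1 = 0.75)))

    # Setting p1 = p_day recovers the independent formula
    all.equal(p_year_markov(p_day_example, p1 = p_day_example),
              p_year_indep(p_day_example))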

To get a feel for how this new p_1 parameter affects the model estimates, we can plot the N-year estimates from all 21 sites as a function of the assumed value of p_1 (with p_{day} set to the previously estimated values):

Figure 6: N-year flood estimates for each of the 21 sites, as a function of the dependence (p_1) assumption.

As you can see, the assumed value of p_1 doesn't have a huge impact until it gets up to around, say, 0.6 or so, but then it starts to matter quite a lot. To think about what plausible values of p_1 might be, it helps to notice that, according to our dependence model, flood durations follow a Geometric distribution with parameter 1-p_1, implying that the expected flood duration is 1/(1-p_1) days. So, for example, when p_1 = 0.75, we expect it to take an average of 4 days before water levels recede below the relevant N-year flood level (but not necessarily back to normal levels).

We've already developed a good estimate of p_{day} using the kernel CDFs. To estimate p_1, basically we can (a) choose a particular flood height, F; (b) categorize the gage heights at any given time point as being either above or below F, giving us H^*_t = 1 or H^*_t = 0, respectively; and (c) create a contingency table of H^*_t vs. H^*_{t-1} (the H^* value on the previous day), which gives us a table like:

Figure 7

from which we can compute an estimate of p_1 as d/(b+d). Now this estimate, of course, depends on the flood height F that we chose, and it's certainly possible that there could really be different values of p_1 for different flood heights. So here we plot the estimated values of p_1 for all 21 sites, for F heights ranging from the 90th percentile to whatever percentile corresponds to the Harvey floods at each site (in other words, 1-p_{day} for that site):

Figure 8: Dependence/clustering (p_1) estimates for each of the 21 sites, as a function of the flood height (F) in quantile units. The estimates are predictions from a weighted GAM fit to the simple estimates (in logit-logit space) that result from the procedure illustrated by Figure 7, with weights proportional to the number of data points below F. The right end-points of each curve are the flood quantiles at which Harvey is located for that site; note that these are equal to 1-p_{day}.

Immediately we see that there is considerable variation across gages in their level of dependence, and that the degree of dependence depends on the flood height (in an interesting, non-monotonic way). Because the p_1 estimates depend fairly strongly on the site and on the height of flooding, it wouldn't be a good idea to use an aggregate p_1 estimate that averages across all sites and/or all flood heights. Instead, for our N-year flood estimates we will use each site's individual p_1 estimate at the value of F corresponding to the Harvey flood height at that site.
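As a rough sketch of how one of these curves might be computed for a single site: the simple d/(b+d) estimate is calculated over a grid of flood quantiles and then smoothed with a weighted GAM on the logit-logit scale, as described in the Figure 8 caption. The helper names and grid resolution are illustrative, and detrended and p_day come from the earlier sketches.

    library(mgcv)

    # Simple lag-1 estimate of p_1 at flood height F: the proportion of flood
    # days that are immediately followed by another flood day, i.e. d / (b + d).
    estimate_p1 <- function(x, F) {
      h      <- as.integer(x >= F)
      h_prev <- c(NA, head(h, -1))
      sum(h == 1 & h_prev == 1, na.rm = TRUE) / sum(h_prev == 1, na.rm = TRUE)
    }

    heights   <- na.omit(detrended)
    quantiles <- seq(0.90, 1 - p_day, length.out = 50)  # 90th percentile up to Harvey
    F_grid    <- quantile(heights, quantiles)
    p1_raw    <- sapply(F_grid, estimate_p1, x = heights)
    w         <- sapply(F_grid, function(F) sum(heights < F))  # weights as in Figure 8

    # Weighted GAM on the logit-logit scale, dropping undefined/degenerate estimates
    dat <- data.frame(lq = qlogis(quantiles), lp1 = qlogis(p1_raw), w = w)
    dat <- dat[is.finite(dat$lp1), ]
    fit <- gam(lp1 ~ s(lq), weights = w, data = dat)

    # p_1 at the Harvey quantile for this site
    p1_hat <- plogis(predict(fit, newdata = data.frame(lq = qlogis(max(quantiles)))))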

We now have point estimates of p_{day} and p_1 for all sites, which means we could technically now compute our N-year flood estimates for each site. But to use only the point estimates would be to ignore the substantial uncertainty associated with these estimates! Furthermore, because we obtain the N-year estimates from a convex transformation of p_{day} and p_1, using only the point estimates would lead us to systematically underestimate the N-year flood levels (due to Jensen's inequality). So we will use bootstrapping to preserve the variability of the estimates.

Bootstrapping to account for uncertainty

Because our data are time series, a simple nonparametric bootstrap is not a good idea: the temporal structure of the data (such as the autocorrelation) would be destroyed in each bootstrap sample, which would ruin our p_1 estimates (although the p_{day} estimates would probably still be okay). So here we will use a more specialized bootstrapping method designed to work with time series data, called the stationary bootstrap. Then we will basically treat each bootstrap-resampled distribution as a poor man's posterior distribution and go from there. A rough sketch of the resampling step is given below; Figure 9 shows what we get when we run the stationary bootstrap on a single, representative site.
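(This sketch assumes the boot package's tsboot() with sim = "geom", which implements the stationary bootstrap with geometric block lengths; estimate_p_day() and estimate_p1_at_harvey() are hypothetical wrappers around the estimation steps sketched earlier, and the mean block length of 30 days is an arbitrary illustrative choice.)

    library(boot)

    boot_stat <- function(x) {
      c(p_day = estimate_p_day(x),          # kernel-CDF exceedance at the Harvey height
        p1    = estimate_p1_at_harvey(x))   # GAM-smoothed lag-1 dependence at that height
    }

    bs <- tsboot(na.omit(detrended), statistic = boot_stat,
                 R = 1000, l = 30, sim = "geom")
    head(bs$t)  # columns: bootstrap estimates of p_day and p_1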

Figure 9

…Huh. The bootstrap results are not totally crazy, but there are clearly a couple of issues. First and most obviously, the bootstrap estimates of p_{day} have come out sort of discretized for some reason (there are only six unique values), with a substantial fraction of the estimates equal to exactly zero (which makes them unusable). Do we really think the posterior distribution for p_{day} truly has this kind of shape? Well, no, probably not. So, if you'll indulge me in some absolutely ad-hoc adjustment, I'd like to apply a correction to "smooth out" the marginal distribution of the p_{day} estimates. You can think of this as applying a bit of regularization "by hand."

Now, I'm willing to trust that the mean and standard deviation of this distribution are more or less correct; it's only the shape that I don't trust. So I'll apply a two-stage[5] transformation that reshapes the empirical distribution into the Beta distribution with the same mean and variance. Note that we "reshape" the existing distribution of values, rather than simply taking new draws from the corresponding Beta distribution, so that we preserve the form of the dependence between the p_{day} and p_1 estimates. A sketch of this reshaping is below; Figure 10 gives a graphical comparison of the old and new marginal distributions.
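(In this sketch the bootstrap p_{day} estimates are assumed to sit in the first column of bs$t from the earlier sketch; the rank step follows footnote 5, while the method-of-moments Beta step is my own reconstruction of the idea.)

    # Reshape an empirical distribution into a Beta with the same mean and variance,
    # preserving the rank order of the values so the p_day/p_1 dependence is kept.
    # (Assumes the variance is small enough to give valid Beta parameters.)
    reshape_to_beta <- function(p) {
      m <- mean(p); v <- var(p)
      s <- m * (1 - m) / v - 1                                # method-of-moments "concentration"
      u <- rank(p, ties.method = "random") / (length(p) + 1)  # step 1: to Uniform(0, 1)
      qbeta(u, shape1 = m * s, shape2 = (1 - m) * s)          # step 2: to the matching Beta
    }

    p_day_boot_adj <- reshape_to_beta(bs$t[, 1])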

Figure 10

Now if we turn our attention to the bootstrap estimates of p_1, we see a less obvious but still somewhat strange issue, which is that the conditional p_1 distribution at p_{day}=0 has a weird bimodal shape; in particular, there is an odd cluster of estimates that are effectively 0. Closer inspection of the p_1 estimates as a function of the flood height F sheds some light on what's going on here:

Figure 11. Left panel: Dependence/clustering (p_1) estimates as a function of the flood height (F) in quantile units. Green line = estimates from the actual dataset, black/gray lines = estimates from 1000 bootstrap-resampled datasets. Right panel: Density of bootstrap estimates at the uppermost flood quantile (the quantile where the Harvey floods are). Blue/red lines = estimates from a Gaussian mixture model separating the "true" bootstrap distribution from contaminant/divergent estimates.

Recall from the Figure 8 caption that the p_1 estimates are predicted values from a weighted GAM fit (on the logit-logit scale) to the simple estimates resulting from the procedure illustrated in Figure 7, because the simple estimates are rough and unstable, particularly at extreme values of the flood height F. Basically what is happening here is that, for a nontrivial minority of the bootstrap-resampled datasets, the simple estimates become empirically undefined (i.e., they involve division by zero) above a certain value of F, because by chance the resampled dataset ended up with no data points near or beyond that value of F. So when that happens, the GAM simply does a linear extrapolation from the highest non-missing F to the F value corresponding to Harvey, and uses the extrapolated value as the bootstrap estimate. That works fine in a lot of cases, but in other cases it can yield wacky/divergent estimates, depending on where along the F sequence the simple estimates become missing.

It seems like a good idea to discard the divergent bootstrap estimates. One way to do this is to fit a mixture model to the bootstrap estimates, use the model to classify each estimate as either probably coming from the "true" bootstrap distribution or probably coming from a (much wider) contaminant distribution, and then discard the estimates classified as probable contaminants. That procedure is illustrated in the right panel of Figure 11, which shows that it works pretty well, at least in the sense that the mixture model seems pretty plausible.
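As a sketch of that screening step, here's one way it could look with a two-component Gaussian mixture via the mclust package; whether the original analysis used mclust, and the rule for picking the "true" component, are assumptions on my part (p1_hat is the full-data estimate from the earlier sketch).

    library(mclust)

    p1_boot <- bs$t[, 2]              # bootstrap p_1 estimates (second column of bs$t)
    fit_mix <- Mclust(p1_boot, G = 2) # two components: "true" vs. contaminant

    # Keep the component whose mean is closest to the full-data estimate,
    # and discard estimates classified into the other (contaminant) component.
    keep <- which.min(abs(fit_mix$parameters$mean - p1_hat))
    p1_boot_clean <- p1_boot[fit_mix$classification == keep]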

After applying this correction, along with the previously described adjustment to the p_{day} estimates, we end up with an adjusted joint distribution of bootstrap estimates that looks like this:

Figure 12. Compare to Figure 9.

So if we compute the adjusted bootstrap distributions for all 21 sites, then after much ado, that gives us the following final estimates for each site:

Figure 13. Final bootstrap distributions for all sites. Lines = medians, shaded regions = 50% highest posterior density (HPD) intervals.

Conclusion and limitations

As you can see, most of the N-year flood estimates have a lot of uncertainty around them. Estimating the risk of extremely rare events is an inherently hard statistical problem, so it's not surprising that it's hard to pin down precise estimates.

Of course, the method I've applied here is just one way to approach the problem. Basically it's the first thing I thought of. I initially intended this to be a rough-and-ready analysis to satisfy my curiosity, but after following the rabbit hole of Markov dependence and bootstrapping complications a bit farther than I anticipated, I'm not so sure the "rough-and-ready" description is still apt.

There is a temptation to try to aggregate in some way the individual N-year flood estimates at each site, in an attempt to estimate the N-year flood level of the Harvey floods as a single, atomic event. After all, media reports of the risk/severity of the Harvey floods simply present a single figure like "500 year flood" or "1000 year flood," not broken down by site. This kind of aggregation is problematic in at least a couple of ways. First, on a more conceptual level, it's not clear which sites should be aggregated over. A simple approach might be to aggregate over all the sites in Harris County, which is sort of what I did (I mean, I avoided aggregation, but I did only analyze the sites in Harris County). According to my girlfriend (who, unlike me, really does know something about hydrology), what would probably make more sense would be to aggregate over all the sites within a particular watershed, ignoring geologically arbitrary county lines. Second, on a more methodological level, the statistical problem of estimating the probability of a rare event across the entire constellation of 21 sites is quite formidable. For example, a simple extension of my method would involve first estimating a 21-dimensional joint density of the flood heights at all gage sites… the thought of which is enough to make one cringe. (I actually spent quite a while trying to get this to work, but ultimately just didn't trust the answers I was getting, so I decided not to post any of that here.)

The method I used here does not take into account the fairly obvious possibility that the yearly risk of Harvey-level flood events may not be constant, but instead may be increasing over time. Including such time-dependence in the model would make this already difficult statistical problem even more challenging. This increasing-risk hypothesis is certainly plausible, but it is quite complicated to think about because it depends on a lot of factors, such as the rate and nature of current/future urban development in the Houston area, preparatory measures that may be taken up by Houston, the possibly increasing frequency of large hurricanes due to climate change, etc. Regarding this last point, it seems that the evidence for an increasing frequency of Category 4/5 hurricanes is rather mixed.

Finally, there is some reason to think that almost any method applied immediately in the wake of Hurricane Harvey will lead to some degree of overestimation of the probability of similar events in the future (or, equivalently, N-year flood estimates that are too small). This point is actually quite subtle, but bear with me. Hypothetically, across many possible timelines, if we always waited for a particular event (such as a flood of a certain size) to occur and then estimated the probability of that event immediately after it occurred, our probability estimates would end up inflated to some extent (on average across the timelines) due to a form of selection bias that seems related to the recently renewed discussion of hot hand effects. For a simplified model of how this selection bias would work, consider a situation where we are observing a sequence of tosses of an unfair coin that comes up heads (H) with probability h, and we construct an estimate \hat{h} by waiting until we observe the first H and then computing the proportion of H outcomes in the sequence. Then

\text{E}[\hat{h}] = \sum_{k=1}^{\infty}G(k,h)/k,

where G() is the probability mass function of a Geometric distribution. On average, this exceeds h when h is less than about 0.35 or so, and by quite a lot when H is rare. For example, if h = .01, then \text{E}[\hat{h}] \approx .046, so the estimates inflate the probability of heads by about a factor of 5 on average. This is not exactly what's going on in our Harvey analysis, but it's quite similar. It's hard to know what to do (if anything) about this kind of selection bias, but it's interesting to think about.
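For anyone who wants to check that last number, here's a quick sketch that evaluates the sum numerically; dgeom(k - 1, h) is the probability that the first heads occurs on toss k.

    expected_h_hat <- function(h, kmax = 1e6) {
      k <- 1:kmax
      sum(dgeom(k - 1, prob = h) / k)  # E[h_hat] = sum over k of G(k, h) / k
    }

    expected_h_hat(0.01)  # approximately 0.046, about 5x the true h of 0.01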

Footnotes

[1] More specifically, that floods follow a Bernoulli process over years, or a Poisson process in continuous time.

[2] The USGS consistently uses the spelling "gage" instead of the more familiar (to me) "gauge." Your guess as to why is as good as mine. Anyway, I'll follow the USGS's spelling convention.

[3] Note that I use the daily means here, even though arguably it might make more sense to use the daily maximums. But unfortunately the daily maximums are only available for 3 of the gages in Harris County.

[4] See the R Markdown document for the derivation of this.

[5] The first step of the transformation (transforming the observed distribution to a Uniform distribution) is accomplished in this case by taking X' = \text{rank}(X)/(n+1), where n is the number of X values, and breaking ties randomly so that the six unique values get a smooth distribution.


Source: http://jakewestfall.org/blog/index.php/2017/09/25/what-is-the-yearly-risk-of-another-harvey-level-flood-in-houston/
