
- Proving life exists on Earth
- Combinatorics, just beyond the basics
- 10 best rational approximations for pi
- Fixed points of logistic function
- Line art
- Causal inference and cryptic syntax
- Making a career out of the chain rule
- Robustness and tests for equal variance
- Ellipsoid geometry and Haumea
- Two-sample t-test and robustness
- Spectral sparsification
- Reciprocals of primes
- Rise and fall of the Windows Empire
- Robust statistics
- Optimal low-rank matrix approximation
- Least squares solutions to over- or underdetermined systems
- Computing SVD and pseudoinverse
- Probit regression
- Sensitivity of logistic regression prediction on coefficients
- Tridiagonal systems, determinants, and natural cubic splines
- Probability of coprime sets
- The quadratic formula and low-precision arithmetic
- Off by one character
- New Twitter account: BasicStatistics
- Review of Matrix Mathematics
- Moore-Penrose pseudoinverse is not an adjoint
- It’s like this other thing except …
- Obesity index: Measuring the fatness of probability distribution tails
- Duffing equation for nonlinear oscillator
- Surface area of an egg

NASA'S GALILEO MISSION was primarily designed to explore Jupiter and its moons. In 1989, the Galileo probe started out traveling away from Jupiter in order to do a gravity assist swing around Venus. About a year later it also did a gravity assist maneuver around Earth. Carl Sagan suggested that when passing Earth, the Galileo probe should turn its sensors on Earth to look for signs of life. [1]
Now obviously WE KNOW THERE'S LIFE ON EARTH. But if we're going to look for life on other planets, it's reasonable to ask that our methods return positive results when examining the one planet we know for sure does host life. So scientists looked at the data from Galileo as if it were coming from another planet to see what patterns in the data might indicate life.
I've started using looking for life on Earth as a METAPHOR. I'm working on a project right now where I'm looking for a needle in a haystack, or rather _another_ needle in a haystack: I knew that one needle existed before I got started. So I want to make sure that my search procedure at least finds the one positive result I already know exists. I explained to a colleague that we need to make sure we can at least find life on Earth.
This reminds me of SIMULATION WORK. You make up a scenario and treat it as the state of nature. Then pretend you don't know that state, and see how successful your method is at discovering that state. It's sort of a schizophrenic way of thinking, pretending that half of your brain doesn't know what the other half is doing.
It also reminds me of SOFTWARE TESTING. The most trivial tests can be surprisingly effective at finding bugs. So you write a few tests to confirm that there's life on Earth.
RELATED POSTS
* Small course corrections
* Key fobs and interstellar space
* Low rank matrix approximation
[1] I found out about Galileo's Earth reconnaissance listening to the latest episode of the .NET Rocks! podcast.

Most basic combinatorial problems can be solved in terms of multiplication, permutations, and combinations. The next step beyond the basics, in my experience, is counting selections with replacement. Often when I run into a problem that is not quite transparent, it boils down to this.
EXAMPLES OF SELECTION WITH REPLACEMENT
Here are three problems that reduce to counting selections with replacement.
LOOKING AHEAD IN AN EXPERIMENT
For example, suppose you're running an experiment in which you randomize to _n_ different treatments and you want to know how many ways the next _k_ subjects can be assigned to treatments. So if you had treatments _A_, _B_, and _C_, and five subjects, you could assign all five to _A_, or four to _A_ and one to _B_, etc., for a total of 21 possibilities. You're choosing 5 things from a set of 3, with replacement because you can (and will) assign the same treatment more than once.
PARTIAL DERIVATIVES
For another example, if you're taking the _k_th order partial derivatives of a function of _n_ variables, you're choosing _k_ things (variables to differentiate with respect to) from a set of _n_ (the variables). Equality of mixed partials for smooth functions says that all that matters is how many times you differentiate with respect to each variable. Order doesn't matter: differentiating with respect to _x_ and then with respect to _y_ gives the same result as taking the derivatives in the opposite order, as long as the function you're differentiating has enough derivatives. I wrote about this example here.
SUMS OVER EVEN TERMS
I recently had an expression come up that required summing over _n_ distinct variables, each with a non-negative even value, with the values summing to 2_k_. How many ways can that be done? As many as dividing all the variable values in half and asking that they sum to _k_. Here the thing being chosen is a variable, and since the halved values sum to _k_, I have a total of _k_ choices to make, with replacement since I can choose a variable more than once. So again I'm choosing _k_ things with replacement from a set of size _n_.
FORMULA
I wrote up a set of notes on sampling with replacement that I won't duplicate here, but in a nutshell:
The symbol on the left is Richard Stanley's suggested notation for the number of ways to select _k_ things with replacement from a set of size _n_. It turns out that this equals the binomial coefficient on the right side, choose(_n_ + _k_ - 1, _k_). The derivation isn't too long, but it's not completely obvious either. See the aforementioned notes for details.
I started by saying basic combinatorial problems can be reduced to multiplication, permutations, and combinations. Sampling with replacement can be reduced to a combination, the right side of the equation above, but with non-obvious arguments. Hence the need to introduce a new symbol, the one on the left, that maps directly to problem statements.
ENUMERATING POSSIBILITIES
Sometimes you just need to count how many ways one can select _k_ things with replacement from a set of size _n_. But often you need to actually enumerate the possibilities, say to loop over them in software.
In conducting a clinical trial, you may want to enumerate all the ways pending data could turn out. If you're going to act the same way, no matter how the pending data work out, there's no need to wait for the missing data before proceeding. This is something I did many times when I worked at MD Anderson Cancer Center.
When you evaluate a multivariate Taylor series, you need to carry out a sum that has a term corresponding to each partial derivative. You could naively sum over all possible derivatives, not taking into account equality of mixed partials, but that could be very inefficient, as described here.
The example above summing over even partitions of an even number also comes out of a problem with multivariate Taylor series, one in which odd terms drop out by symmetry.
To find algorithms for enumerating selections with replacement, see Knuth's TAOCP, volume 4, Fascicle 3.
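For playing with small cases, Python's standard library can both enumerate and count selections with replacement. Here's a minimal sketch (my own, using the clinical trial example above):

```python
from itertools import combinations_with_replacement
from math import comb

# Enumerate all ways to choose k = 5 treatments with replacement
# from the set {A, B, C}, as in the clinical trial example.
selections = list(combinations_with_replacement("ABC", 5))
print(len(selections))  # 21

# The count matches the binomial coefficient C(n + k - 1, k).
assert len(selections) == comb(3 + 5 - 1, 5)
```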
RELATED POSTS
* Richard Stanley's 12-fold way (pdf)
* Big derivatives
* Irrelevant uncertainty

It's easy to create rational approximations for π. Every time you write down π to a few decimal places, that's a rational approximation. For example, 3.14 = 314/100. But that's not the _best_ approximation.
Think of the denominator of your fraction as something you have to buy. If you have enough budget to buy a three-digit denominator, then you're better off buying 355/113 rather than 314/100 because the former is more accurate. The approximation 355/113 is accurate to 6 decimal places, whereas 314/100 is only accurate to 2 decimal places.
There's a way to find the most economical rational approximations to π, or any other irrational number, using continued fractions. If you're interested in details, see the links below.
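To give a feel for how this works, here's a small Python sketch (my own, not from the links) that generates convergents via the standard continued fraction recurrence. Floating-point arithmetic limits how many terms you can trust.

```python
from fractions import Fraction
from math import pi

# Sketch: generate continued fraction convergents of pi using the
# recurrence p_n = a_n p_{n-1} + p_{n-2}, q_n = a_n q_{n-1} + q_{n-2}.
def convergents(x, n):
    p_prev, p_prev2 = 1, 0
    q_prev, q_prev2 = 0, 1
    result = []
    for _ in range(n):
        a = int(x)
        p_prev, p_prev2 = a*p_prev + p_prev2, p_prev
        q_prev, q_prev2 = a*q_prev + q_prev2, q_prev
        result.append(Fraction(p_prev, q_prev))
        x = 1/(x - a)
    return result

for f in convergents(pi, 4):
    print(f)  # 3, 22/7, 333/106, 355/113
```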
Here are the 10 best rational approximations to π, found via continued fractions.
|----------------+----------|
| Fraction       | Decimals |
|----------------+----------|
| 3              |      0.8 |
| 22/7           |      2.9 |
| 333/106        |      4.1 |
| 355/113        |      6.6 |
| 103993/33102   |      9.2 |
| 104348/33215   |      9.5 |
| 208341/66317   |      9.9 |
| 312689/99532   |     10.5 |
| 833719/265381  |     11.1 |
| 1146408/364913 |     11.8 |
|----------------+----------|
If you only want to know the number of correct decimal places, ignore the fractional parts of the numbers in the Decimals column above. For example, 22/7 gives two correct decimal places. But it almost gives three. (Technically, the Decimals column gives the -log_10 of the absolute error.)

In case you're curious, here's a plot of the absolute errors on a log scale.

RELATED POSTS

* Rational approximations to _e_
* Five posts on computing pi
* Continued fractions in Sage
Here's an interesting problem that came out of a logistic regression application. The input variable was between 0 and 1, and someone asked when and where the logistic transformation
_f_(_x_) = 1/(1 + exp(_a_ + _bx_))
has a FIXED POINT, i.e. _f_(_x_) = _x_.
So given logistic regression parameters _a_ and _b_, when does the logistic curve given by _y_ = _f_(_x_) cross the line _y_ = _x_? Do they always cross? Can they cross twice?
There's always at least one solution. Because _f_(_x_) is strictly between 0 and 1, the function
_g_(_x_) = _f_(_x_) - _x_
is positive at 0 and negative at 1, and by the intermediate value theorem _g_(_x_) must be zero for some _x_ between 0 and 1.
Sometimes _f_ has only one fixed point. It may have two or three fixed points, as demonstrated in the graph below. The case of two fixed points is unstable: the logistic curve is tangent to the line _y_ = _x_ at one point, and a tiny change would turn this tangent point into either no crossing or two crossings.
If |_b_| < 1, then you can show that the function _f_ is a CONTRACTION MAP on [0, 1]. In that case there is a unique solution to _f_(_x_) = _x_, and you can find it by starting with an arbitrary value for _x_ and repeatedly applying _f_ to it. For example, if _a_= 1 and _b_ = 0.8 and we start with _x_= 0, after applying _f_ ten times we get _x_ = _f_(_x_) = 0.233790157.
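Here's a quick sketch of that iteration in Python, using the same _a_ and _b_ as above:

```python
from math import exp

# Fixed-point iteration for f(x) = 1/(1 + exp(a + b*x)),
# with a = 1, b = 0.8 as in the example above.
def f(x, a=1.0, b=0.8):
    return 1/(1 + exp(a + b*x))

x = 0.0
for _ in range(10):
    x = f(x)
print(x)  # about 0.23379, a fixed point: f(x) = x
```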
There are a couple questions left to finish this up. How can you tell from _a_ and _b_ how many fixed points there will be? The condition |_b_| < 1 is sufficient for _f_ to be a contraction map on [0, 1]. Can you find necessary and sufficient conditions?
RELATED POST: Sensitivity of logistic regression prediction

A new video from 3Blue1Brown is about visualizing derivatives as stretching and shrinking factors. Along the way they consider the function _f_(_x_) = 1 + 1/_x_.
Iterations of _f_ converge on the golden ratio, no matter where you start (with one exception). The video creates a graph where they connect values of _x_ on one line to values of _f_(_x_) on another line. Curiously, there's an oval that emerges where no lines cross.
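Here's a quick numerical check of that convergence (my sketch, not from the video):

```python
# Iterate f(x) = 1 + 1/x; the iterates converge to the golden ratio
# from almost any starting point.
phi = (1 + 5**0.5)/2

x = 2.0
for _ in range(30):
    x = 1 + 1/x
print(x, phi)  # both approximately 1.618
```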
Here's a little Python I wrote to play with this:
import matplotlib.pyplot as plt
from numpy import linspace

N = 70
x = linspace(-3, 3, N)
y = 1 + 1/x

for i in range(N):
    plt.plot([x[i], y[i]], [0, 1])

plt.xlim(-5, 5)
And here's the output:
In the plot above I just used matplotlib's default color sequence for each line. In the plot below I used fewer lines (N = 50) and made a couple changes to the plot command: specifying the color for each line and putting the _x_ line above the _y_ line.
plt.plot([x[i], y[i]], [0, 1], c="#243f6a")
If you play around with the Python code, you probably want to keep N even. This prevents the _x_ array from containing zero.
UPDATE: Here's a variation that extends the lines connecting (_x_, 0) and (_y_, 1). And I make a few other changes while I'm at it.
N = 200
x = linspace(-10, 10, N)
y = 1 + 1/x
z = 2*y - x

for i in range(N):
    plt.plot([x[i], z[i]], [0, 2], c="#243f6a")

plt.xlim(-10, 10)

When I was a teenager, my uncle gave me a calculus book and told me that mastering calculus was the most important thing I could do for starting out in math. So I learned the basics of calculus from that book. Later I read Michael Spivak's two calculus books. I took courses that built on calculus, and studied generalizations of calculus such as calculus with complex variables, calculus in Banach spaces, etc. I taught calculus. After a while, I started to feel maybe I'd mastered calculus.
Last year I started digging into automatic differentiation and questioned whether I really had mastered calculus. At a high level, automatic differentiation is "just" an application of the chain rule in many variables. But you can make a career out of exploring automatic differentiation (AD), and many people do. The basics of AD are not that complicated, but you can go down a deep rabbit hole exploring optimal ways to implement AD in various contexts.
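To give a flavor of what "just the chain rule" looks like in code, here's a toy forward-mode AD sketch using dual numbers; real AD systems are far more sophisticated than this.

```python
# Minimal forward-mode automatic differentiation: each value carries
# (value, derivative) and the chain rule propagates through arithmetic.
class Dual:
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val*other.val,
                    self.der*other.val + self.val*other.der)
    __rmul__ = __mul__

# d/dx (x*x + 3*x) at x = 2 is 2x + 3 = 7
x = Dual(2.0, 1.0)
y = x*x + 3*x
print(y.val, y.der)  # 10.0 7.0
```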
You can make a career out of things that seem even simpler than AD. Thousands of people have made a career out of solving the equation _Ax_ = _b_ where _A_ is a large matrix and the vector _b_ is given. In high school you solve two equations in two unknowns, then three equations in three unknowns, and in principle you could keep going. But you don't solve a sparse system of millions of equations the same way. When you consider efficiency, accuracy, limitations of computer arithmetic, parallel computing, etc. the humble equation _Ax_ = _b_ is not so simple to solve.
As Richard Feynman said, nearly everything is really interesting if you go into it deeply enough.
RELATED POSTS:
* Curvature and automatic differentiation
* Iterative linear solvers
* Ten surprises from numerical linear algebra
* Solving systems of polynomial equations

The two-sample _t_-test is a way to test whether two data sets come from distributions with the same mean. I wrote a few days ago about how the test performs under ideal circumstances, as well as less than ideal circumstances.
This is an analogous post for testing whether two data sets come from distributions with the same variance. Statistics textbooks often present the _F_-test for this task, then warn in a footnote that the test is highly dependent on the assumption that both data sets come from normal distributions.
SENSITIVITY AND ROBUSTNESS
Statistics texts give too little attention to robustness in my opinion. Modeling assumptions never hold exactly, so it's important to know how procedures perform when assumptions don't hold exactly. Since the _F_-test is one of the rare instances where textbooks warn about a lack of robustness, I expected the _F_-test to perform terribly under simulation, relative to its recommended alternatives BARTLETT'S TEST and LEVENE'S TEST. That's not exactly what I found.
SIMULATION DESIGN
For my simulations I selected 35 samples from each of two distributions. I selected significance levels for the _F_-test, Bartlett's test, and Levene's test so that each would have roughly a 5% error rate under a null scenario, both sets of data coming from the same distribution, and a 20% error rate under an alternative scenario.
I chose my initial null and alternative scenarios to use normal (Gaussian) distributions, i.e. to satisfy the assumptions of the _F_-test. Then I used the same designs for data coming from a heavy-tailed distribution to see how well each of the tests performed.
For the normal null scenario, both data sets were drawn from a normal distribution with mean 0 and standard deviation 15. For the normal alternative scenario I used normal distributions with standard deviations 15 and 25.
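Here's a sketch of the kind of simulation this involves, using scipy's implementations of Bartlett's and Levene's tests and computing the two-sided _F_-test p-value directly. The alpha levels are the calibrated values reported below; the scenario is the normal alternative, standard deviations 15 and 25.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def f_test_pvalue(x, y):
    # Two-sided F-test for equal variance, computed directly.
    F = np.var(x, ddof=1)/np.var(y, ddof=1)
    df1, df2 = len(x) - 1, len(y) - 1
    return 2*min(stats.f.cdf(F, df1, df2), stats.f.sf(F, df1, df2))

n, trials = 35, 1000
alpha = {"F": 0.13, "Bartlett": 0.04, "Levene": 0.06}
reject = {name: 0 for name in alpha}

for _ in range(trials):
    x = rng.normal(0, 15, n)  # alternative scenario: sd 15 vs sd 25
    y = rng.normal(0, 25, n)
    if f_test_pvalue(x, y) < alpha["F"]:
        reject["F"] += 1
    if stats.bartlett(x, y).pvalue < alpha["Bartlett"]:
        reject["Bartlett"] += 1
    if stats.levene(x, y).pvalue < alpha["Levene"]:
        reject["Levene"] += 1

print(reject)  # each test rejects roughly 75-80% of the time
```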
NORMAL DISTRIBUTION CALIBRATION
Here are the results from the normal distribution simulations.
|----------+-------+--------+---------|
| Test     | Alpha | Type I | Type II |
|----------+-------+--------+---------|
| F        |  0.13 | 0.0390 |  0.1863 |
| Bartlett |  0.04 | 0.0396 |  0.1906 |
| Levene   |  0.06 | 0.0439 |  0.2607 |
|----------+-------+--------+---------|
Here the Type I column is the proportion of times the test incorrectly concluded that identical distributions had unequal variances. The Type II column reports the proportion of times the test failed to conclude that distributions with different variances indeed had unequal variances. Results were based on simulating 10,000 experiments.
The three tests had roughly equal operating characteristics. The only difference that stands out above simulation noise is that the Levene test had larger Type II error than the other tests when calibrated to have the same Type I error.
To calibrate the operating characteristics, I used alpha levels 0.13, 0.04, and 0.06 respectively for the F, Bartlett, and Levene tests.
HEAVY-TAIL SIMULATION RESULTS
Next I used the design parameters above, i.e. the alpha levels for each test, but drew data from distributions with a heavier tail. For the null scenario, both data sets were drawn from a Student _t_ distribution with 4 degrees of freedom and scale 15. For the alternative scenario, the scale of one of the distributions was increased to 25. Here are the results, again based on 10,000 simulations.
|----------+-------+--------+---------|
| Test     | Alpha | Type I | Type II |
|----------+-------+--------+---------|
| F        |  0.13 | 0.2417 |  0.2852 |
| Bartlett |  0.04 | 0.2165 |  0.2859 |
| Levene   |  0.06 | 0.0448 |  0.4537 |
|----------+-------+--------+---------|
The operating characteristics degraded when drawing samples from a heavy-tailed distribution, _t_ with 4 degrees of freedom, but they didn't degrade uniformly.
Compared to the _F_-test, the Bartlett test had slightly better Type I error and the same Type II error.
The Levene test had a much lower Type I error than the other tests, hardly higher than it was when drawing from a normal distribution, but it had a higher Type II error.
CONCLUSION: The _F_-test is indeed sensitive to departures from the Gaussian assumption, but Bartlett's test doesn't seem much better in these particular scenarios. Levene's test, however, does perform better than the _F_-test, depending on the relative importance you place on Type I and Type II error.
RELATED POSTS
* Two-sample _t_-test and robustness
* Statistical consulting

To a first approximation, Earth is a sphere. A more accurate description is that the earth is an oblate spheroid, the polar axis being a little shorter than the equatorial diameter. See details here. Other planets are oblate spheroids as well. Jupiter is more oblate than Earth.
The general equation for the surface of an ellipsoid is

x²/a² + y²/b² + z²/c² = 1
An ellipsoid has three semi-axes: _a_, _b_, and _c_. For a sphere, all three are equal to the radius. For a spheroid, two are equal. In an oblate spheroid, like Earth and Jupiter, _a_ = _b_ > _c_. For a prolate spheroid like Saturn's moon Enceladus, _a_ = _b_ < _c_.
The dwarf planet Haumea is far from spherical. It's not even a spheroid. It's a general (tri-axial) ellipsoid, with _a_, _b_, and _c_ being quite distinct. Its three semi-axes are approximately 1000, 750, and 500 kilometers. The dimensions are not known very accurately. The image above is an artist's conception of Haumea and its ring, via Astronomy Picture of the Day.
This post explains how to compute the volume and surface area of an ellipsoid. See that post for details. Here we'll just copy the Python code from that post.
from math import sin, cos, acos, pi
from scipy.special import ellipkinc, ellipeinc

def volume(a, b, c):
    return 4*pi*a*b*c/3

def area(a, b, c):
    phi = acos(c/a)
    # scipy's incomplete elliptic integrals take the parameter m = k**2
    m = a**2*(b**2 - c**2)/(b**2*(a**2 - c**2))
    E = ellipeinc(phi, m)
    F = ellipkinc(phi, m)
    elliptic = E*sin(phi)**2 + F*cos(phi)**2
    return 2.0*pi*c**2 + 2*pi*a*b*elliptic/sin(phi)

a, b, c = 1000, 750, 500
print(volume(a, b, c))
print(area(a, b, c))
RELATED POST: Planets evenly spaced on a log scale

A TWO-SAMPLE _T_-TEST is intended to determine whether there's evidence that two samples have come from distributions with different means. The test assumes that both samples come from normal distributions.
ROBUST TO NON-NORMALITY, NOT TO ASYMMETRY
It is fairly well known that the _t_-test is robust to departures from a normal distribution, as long as the actual distribution is symmetric. That is, the test works more or less as advertised as long as the distribution is symmetric like a normal distribution, but it may not work as expected if the distribution is asymmetric.
This post will explore the robustness of the _t_-test via simulation. How far can you be from a normal distribution and still do OK? Can you have _any_ distribution as long as it's symmetric? Does a little asymmetry ruin everything? If something does go wrong, _how_ does it go wrong?
EXPERIMENT DESIGN
For the purposes of this post, we'll compare the null hypothesis that two groups both have mean 100 to the alternative hypothesis that one group has mean 100 and the other has mean 110. We'll assume both distributions have a standard deviation of 15. There is a variation on the two-sample _t_-test that does not assume equal variance, but for simplicity we will keep the variance the same in both groups.
We'll do a typical design, 0.05 significance and 0.80 power. That is, under the null hypothesis, when the two groups do come from the same distribution, we expect the test to wrongly conclude that the two distributions are different 5% of the time. And under the alternative hypothesis, when the groups come from distributions with means 100 and 110, we expect the test to conclude that the two means are indeed different 80% of the time.
Under these assumptions, you'd need a sample of 36 in each group.
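That sample size can be sanity-checked with the standard normal-approximation formula, n = 2(z_{1-α/2} + z_power)²(σ/δ)² per group. Here's a quick sketch (mine, not from the original analysis):

```python
from scipy.stats import norm

# Normal-approximation sample size per group for a two-sample test:
# n = 2 * (z_{1-alpha/2} + z_{power})^2 * (sigma/delta)^2
z_alpha = norm.ppf(0.975)  # two-sided alpha = 0.05
z_power = norm.ppf(0.80)
n = 2*(z_alpha + z_power)**2 * (15/10)**2
print(n)  # about 35.3; rounding up with a small-sample t correction gives 36
```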
PYTHON SIMULATION CODE
Here's the code we'll use for our simulation.
from scipy.stats import norm, t, gamma, uniform, ttest_ind

num_per_group = 36

def simulate_trials(num_trials, group1, group2):
    num_reject = 0
    for _ in range(num_trials):
        a = group1.rvs(num_per_group)
        b = group2.rvs(num_per_group)
        stat, pvalue = ttest_ind(a, b)
        if pvalue < 0.05:
            num_reject += 1
    return num_reject
NORMAL SIMULATION
Let's see how the two-sample _t_-test works under ideal conditions by simulating from the normal distributions that the method assumes.
First we simulate from the null, i.e. we draw the data for both groups from the same distribution.
n1 = norm(100, 15)
n2 = norm(100, 15)
print( simulate_trials(1000, n1, n2) )
Out of 1000 experiment simulations, the method rejected the null 54 times, very close to the expected number of 50 based on 0.05 significance.
Next we simulate under the alternative, i.e. we draw data from different distributions.
n1 = norm(100, 15)
n2 = norm(110, 15)
print( simulate_trials(1000, n1, n2) )
This time the method rejected the null 804 times in 1000 simulations, very close to the expected 80% based on the power.
GAMMA SIMULATION
Next we draw data from a gamma distribution. First we draw both groups from a gamma distribution with mean 100 and standard deviation 15. This requires a shape parameter of 44.44 and a scale of 2.25.
g1 = gamma(a = 44.44, scale=2.25)
g2 = gamma(a = 44.44, scale=2.25)
print( simulate_trials(1000, g1, g2) )
Under the null simulation, we rejected the null 64 times out of 1000 simulations, higher than expected, but not that remarkable.
Under the alternative simulation, we pick the second gamma distribution to have mean 110 and standard deviation 15, which requires shape 53.77 and scale 2.045.
g1 = gamma(a = 44.44, scale=2.25)
g2 = gamma(a = 53.77, scale=2.045)
print( simulate_trials(1000, g1, g2) )
We rejected the null 805 times in 1000 simulations, in line with what you'd expect from the power.
When the gamma distribution has such large shape parameters, the distributions are approximately normal, though slightly asymmetric. To increase the asymmetry, we'll use a couple gamma distributions with smaller mean but shifted over to the right. Under the null, we create two gamma distributions with mean 10 and standard deviation 15, then shift them to the right by 90.
g1 = gamma(a = 0.444, scale=22.5, loc=90)
g2 = gamma(a = 0.444, scale=22.5, loc=90)
print( simulate_trials(1000, g1, g2) )
Under this simulation we reject the null 56 times out of 1000 simulations, in line with what you'd expect.
For the alternative simulation, we pick the second distribution to have a mean of 20 and standard deviation 15, and then shift it to the right by 90 so that it has mean 110. This distribution has quite a long tail.
g1 = gamma(a = 0.444, scale=22.5, loc=90)
g2 = gamma(a = 1.78, scale=11.25, loc=90)
print( simulate_trials(1000, g1, g2) )
This time we rejected the null 499 times in 1000 simulations. This is a serious departure from what's expected. Under the alternative hypothesis, we reject the null 50% of the time rather than the 80% we'd expect. If higher mean is better, this means that half the time you'd fail to conclude that the better group really is better.
UNIFORM DISTRIBUTION SIMULATION
Next we use uniform random samples from the interval [74, 126]. This is a symmetric distribution with mean 100 and standard deviation 15. When we draw both groups from this distribution we rejected the null 48 times, in line with what we'd expect.
u1 = uniform(loc=74, scale=52)
u2 = uniform(loc=74, scale=52)
print( simulate_trials(1000, u1, u2) )
If we move the second distribution over by 10, drawing from [84, 136], we rejected the null 790 times out of 1000, again in line with what you'd get from a normal distribution.
u1 = uniform(loc=74, scale=52)
u2 = uniform(loc=84, scale=52)
print( simulate_trials(1000, u1, u2) )
In this case, we've made a big departure from normality and the test still worked as expected. But that's not always the case, as in the _t_(3) distribution below.
STUDENT-_T_ SIMULATION
Finally we simulate from a Student-_t_ distribution. This is a symmetric distribution, but heavier in the tails than the normal distribution.
First, we simulate from a _t_ distribution with 6 degrees of freedom and scale 12.25, making the standard deviation 15. We shift the location of the distribution by 100 to make the mean 100.
t1 = t(df=6, loc=100, scale=12.25)
t2 = t(df=6, loc=100, scale=12.25)
print( simulate_trials(1000, t1, t2) )
When both groups come from this distribution, we rejected the null 46 times. When we shifted the second distribution to have mean 110, we rejected the null 777 times out of 1000.
t1 = t(df=6, loc=100, scale=12.25)
t2 = t(df=6, loc=110, scale=12.25)
print( simulate_trials(1000, t1, t2) )
In both cases, the results are in line with what we'd expect given a 5% significance level and 80% power.
A _t_ distribution with 6 degrees of freedom has a heavy tail compared to a normal. Let's try making the tail even heavier by using 3 degrees of freedom, keeping the scale at 15. (A _t_(3) distribution has variance 3, so with scale 15 the standard deviation is actually 15√3 ≈ 26 rather than 15.)
t1 = t(df=3, loc=100, scale=15)
t2 = t(df=3, loc=100, scale=15)
print( simulate_trials(1000, t1, t2) )
When we draw both samples from the same distribution, we rejected the null 37 times out of 1000, less than the 50 times we'd expect.
t1 = t(df=3, loc=100, scale=15)
t2 = t(df=3, loc=110, scale=15)
print( simulate_trials(1000, t1, t2) )
When we shift the second distribution over to have mean 110, we rejected the null 463 times out of 1000, far less than the 80% rejection you'd expect if the data were normally distributed.
Just looking at a plot of the PDFs, a _t_(3) looks much closer to a normal distribution than a uniform distribution does. But the tail behavior is different. The tails of the uniform are as thin as you can get--they're zero!--while the _t_(3) has heavy tails.
These two examples show that you can replace the normal distribution with a moderately heavy-tailed symmetric distribution, but don't overdo it. When the data come from a heavy-tailed distribution, even one that is symmetric, the two-sample _t_-test may not have the operating characteristics you'd expect.

The latest episode of My Favorite Theorem features John Urschel, former offensive lineman for the Baltimore Ravens and current math graduate student. His favorite theorem is a result on graph approximation: for every weighted graph, no matter how densely connected, it is possible to find a sparse graph whose Laplacian approximates that of the original graph. For details see Spectral Sparsification of Graphs by Dan Spielman and Shang-Hua Teng.
You could view this theorem as a positive result—it's possible to find good sparse approximations—or a negative result—the Laplacian doesn't capture very well how densely a graph is connected.
RELATED: Blog posts based on my interview with Dan Spielman at the 2014 Heidelberg Laureate Forum.
* Uses for orthogonal polynomials
* What is smoothed analysis?
* Studying algorithms to study problems

Here's an interesting little tidbit:

For any prime _p_ except 2 and 5, the decimal expansion of 1/_p_ repeats with a period that divides _p_ - 1. The period could be as large as _p_ - 1, but no larger. If it's less than _p_ - 1, then it's a divisor of _p_ - 1.

Here are a few examples.

1/3 = 0.33…
1/7 = 0.142857142857…
1/11 = 0.0909…
1/13 = 0.076923076923…
1/17 = 0.05882352941176470588235294117647…
1/19 = 0.052631578947368421052631578947368421…
1/23 = 0.04347826086956521739130434782608695652173913…

Here's a table summarizing the periods above.

|-------+--------|
| prime | period |
|-------+--------|
|     3 |      1 |
|     7 |      6 |
|    11 |      2 |
|    13 |      6 |
|    17 |     16 |
|    19 |     18 |
|    23 |     22 |
|-------+--------|

For a proof of the claims above, and more general results, see Periods of Fractions.
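The period of 1/_p_ is the multiplicative order of 10 mod _p_, i.e. the smallest _k_ with 10^_k_ ≡ 1 (mod _p_), and by Fermat's little theorem that order divides _p_ - 1. Here's a short sketch that computes it:

```python
# The period of the decimal expansion of 1/p is the smallest k
# with 10^k = 1 (mod p), i.e. the multiplicative order of 10 mod p.
def period(p):
    k, r = 1, 10 % p
    while r != 1:
        r = (10*r) % p
        k += 1
    return k

for p in [3, 7, 11, 13, 17, 19, 23]:
    print(p, period(p), (p - 1) % period(p) == 0)
```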

This morning I ran across the following graph via Horace Dediu.
I developed Windows software during the fattest part of the Windows curve. That was a great time to be in the Windows ecosystem.
Before that I was in an academic bubble. My world consisted primarily of Macs and various flavors of Unix. I had no idea that my world was a tiny minority. I had a PC at home, but mostly used it as a terminal to connect to remote Unix machines, and didn't realize that Windows was so dominant.
When I found out I'd been very wrong in my perception of market share, I determined not to be naive again. Ever since then I regularly look around to keep an eye on the landscape.
The graph above combines desktop and mobile computers, and you may or may not think that's appropriate. Relative to the desktop market, Windows remains dominant, but the desktop market itself has shrunk, not so much in absolute size as relative to the total computer market.
Last time I looked, about 70% of the traffic to this web site comes from desktops. I still consider the desktop the place for "real work," and many people feel the same way.
It's conventional to say the Roman Empire fell in 476 AD, but this would have been a surprise to those living in the eastern half of the empire who considered themselves to be Roman. The eastern (Byzantine) empire continued for another thousand years after the western empire fell.
The Windows empire hasn't fallen, but has changed. Microsoft is doing well, as far as I know. I don't keep up with Microsoft as much as I used to. I have no animosity toward Microsoft, but I'm no longer doing the kind of work their tools are intended for. I still use Windows--I'm writing this blog post from my Windows 10 machine--though I also use Mac, Linux, iOS, and Android.

P. J. Huber gives three desiderata for a statistical method in his book Robust Statistics:
* It should have a reasonably good (optimal or nearly optimal) efficiency at the assumed model.
* It should be robust in the sense that small deviations from the model assumptions should impair the performance only slightly.
* Somewhat larger deviations from the model should not cause a catastrophe.
Robust statistical methods are those that strive for these properties. These are not objective properties _per se_ though there are objective ways to try to quantify robustness in various contexts.
(In Nassim Taleb's trichotomy of fragile, robust, and anti-fragile, robustness is in the middle. Something is fragile if it is harmed by volatility, and anti-fragile if it benefits from volatility. Something robust is neither unduly harmed by volatility nor helped by it. In the context of statistical methods, volatility is departure from modeling assumptions. Hardly anything benefits from modeling assumptions being violated, but some methods are harmed more by such violations than others.)
Here are several blog posts I've written about robustness:
* Efficiency vs robustness
* Mean vs median
* Canonical examples from robust statistics
* More theoretical power, less real power
* When it works, it works really well
* Robust priors
* Robust prior illustration
* Robustness of simple rules
* Measuring graph robustness

MATRIX COMPRESSION
Suppose you have an _m_ by _n_ matrix _A_, where _m_ and _n_ are very large, that you'd like to compress. That is, you'd like to come up with an approximation of _A_ that takes less data to describe.
For example, consider a high resolution photo represented as a matrix of gray scale values. An approximation to the matrix could be a lower resolution representation that takes less space.
I heard a rumor many years ago that space probes would compress images by interpreting an image as a matrix and sending back a few eigenvalues and eigenvectors. That sounded fascinating, but what about images that aren't square? If a matrix _M_ is not square, then you can't have _Mx_ = λ_x_ for a scalar λ because the left and right sides of the equation would have different dimensions.
TRUNCATED SVD
Approximating a rectangular matrix requires something more general than eigenvalues and eigenvectors, namely singular values and singular vectors.

Suppose our matrix _A_ has the singular value decomposition

    A = U \Sigma V^*

This can be written as

    A = \sum_{i=1}^p \sigma_i u_i v_i^*

where the σ_i are the singular values and _p_ is the number of singular values that are non-zero. The u_i and v_i are the _i_th columns of _U_ and _V_ respectively. Singular values are nonnegative, and listed in decreasing order.

We can form an approximation to _A_ by truncating the sum above. Instead of taking _all_ the singular values, and their corresponding left and right singular vectors, we only take the _k_ largest singular values and their corresponding vectors:

    A_k = \sum_{i=1}^k \sigma_i u_i v_i^*

This is called the TRUNCATED SVD.

We started by assuming _A_ had dimensions _m_ by _n_ and that both were large. Storing _A_ requires _mn_ numbers. Storing A_k requires only _k_(1 + _m_ + _n_) numbers. This is a big savings if _k_ is much smaller than _m_ and _n_.

ECKART-YOUNG THEOREM

So how good an approximation is A_k? It turns out it is optimal, in the least squares sense. This is the content of the Eckart-Young theorem. It says that the best least squares (2-norm) approximation of _A_ by a rank _k_ matrix is given by A_k. Not only that, the theorem says the 2-norm error is given by the first singular value that we didn't use, i.e.

    \| A - A_k \|_2 = \sigma_{k+1}

RELATED POST

Singular value decomposition and pseudoinverse
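The Eckart-Young error formula is easy to check numerically. Here's a minimal NumPy sketch with a random matrix; the 2-norm error of the rank-_k_ truncation should match the first omitted singular value:

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.standard_normal((50, 30))

# Full SVD; NumPy returns singular values in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
# Truncated SVD: keep only the k largest singular values and vectors
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the 2-norm error equals the (k+1)st singular value
err = np.linalg.norm(A - Ak, 2)
print(err - s[k])   # essentially zero
```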
It often happens in applications that a linear system of equations _Ax_ = _b_ either does not have a solution or has infinitely many solutions. Applications often use LEAST SQUARES to create a problem that has a unique solution.
OVERDETERMINED SYSTEMS
Suppose the matrix _A_ has dimensions _m_ by _n_ and the right hand side vector _b_ has dimension _m_. Then the solution _x_, if it exists, has to have dimension _n_. If _m_ > _n_, i.e. we have more equations than unknowns, the system is OVERDETERMINED. The equations will be inconsistent in general. This is typical, for example, in linear regression.
In this case, you can use the least squares criterion to determine a solution. Instead of demanding that _Ax_ = _b_, we look for an _x_ that makes the difference between _Ax_ and _b_ as small as possible, as measured by the 2-norm (root mean square). That is, we pick _x_ to minimize

    \| Ax - b \|_2

meaning we solve the system as best we can, best as measured by the 2-norm.
UNDERDETERMINED SYSTEMS
If the number of rows in the matrix _A_, i.e. the number of equations, is less than the number of columns, i.e. the number of unknowns, then the system is UNDERDETERMINED. In general there will be infinitely many solutions. It's also possible that there are no solutions because the equations are inconsistent.
In this case, we can use least squares to assure that a solution exists, and to decide which of the many possible solutions to choose. Here, among the values of _x_ that minimize \| Ax - b \|_2, we want to find the _x_ that minimizes

    \| x \|_2

If there are values of _x_ that satisfy _Ax_ = _b_, then this will choose the solution with the least norm.
LEAST SQUARES SOLUTIONS AND SVD
If the singular value decomposition (SVD) of a real matrix _A_ is given by

    A = U \Sigma V^T

and _A_ has rank _r_, then the least squares solution to the system _Ax_ = _b_ is given explicitly by

    x = \sum_{i=1}^r \frac{u_i^T b}{\sigma_i} v_i

where the _u_'s are the columns of _U_ and the _v_'s are the columns of _V_.

Note that the denominators are not zero; the fact that _A_ has rank _r_ means that it has _r_ positive singular values.

Furthermore, the least squares residual, i.e. the degree to which _Ax_ differs from _b_, is given by

    \| Ax - b \|_2^2 = \sum_{i=r+1}^m (u_i^T b)^2

Note that if the matrix _A_ has rank _m_, then the least squares problem can be solved exactly, and the sum on the right side above is empty.
See, for example, Golub and Van Loan's classic book Matrix Computations.
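As a sanity check, here's a NumPy sketch of the explicit SVD solution for an overdetermined system, x = Σ_{i=1}^r (u_iᵀb / σ_i) v_i, compared against NumPy's own least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))    # 8 equations, 3 unknowns: overdetermined
b = rng.standard_normal(8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-12))         # numerical rank

# x = sum over the r nonzero singular values of (u_i^T b / sigma_i) v_i
x_svd = sum((U[:, i] @ b) / s[i] * Vt[i, :] for i in range(r))

# Compare with NumPy's built-in least squares solver
x_np, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_svd, x_np))    # True
```

The residual A x − b for this solution is orthogonal to the columns of A, which is the usual normal-equations characterization of a least squares solution.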
RELATED POSTS
* Singular value decomposition and matrix pseudoinverse
* Don’t invert that matrix
* Applied linear algebra

In a nutshell, given the singular value decomposition of a matrix _A_,

    A = U \Sigma V^*

the Moore-Penrose pseudoinverse is given by

    A^+ = V \Sigma^+ U^*

This post will explain what the terms above mean, and how to compute them in Python and in Mathematica.
SINGULAR VALUE DECOMPOSITION (SVD)
The singular value decomposition of a matrix is a sort of change of coordinates that makes the matrix simple, a generalization of diagonalization.
MATRIX DIAGONALIZATION
If a square matrix _A_ is diagonalizable, then there is a matrix _P_ such that

    D = P^{-1} A P

where the matrix _D_ is diagonal. You could think of _P_ as a change of coordinates that makes the action of _A_ as simple as possible. The elements on the diagonal of _D_ are the eigenvalues of _A_ and the columns of _P_ are the corresponding eigenvectors.
Unfortunately not all matrices can be diagonalized. Singular value decomposition is a way to do something like diagonalization for _any_ matrix, even non-square matrices.
GENERALIZATION TO SVD
Singular value decomposition generalizes diagonalization. The matrix Σ in SVD is analogous to _D_ in diagonalization. Σ is diagonal, though it may not be square. The matrices on either side of Σ are analogous to the matrix _P_ in diagonalization, though now there are two different matrices, and they are not necessarily inverses of each other. The matrices _U_ and _V_ are square, but not necessarily of the same dimension.
The elements along the diagonal of Σ are not necessarily eigenvalues but SINGULAR VALUES, which are a generalization of eigenvalues. Similarly the columns of _U_ and _V_ are not necessarily eigenvectors but LEFT SINGULAR VECTORS and RIGHT SINGULAR VECTORS respectively.
The star superscript indicates CONJUGATE TRANSPOSE. If a matrix has all real components, then the conjugate transpose is just the transpose. But if the matrix has complex entries, you take the conjugate and transpose each entry.
The matrices _U_ and _V_ are UNITARY. A matrix _M_ is unitary if its inverse is its conjugate transpose, i.e. _M_^* _M_ = _M_ _M_^* = _I_.

PSEUDOINVERSE AND SVD

The (Moore-Penrose) PSEUDOINVERSE of a matrix generalizes the notion of an inverse, somewhat like the way SVD generalizes diagonalization. Not every matrix has an inverse, but every matrix has a pseudoinverse, even non-square matrices.

Computing the pseudoinverse from the SVD is simple. If

    A = U \Sigma V^*

then

    A^+ = V \Sigma^+ U^*

where Σ^+ is formed from Σ by taking the reciprocal of all the non-zero elements, leaving all the zeros alone, and making the matrix the right shape: if Σ is an _m_ by _n_ matrix, then Σ^+ must be an _n_ by _m_ matrix. We'll give examples below in Mathematica and Python.

COMPUTING SVD IN MATHEMATICA

Let's start with the matrix _A_ below. We can find the SVD of _A_ with the following Mathematica commands.

    A = {{2, -1, 0}, {4, 3, -2}}
    {U, S, V} = SingularValueDecomposition[A]

From this we learn that the singular values of _A_ are √30 and 2. Note that the last factor in the SVD is the conjugate transpose of _V_, but Mathematica returns _V_ itself, not its transpose. If we multiply the matrices back together we can verify that we get _A_ back.

    U . S . Transpose[V]

This returns {{2, -1, 0}, {4, 3, -2}} as we'd expect.

COMPUTING PSEUDOINVERSE IN MATHEMATICA

The Mathematica command for computing the pseudoinverse is simply `PseudoInverse`. (The best thing about Mathematica is its consistent, predictable naming.)
PseudoInverse[A]
This returns
{{19/60, 1/12}, {-(11/30), 1/6}, {1/12, -(1/12)}}
And we can confirm that computing the pseudoinverse via the SVD
    Sp = {{1/Sqrt[30], 0}, {0, 1/2}, {0, 0}}
    V . Sp . Transpose[U]
gives the same result.
COMPUTING SVD IN PYTHON
Next we compute the singular value decomposition in Python (NumPy).
    >>> import numpy as np
    >>> a = np.matrix([[2, -1, 0], [4, 3, -2]])
    >>> u, s, vt = np.linalg.svd(a, full_matrices=True)
Note that `np.linalg.svd` returns the _transpose_ of _V_, not the _V_ in the definition of singular value decomposition.
Also, the object `s` is not the diagonal matrix Σ but a vector containing only the diagonal elements, i.e. just the singular values. This can save a lot of space if the matrix is large. The NumPy method `svd` has other efficiency-related options that I won't go into here.
We can verify that the SVD is correct by turning `s` back into a matrix and multiplying the components together.
    >>> ss = np.matrix([[s[0], 0, 0], [0, s[1], 0]])
    >>> u*ss*vt
This returns the matrix _A_, within floating point accuracy. Since Python is doing floating point computations, not symbolic calculation like Mathematica, the zero in _A_ turns into `-3.8e-16`.
Note that the singular value decompositions as computed by Mathematica and Python differ in a few signs here and there; the SVD is not unique.
COMPUTING PSEUDOINVERSE IN PYTHON
The pseudoinverse can be computed in NumPy with `np.linalg.pinv`.
    >>> np.linalg.pinv(a)
    matrix([[ 0.31666667,  0.08333333],
            [-0.36666667,  0.16666667],
            [ 0.08333333, -0.08333333]])
This returns the same result as Mathematica above, up to floating point precision.
RELATED POSTS
* Pseudoinverse is not an adjoint
* Don't invert that matrix
* Applied linear algebra
The previous post looked at how probability predictions from a logistic regression model vary as a function of the fitted parameters. This post goes through the same exercise for PROBIT REGRESSION and compares the two kinds of nonlinear regression.
GENERALIZED LINEAR MODELS AND LINK FUNCTIONS
Logistic and probit regression are minor variations on a theme. They are both forms of GENERALIZED LINEAR MODELS (GLM), but with different LINK FUNCTIONS.
In logistic regression, the link function is

    f(p) = \log\left( \frac{p}{1-p} \right)

whereas in probit regression the link function is

    f(p) = \Phi^{-1}(p)

where Φ is the standard normal distribution CDF, i.e.

    \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp(-t^2/2) \, dt

The probability prediction from a generalized linear model is the inverse of the link function applied to a linear model. In logistic regression we had

    p(x) = \frac{1}{1 + \exp(-(a + bx))}

and in probit regression we have

    p(x) = \Phi(a + bx)
COMPARING LOGISTIC AND PROBIT CURVES
As a rule of thumb, probit regression coefficients are roughly equal to logistic regression coefficients divided by 1.6. [1] To see this, here's a plot of the logistic curve from the previous post, which used _a_ = 3 and _b_ = 4, and a plot of the probit curve where _a_ = 3/1.6 and _b_ = 4/1.6.
And because the curves are so close together, we'll also plot their difference to see a little better how they differ.
If the curves are so similar, why do we have both? Because one may be easier to work with in some context and the other in other contexts.
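Here's a quick numerical look (my sketch, not from the original post) at how close the two curves are under the 1.6 rule of thumb:

```python
import numpy as np
from scipy.stats import norm

a, b = 3, 4
x = np.linspace(-4, 3, 2001)

logistic = 1 / (1 + np.exp(-(a + b*x)))
probit = norm.cdf(a/1.6 + (b/1.6)*x)   # coefficients divided by 1.6

# Maximum vertical distance between the two curves: around 0.02
print(np.max(np.abs(logistic - probit)))
```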
SENSITIVITY TO PARAMETERS
In this post we'll look at how

    p(x) = \Phi(a + bx)

varies with each of its parameters.
As with the logistic curve before, the probit curve has zero curvature when _x_ = -_a_/_b_, which corresponds to _p_ = 1/2.
For the logistic curve, the slope at _p_ = 1/2 was simply _b_, whereas here with the probit curve the slope at the flattest part is _b_/√(2π).
The rate of change of _p_ with respect to _a_ is

    \frac{\partial p}{\partial a} = \frac{1}{\sqrt{2\pi}} \exp(-(a + bx)^2/2)

To find where this is maximized we find where

    \frac{\partial^2 p}{\partial x \, \partial a} = -\frac{b(a + bx)}{\sqrt{2\pi}} \exp(-(a + bx)^2/2)

is zero, which is again at _x_ = -_a_/_b_. By plugging this back in we find the maximum rate of change of _p_ with respect to _a_ is 1/√(2π).
As before, the dependence on _b_ is a little more complicated than the dependence on _a_. At the middle of the curve, i.e. when _p_ = 1/2, and so _x_ = -_a_/_b_, we have

    \frac{\partial p}{\partial b} = -\frac{a}{b\sqrt{2\pi}}
The places where the sensitivity to _b_ is maximized are a little easier to find this time. We have

    \frac{\partial^2 p}{\partial x \, \partial b} = \frac{1 - bx(a + bx)}{\sqrt{2\pi}} \exp(-(a + bx)^2/2)

and so the places most sensitive to a change in _b_ can be found by solving a quadratic equation, _bx_(_a_ + _bx_) = 1, and it works out that the sensitivity is maximized at

    x = \frac{-a \pm \sqrt{a^2 + 4}}{2b}
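The claims above are easy to verify with finite differences. A sketch, using the same _a_ = 3 and _b_ = 4 as in the plots:

```python
import numpy as np
from scipy.stats import norm

a, b = 3, 4

def p(x):
    # probit curve with the parameters above
    return norm.cdf(a + b*x)

x0 = -a/b          # the flattest point, where p = 1/2
h = 1e-6

# slope of the curve in x at the flattest point: should be b/sqrt(2*pi)
slope = (p(x0 + h) - p(x0 - h)) / (2*h)

# sensitivity to a at the same point: should be 1/sqrt(2*pi)
dpda = (norm.cdf((a + h) + b*x0) - norm.cdf((a - h) + b*x0)) / (2*h)

print(slope, b/np.sqrt(2*np.pi))
print(dpda, 1/np.sqrt(2*np.pi))
```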
RELATED POSTS
* Sensitivity of logistic regression on parameters
* Linear regression
[1] Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press 2007.

The output of a logistic regression model is a function that predicts the probability of an event as a function of the input parameter. This post will only look at a simple logistic regression model with one predictor, but similar analysis applies to multiple regression with several predictors. The model predicts

    p(x) = \frac{1}{1 + \exp(-(a + bx))}
Here's a plot of such a curve when _a_ = 3 and _b_ = 4.
FLATTEST PART
The curvature of the logistic curve is small at both extremes. As _x_ comes in from negative infinity, the curvature increases, then decreases to zero, then increases again, then decreases as _x_ goes to positive infinity. We quantified this statement in another post where we calculate the curvature. The curvature is zero at the point where the second derivative of _p_ is zero, which occurs when _x_ = -_a_/_b_. At that point _p_ = 1/2, so the curve is flattest where the probability crosses 1/2. In the graph above, this happens at _x_ = -0.75.
A little calculation shows that the slope at the flattest part of the logistic curve is simply _b_.
SENSITIVITY TO PARAMETERS
Now how much does the probability prediction _p_(_x_) change as the parameter _a_ changes? We now need to consider _p_ as a function of three variables, i.e. we need to consider _a_ and _b_ as additional variables. The marginal change in _p_ in response to a change in _a_ is the partial derivative of _p_ with respect to _a_:

    \frac{\partial p}{\partial a} = p(1 - p)

To know where this is maximized with respect to _x_, we take the partial derivative of the above expression with respect to _x_,

    \frac{\partial^2 p}{\partial x \, \partial a} = b \, p(1 - p)(1 - 2p)

which is zero when _x_ = -_a_/_b_, the same place where the logistic curve is flattest. And the partial of _p_ with respect to _a_ at that point is simply 1/4, independent of _b_. So a small change Δ_a_ results in a change of approximately Δ_a_/4 at the flattest part of the logistic curve and results in less change elsewhere.
What about the dependence on _b_? That's more complicated. The rate of change of _p_ with respect to _b_ is

    \frac{\partial p}{\partial b} = x \, p(1 - p)

and this is maximized where

    \frac{\partial^2 p}{\partial x \, \partial b} = p(1 - p)(1 + bx(1 - 2p)) = 0
which in turn requires solving a nonlinear equation. This is easy to do numerically in a specific case, but not easy to work with analytically in general.
However, we can easily say how _p_ changes with _b_ near the point _x_ = -_a_/_b_. This is not where the partial of _p_ with respect to _b_ is maximized, but it's a place of interest because it has come up two times above. At that point the derivative of _p_ with respect to _b_ is -_a_/4_b_. So if _a_ and _b_ have the same sign, then a small increase in _b_ will result in a small decrease in _p_ and vice versa.
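These results can be checked numerically with finite differences. A sketch, using the _a_ = 3 and _b_ = 4 from the example above:

```python
import numpy as np

a, b = 3, 4

def p(x, a, b):
    # logistic curve: probability as a function of x and parameters a, b
    return 1 / (1 + np.exp(-(a + b*x)))

x0 = -a/b      # the flattest point, where p = 1/2
h = 1e-6

# partial derivative of p with respect to a at x0: should be 1/4
dpda = (p(x0, a + h, b) - p(x0, a - h, b)) / (2*h)

# partial derivative of p with respect to b at x0: should be -a/(4b)
dpdb = (p(x0, a, b + h) - p(x0, a, b - h)) / (2*h)

print(dpda)   # ≈ 0.25
print(dpdb)   # ≈ -0.1875, i.e. -3/16
```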

TRIDIAGONAL MATRICES
A tridiagonal matrix is a matrix that has nonzero entries only on the main diagonal and on the adjacent off-diagonals. This special structure comes up frequently in applications. For example, the finite difference numerical solution to the heat equation leads to a tridiagonal system. Another application, the one we'll look at in detail here, is NATURAL CUBIC SPLINES. And we'll mention an interesting result with Fibonacci numbers in passing.
Because these matrices have a special structure, the corresponding linear systems are quick and easy to solve. Also, we can find a simple recurrence relation for their determinants.
DETERMINANTS
Let's label the components of a tridiagonal matrix as below

    M_n = \begin{pmatrix}
    a_1 & b_1 &     &        &         \\
    c_1 & a_2 & b_2 &        &         \\
        & c_2 & a_3 & \ddots &         \\
        &     & \ddots & \ddots & b_{n-1} \\
        &     &     & c_{n-1} & a_n
    \end{pmatrix}

where every component not shown is implicitly zero. We can expand the determinant of the matrix above using minors along the last row. This gives a recursive expression for the determinant:

    d_n = a_n d_{n-1} - b_{n-1} c_{n-1} d_{n-2}

with initial conditions d_0 = 1 and d_{-1} = 0.

Note that if all the _a_'s and _b_'s are 1 and all the _c_'s are -1, then you get the recurrence relation that defines the Fibonacci numbers. That is, the Fibonacci numbers are given by the determinants of such tridiagonal matrices.

NATURAL CUBIC SPLINES

A cubic spline interpolates a set of data points with piecewise cubic polynomials. There's a (potentially) different cubic polynomial over each interval between input values, all fitted together so that the resulting function, its derivative, and its second derivative are all continuous.

Suppose you have input points, called _knots_ in this context, t_0, t_1, …, t_n and output values y_0, y_1, …, y_n. For the spline to interpolate the data, its value at t_i must be y_i.

A cubic spline then is a set of _n_ cubic polynomials, one for each interval [t_i, t_{i+1}]. A cubic polynomial has four coefficients, so we have 4_n_ coefficients in total. At each interior knot, t_1 through t_{n-1}, we have four constraints. Both polynomials that meet at t_i must take on the value y_i at that point, and the two polynomials must have the same first and second derivative at that point. That gives us 4(_n_-1) equations. The value of the first polynomial is specified on the left end at t_0 and the value of the last polynomial is specified at the right end at t_n. This gives us 4_n_ - 2 equations in total. We need two more equations.

A CLAMPED CUBIC SPLINE specifies the derivatives at each end point. The NATURAL CUBIC SPLINE specifies instead that the second derivatives at each end are zero. What is natural about a natural cubic spline? In a certain sense it is the smoothest curve interpolating the specified points.

With these boundary conditions we now have as many constraints as degrees of freedom. So how would we go about finding the coefficients of each polynomial? Our task will be much easier if we parameterize the polynomials cleverly to start with. Instead of powers of _x_, we want powers of (x - t_i) and (t_{i+1} - x) because these expressions are 0 on different ends of the interval [t_i, t_{i+1}]. It turns out we parameterize the spline over the _i_th interval as

    S_i(x) = \frac{z_{i+1}}{6h_i}(x - t_i)^3 + \frac{z_i}{6h_i}(t_{i+1} - x)^3 + C_i (x - t_i) + D_i (t_{i+1} - x)

where h_i = t_{i+1} - t_i, the length of the _i_th interval. This may seem unmotivated, and no doubt it is cleaner than the first thing someone probably tried, but it's the kind of thing you're led to when you try to make the derivation work out smoothly. The basic form is powers of (x - t_i) and (t_{i+1} - x), each to the first and third powers, for reasons given above. Why the 6's in the denominators? They're not strictly necessary, but they cancel out when we take second derivatives. Let's look at the second derivative:

    S_i''(x) = \frac{z_{i+1}}{h_i}(x - t_i) + \frac{z_i}{h_i}(t_{i+1} - x)

Note how when we stick in t_i the first term is zero and the second is z_i, and when we stick in t_{i+1} the first term is z_{i+1} and the second is zero.

We can now write down the system of equations for the _z_'s. We have z_0 = z_n = 0 from the natural cubic spline condition, and for 1 ≤ _i_ ≤ _n_-1 we have

    h_{i-1} z_{i-1} + 2(h_{i-1} + h_i) z_i + h_i z_{i+1} = 6(b_i - b_{i-1})

where

    b_i = \frac{y_{i+1} - y_i}{h_i}

Note that this is a tridiagonal system because the _i_th equation only involves _z_'s with subscripts _i_-1, _i_, and _i_+1. Because of its tridiagonal structure, these equations can be solved simply and efficiently, much more efficiently than a general system of equations.

RELATED POSTS

* Applied linear algebra
* Chebyshev interpolation
* Help with interpolation
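Here's a sketch in Python of the determinant recurrence d_i = a_i d_{i-1} - b_{i-1} c_{i-1} d_{i-2}, using the Fibonacci special case as a check against a direct determinant computation:

```python
import numpy as np

def tridiag_det(a, b, c):
    # Determinant of a tridiagonal matrix with diagonal a,
    # superdiagonal b, and subdiagonal c, via the recurrence
    # d_i = a_i d_{i-1} - b_{i-1} c_{i-1} d_{i-2}.
    d_prev2, d_prev1 = 0, 1              # d_{-1} = 0, d_0 = 1
    for i in range(len(a)):
        d = a[i]*d_prev1 - (b[i-1]*c[i-1]*d_prev2 if i > 0 else 0)
        d_prev2, d_prev1 = d_prev1, d
    return d_prev1

# All a's and b's equal to 1 and all c's equal to -1 gives Fibonacci numbers
n = 10
a, b, c = [1]*n, [1]*(n-1), [-1]*(n-1)
print(tridiag_det(a, b, c))              # 89, a Fibonacci number

# Check against a direct determinant computation
M = np.diag(a) + np.diag(b, 1) + np.diag(c, -1)
print(round(np.linalg.det(M)))           # 89
```

The recurrence runs in O(n) time, whereas a general determinant computation takes O(n³).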
The latest blog post from Gödel's Lost Letter and P=NP looks at the problem of finding relatively prime pairs of large numbers. In particular, they want a deterministic algorithm. They mention in passing that the probability of a set of _k_ large integers being relatively prime (coprime) is 1/ζ(_k_) where ζ is the Riemann zeta function. This blog post will look closer at that statement.
HOW LARGE IS LARGE?
What does it mean that the numbers are large? That they are large enough that the asymptotic result given by the zeta function is accurate enough. We'll explore this with simulation code shortly.
The exact statement is that in the limit as _N_ goes to infinity, the probability of _k_ integers chosen uniformly from 1 to _N_ being coprime converges to 1/ζ(_k_). I won't go into the proof here, but in a nutshell, it falls out of Euler's product formula for the zeta function.
WHAT DOES IT MEAN FOR A SET OF NUMBERS TO BE COPRIME?
The probability above seemed wrong to me at first. The function ζ(_k_) decreases as _k_ increases, and so its reciprocal increases. That means it's more likely that a set of numbers is coprime as the size of the set increases. But the larger the set of numbers, the more likely it is that two will have a common factor, so shouldn't the probability that they're coprime go down with _k_, not up?
The resolution to the apparent contradiction is that I had the wrong idea of coprime in mind. Two integers are coprime if they have no prime factors in common. How do you generalize that to more than two integers? You could define a set of numbers to be coprime if every pair of numbers in the set is coprime. That's the definition I had in mind. But that's not the standard definition. The right definition here is that the greatest common divisor of the set is 1. For example, (6, 14, 21) would be coprime because no single prime divides all three numbers, even though no pair of numbers from the set is coprime.
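Here's that example in Python: the set (6, 14, 21) is coprime in the setwise sense, even though no pair of its elements is coprime.

```python
from math import gcd
from functools import reduce
from itertools import combinations

nums = (6, 14, 21)

# Setwise coprime: the gcd of the whole set is 1 ...
print(reduce(gcd, nums))                                 # 1

# ... even though no pair within the set is coprime
print([gcd(p, q) for p, q in combinations(nums, 2)])     # [2, 3, 7]
```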
PYTHON SIMULATION
Let's write some Python code to try this out. We'll randomly generate some sets of large numbers and see how often they're coprime.
How do you generate large random integers in Python? You can use the function `getrandbits` to generate numbers as large as you like. In the code below we'll generate 128-bit integers.

How do you compute the greatest common divisor in Python? There's a library function `gcd`, but it only takes two integers. We'll use the `reduce` function to fold `gcd` over a list of integers.
How do you compute the zeta function in Python? SciPy doesn't have an implementation of ζ(_x_) per se, but it does have an implementation of ζ(_x_)-1. Why not just implement ζ itself? Because for large _x_, ζ(_x_) is close to 1, so more precision can be saved by computing ζ - 1.
Putting these pieces together, here's code to randomly generate triples of 128-bit integers, see how often they're coprime, and compare the result to 1/ζ(3).
    from random import getrandbits
    from functools import reduce
    from math import gcd
    from scipy.special import zetac

    def riemann_zeta(x):
        return zetac(x) + 1

    k = 3
    size = 128
    N = 10000

    coprime_count = 0
    for _ in range(N):
        x = [getrandbits(size) for _ in range(k)]
        if reduce(gcd, x) == 1:
            coprime_count += 1

    print(coprime_count/N)
    print(1/riemann_zeta(k))
When I ran this I got
0.8363 0.831907372581
Simulation agrees with theory to two decimal places, which is just what we should expect from 10,000 repetitions. (We'd expect error on the order of 1/√_N_.)
Here's what I got when I changed `k` to 4:
0.9234 0.923938402922
And here's what I got for `k` set to 5:
0.9632 0.964387340429
RELATED POSTS
* Functional folds and conjugate priors
* Probability that a number is prime
What could be interesting about the lowly quadratic formula? It's a _formula_ after all. You just stick numbers into it.
Well, there's an interesting wrinkle. When the linear coefficient _b_ is large relative to the other coefficients, the quadratic formula can give wrong results when implemented in floating point arithmetic.
QUADRATIC FORMULA AND LOSS OF PRECISION
The quadratic formula says that the roots of

    ax^2 + bx + c = 0

are given by

    x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
That's true, but let's see what happens when we have _a_ = _c_ = 1 and _b_ = 10^8.

    from math import sqrt

    def quadratic(a, b, c):
        r = sqrt(b**2 - 4*a*c)
        return ((-b + r)/(2*a), (-b - r)/(2*a))

    print( quadratic(1, 1e8, 1) )

This returns

    (-7.450580596923828e-09, -100000000.0)

The first root is wrong by about 25%, though the second one is correct.

What happened? The quadratic formula violated the cardinal rule of numerical analysis: avoid subtracting nearly equal numbers. The more similar two numbers are, the more precision you can lose from subtracting them. In this case √(_b_² - 4_ac_) is very nearly equal to _b_. If we ask Python to evaluate

    1e8 - sqrt(1e16 - 4)

we get `1.49e-8` when the correct answer would be `2.0e-8`.
IMPROVING PRECISION FOR QUADRATIC FORMULA
The way to fix the problem is to rationalize the numerator of the quadratic formula by multiplying by 1 in the form

    \frac{-b \mp \sqrt{b^2 - 4ac}}{-b \mp \sqrt{b^2 - 4ac}}

(The symbol ∓ is much less common than ±. It means that if you take the + sign in the quadratic formula, you take the - sign above, and vice versa.)

When we multiply by the expression above and simplify we get

    x = \frac{2c}{-b \mp \sqrt{b^2 - 4ac}}
Let's code this up in Python and try it out.
    def quadratic2(a, b, c):
        r = sqrt(b**2 - 4*a*c)
        return (2*c/(-b - r), 2*c/(-b + r))

    print( quadratic2(1, 1e8, 1) )
This returns
(-1e-08, -134217728.0)
So is our new quadratic formula better? It gives the right answer for the first root, exact to within machine precision. But now the second root is wrong by 34%. Why is the second root wrong? Same reason as before: we subtracted two nearly equal numbers!

The familiar version of the quadratic formula computes the larger root correctly, and the new version computes the smaller root correctly. Neither version is better overall. We'd be no better or worse off always using the new quadratic formula than the old one. Each one is better when it avoids subtracting nearly equal numbers.
The solution is to use _both_ quadratic formulas, using the appropriate one for the root you're trying to calculate.
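Here's a sketch of what that might look like (the function name and sign logic are mine, not from the original post): choose the sign of the square root so the classic formula never cancels, and get the other root from the rationalized formula.

```python
from math import sqrt

def stable_quadratic(a, b, c):
    # Roots of ax^2 + bx + c = 0, assuming real roots and a != 0.
    # For each root, use whichever formula avoids subtracting
    # nearly equal numbers.
    r = sqrt(b**2 - 4*a*c)
    if b >= 0:
        x1 = (-b - r) / (2*a)   # classic formula: -b and -r don't cancel
        x2 = (2*c) / (-b - r)   # rationalized formula for the other root
    else:
        x1 = (-b + r) / (2*a)
        x2 = (2*c) / (-b + r)
    return x1, x2

print(stable_quadratic(1, 1e8, 1))   # (-100000000.0, -1e-08): both accurate
```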
LOW PRECISION ARITHMETIC
Is this a practical concern? Yes, and here's why: Everything old is new again.
The possible inaccuracy in the quadratic formula was serious before double precision (64-bit floating point) arithmetic became common. And back in the day, engineers were more likely to be familiar with the alternate form of the quadratic formula. You can still run into quadratic equations that give you trouble even in double precision arithmetic, like the example above, but it's less likely to be a problem when you have more bits at your disposal.
Now we're interested in low-precision arithmetic again. CPUs have gotten much faster, but moving bits around in memory has not. Relative to CPU speed, memory manipulation has gotten slower. That means we need to be more concerned with memory management and less concerned about floating point speed.
Not only is memory juggling slower relative to CPU, it also takes more energy. According to Gustafson, reading 64 bits from DRAM requires 65 times as much energy as doing a floating point combined multiply-add because it takes place off-chip. The table below, from page 6 of Gustafson's book, gives the details. Using lower precision floating point saves energy because more numbers can be read in from memory in the same number of bits. (Here pJ = picojoules.)
| Operation | Energy consumed | Where |
|---|---|---|
| Perform a 64-bit floating point multiply-add | 64 pJ | on-chip |
| Load or store 64 bits of register data | 6 pJ | on-chip |
| Read 64 bits from DRAM | 4200 pJ | off-chip |
So we might be interested in low-precision arithmetic to save energy in a battery powered mobile device, or to save clock cycles in a server application manipulating a lot of data. This means that the numerical tricks that most people have forgotten about are relevant again.
RELATED POSTS
* Eight-bit floating point
* Math library functions that seem unnecessary
There was a discussion on Twitter today about a mistake calculus students make:
I pointed out that it's only off by one character:

    \frac{d}{dx} e^x = x e^{x-1}

versus

    \frac{d}{de} e^x = x e^{x-1}

The first equation is simply wrong. The second is correct, but a gross violation of convention, using _x_ as a constant and _e_ as a variable.
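Assuming the two equations in question are d/dx eˣ = x e^(x-1), the student's mistake, and d/de eˣ = x e^(x-1), which differs by one character and is correct, SymPy can confirm both claims:

```python
import sympy as sp

x, e = sp.symbols('x e', positive=True)

# The genuine derivative of e^x with respect to x, e being the usual constant
print(sp.diff(sp.exp(x), x))                      # exp(x)

# Differentiating e**x with respect to the *symbol* e, holding x constant,
# the power rule applies and gives x*e**(x - 1)
deriv = sp.diff(e**x, e)
print(sp.simplify(deriv - x*e**(x - 1)) == 0)     # True
```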

I've started a new Twitter account: @BasicStatistics.
The new account is for people who are curious about statistics. It's meant to be accessible to a wider audience than @DataSciFact.
More Twitter accounts here.

Bernstein's Matrix Mathematics is impressive. It's over 1500 pages and weighs 5.3 pounds (2.4 kg). It's a reference book, not the kind of book you just sit down to read. (Actually, I have sat down to read parts of it.) I'd used a library copy of the first edition, and so when Princeton University Press offered me a review copy of the second edition, I jumped on it.
Matrix Mathematics has a _lot_ of information on linear algebra. As you'd expect from the title, it's mostly about linear algebra. And despite its enormous size, it's also dense. Mostly definitions and theorem statements. Some examples and proofs, but mostly statements of facts.
But there are a lot of other topics covered too. For example, I was surprised to see a section on Bell polynomials, a topic I ran across in my work and blogged about not long ago.
Why even have reference books these days when you can easily find so much online? For one thing, there's still a lot you can't easily find online. When you go beyond commonly known material, as this book does, it gets hard to search for what you need.
For another, an author goes to tremendous effort to arrange the information coherently. When you read a _book_ you find things you didn't know to search for. Maybe you start by looking up what you think you need to know in the index, but then you find out from context what you really needed to know.
I'm glad to add this to the books I keep close at hand. I can find what I need quickly, and that's more important than it may seem. If I save a couple minutes, the benefit is not just that I get a couple more minutes work done. The main benefit is that I increase my chances of acting on an inspiration before it evaporates.

The MOORE-PENROSE PSEUDOINVERSE of a matrix is a way of coming up with something like an inverse for a matrix that doesn't have an inverse. If a matrix does have an inverse, then the pseudoinverse is in fact the inverse. The Moore-Penrose pseudoinverse is also called a generalized inverse for this reason: it's not just _like_ an inverse, it actually _is_ an inverse when that's possible.
Given an _m_ by _n_ matrix _A_, the Moore-Penrose pseudoinverse _A_

^{+}is the unique _n_ by _m_ matrix satisfying four conditions: * _A_ _A_^{+}_A_ = _A_ * _A_^{+}_A_ _A_^{+}= _A_^{+}* (_A_ _A_^{+})* = _A_ _A_^{+}* (_A_^{+}_A_)* = _A_^{+}_A_ The first equation says that _A__A_^{+}is a left identity for _A_, and _A_^{+}_A_ is a identity for _A_. The second equation says _A_^{+}_A_ is a left identity for _A_^{+}, and _A_ _A_^{+}is a right identity for _A_^{+}. The third and fourth equations say that _A_ _A_^{+}and _A_^{+}_A_ are Hermitian. If _A_ is invertible, _A_ _A_^{+}and _A_^{+}_A_ are both the identity matrix. Otherwise _A_ _A_^{+}and _A_^{+}_A_ act an awful lot like the identity, as much as you could expect, maybe a little more than you'd expect. UPDATE: See this post for the relationship between the singular value decomposition and pseudoinverses, and how to compute both in Python and Mathematica. GALOIS CONNECTIONS AND ADJOINTS John Baez recently wrote that a GALOIS CONNECTION, a kind of categorical ADJUNCTION, is"the best approximation to reversing a computation that can't be reversed."That sounds like a pseudoinverse! And the first two equations defining a pseudoinverse look a lot like things you'll see in the context of adjunctions, so the pseudoinverse must be an adjunction, right? The question was raised on MathOverflow and Michal R. Przybylek answered

I do not think the concept of Moore-Penrose Inverse and the concept of categorical adjunction have much in common (except they BOTH TRY TO GENERALISE THE CONCEPT OF INVERSE) …and gives several reasons why. (Emphasis added.) Too bad. It would have made a good connection. Applied mathematicians are likely to be familiar with Moore-Penrose pseudoinverses but not categorical adjoints. And pure mathematicians, depending on their interests, may be more familiar with adjoint functors than matrix pseudoinverses. So what about John Baez' comment? His comment was expository (and very helpful) but not meant to be rigorous. To make it rigorous you'd have to be rigorous about what you mean by "best approximation" etc. And when you define your terms carefully, in the language of category theory, you get adjoints. This means that the Moore-Penrose inverse, despite its many nice properties [1], doesn't mesh well with categorical definitions. It's not the best approximate inverse from a categorical perspective because it doesn't compose well, and category theory values composition above all else. The Moore-Penrose pseudoinverse may be the best approximate inverse from some perspectives, but not from a categorical perspective. Przybylek explains

… adjunctions _compose_ … but Moore-Penrose pseudoinverses—generally—do not. … pseudoinverses are not stable under _isomorphisms_, thus are not _categorical_.

That's the gist of his final point. Now let me fill in and expand slightly part of what I cut out.

If _f_: _A_ → _B_ is left adjoint to _f_^{+}: _B_ → _A_ and _g_: _B_ → _C_ is left adjoint to _g_^{+}: _C_ → _B_ then the composition _gf_: _A_ → _C_ is left adjoint to the composition _f_^{+}_g_^{+}: _C_ → _A_, but Moore-Penrose pseudoinverses do not compose this way in general.

This turns out to be an interesting example, but not of what I first expected. Rather than the pseudoinverse of a matrix being an example of an adjoint, it is an example of something that despite having convenient properties does not compose well from a categorical perspective.

RELATED POSTS:

* What do you mean by "can't"?
* How to differentiate a non-differentiable function
* Approximating a solution that doesn't exist
* Applied category theory

[1] The book Matrix Mathematics devotes about 40 pages to stating theorems about the Moore-Penrose pseudoinverse.
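Both the four Penrose conditions and the failure of pseudoinverses to compose are easy to check numerically. Here's a sketch using NumPy's `np.linalg.pinv`, which computes the pseudoinverse via the SVD; the matrices are my own arbitrary examples, not anything from the discussion above.

```python
import numpy as np

# The four Penrose conditions, checked for a random non-square matrix.
rng = np.random.default_rng(42)
A = rng.standard_normal((5, 3))   # 5x3, so no ordinary inverse exists
P = np.linalg.pinv(A)             # the 3x5 Moore-Penrose pseudoinverse A+

assert np.allclose(A @ P @ A, A)             # A A+ A = A
assert np.allclose(P @ A @ P, P)             # A+ A A+ = A+
assert np.allclose((A @ P).conj().T, A @ P)  # (A A+)* = A A+
assert np.allclose((P @ A).conj().T, P @ A)  # (A+ A)* = A+ A

# Pseudoinverses do not compose the way inverses (and adjoints) do:
# in general (FG)+ != G+ F+. A small concrete counterexample:
F = np.array([[1.0, 0.0],
              [0.0, 0.0]])        # rank-1 projection
G = np.array([[1.0],
              [1.0]])             # 2x1 column

lhs = np.linalg.pinv(F @ G)                  # [[1, 0]]
rhs = np.linalg.pinv(G) @ np.linalg.pinv(F)  # [[0.5, 0]]
print(np.allclose(lhs, rhs))                 # False
```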

One of my complaints about math writing is that definitions are hardly ever subtractive, even if that's how people think of them.
For example, a monoid is a group except without inverses. But that's not how you'll see it defined. Instead you'll read that it's a set with an associative binary operation and an identity element. A module is a vector space, except the scalars come from a ring instead of a field. But the definition from scratch is more than I want to write here. Any time you have a sub-widget or a pre-widget or a semi-widget, it's probably best to define the widget first.
I understand the LOGICAL tidiness of saying what a thing _is_ rather than what it is _not_. But it makes more PEDAGOGICAL sense to describe the difference between a new concept and the most similar familiar concept. And the nearest familiar concept may have more structure rather than less.
Suppose you wanted to describe where a smaller city is by giving directions from a larger, presumably better known, city, but you could only move east. Then instead of saying Ft. Worth is 30 miles west of Dallas, you'd have to say it's 1,000 miles east of Phoenix.
Writers don't have to choose between crisp logic and good pedagogy. They can do both. They can say, for example, that a pre-thingy is a thingy without some property, then say "That is, a pre-thingy satisfies the following axioms: …"

A probability distribution is called "fat tailed" if its probability density goes to zero slowly. Slowly relative to what? That is often implicit and left up to context, but generally speaking the exponential distribution is the dividing line. Probability densities that decay faster than the exponential distribution are called "thin" or "light," and densities that decay slower are called "thick", "heavy," or "fat," or more technically "subexponential." The distinction is important because fat-tailed distributions tend to defy our intuition.
One surprising property of heavy-tailed (subexponential) distributions is the single big jump principle. Roughly speaking, most of the contribution to the sum of several heavy-tailed random variables comes from the largest of the samples. To be more specific, let "several" = 4 for reasons that'll be apparent soon, though the result is true for any _n_. As _x_ goes to infinity, the probability that
_X__{1} + _X__{2} + _X__{3} + _X__{4} is larger than _x_ is asymptotically equal to the probability that max(_X__{1}, _X__{2}, _X__{3}, _X__{4}) is larger than _x_.

The idea behind the OBESITY INDEX [1] is to turn the theorem above around, making it an empirical measure of how thick a distribution's tail is. If you draw four samples from a random variable and sort them, the obesity index is the probability that the sum of the max and min, _X__{1} + _X__{4}, is greater than the sum of the middle samples, _X__{2} + _X__{3}.

The obesity index could be defined for any distribution, but it only measures what the name implies for right-tailed distributions. For any symmetric distribution, the obesity index is 1/2. A Cauchy distribution is heavy-tailed, but it has two equally heavy tails, and so its obesity index is the same as the normal distribution, which has two light tails.

Note that location and scale parameters have no effect on the obesity index; shifting and scaling affect all the _X_ values the same, so they don't change the probability that _X__{1} + _X__{4} is greater than _X__{2} + _X__{3}.

To get an idea of the obesity index in action, we'll look at the normal, exponential, and Cauchy distributions, since these are the canonical examples of thin, medium, and thick tailed distributions. But for reasons explained above, we'll actually look at the _folded_ normal and _folded_ Cauchy distributions, i.e. we'll take their absolute values to create right-tailed distributions.

To calculate the obesity index exactly you'd need to do analytical calculations with order statistics. We'll simulate the obesity index because that's easier. It's also more in the spirit of calculating the obesity index from data.
from scipy.stats import norm, expon, cauchy

def simulate_obesity(dist, N):
    data = abs(dist.rvs(size=(N, 4)))
    count = 0
    for row in range(N):
        X = sorted(data[row])
        if X[0] + X[3] > X[1] + X[2]:
            count += 1
    return count/N

for dist in [norm, expon, cauchy]:
    print( simulate_obesity(dist, 10000) )

When I ran the Python code above, I got

0.6692
0.7519
0.8396

This ranks the three distributions in the anticipated order of tail thickness.

Note that the code above takes the absolute value of the random samples. This lets us pass in ordinary (unfolded) versions of the normal and Cauchy distributions, and it's redundant for any distribution like the exponential that's already positive-valued.

[I found out after writing this blog post that SciPy now has `foldnorm` and `foldcauchy`, but they don't seem to work like I expect.]
Let's try it on a few more distributions. Lognormal is between exponential and Cauchy in thickness. A Pareto distribution with parameter _b_ goes to zero like _x_^{-1-_b_}, and so we expect a Pareto distribution to have a smaller obesity index than Cauchy when _b_ is greater than 1, and a larger index when _b_ is less than one. Once again the simulation results are what we'd expect. The code

from scipy.stats import lognorm, pareto

for dist in [lognorm(1), pareto(2), pareto(0.5)]:
    print( simulate_obesity(dist, 10000) )

returns

0.7766
0.8242
0.9249

(SciPy's `lognorm` requires a shape parameter; `lognorm(1)` is the standard lognormal.)

By this measure, lognormal is just a little heavier than exponential. Pareto(2) comes in lighter than Cauchy, but not by much, and Pareto(0.5) comes in heavier.

Since the obesity index is a probability, it will always return a value between 0 and 1. Maybe it would be easier to interpret if we did something like take the logit transform of the index to spread the values out more. Then the distinctions between Pareto distributions of different orders, for example, might match intuition better.

[1] Roger M. Cooke et al. Fat-Tailed Distributions: Data, Diagnostics and Dependence. Wiley, 2014.
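As an aside, the single big jump principle that motivated the index can itself be checked by simulation. Here's a sketch; the Pareto(0.5) distribution and the threshold are my own arbitrary choices.

```python
import numpy as np
from scipy.stats import pareto

# Estimate P(X1+X2+X3+X4 > x) and P(max(X1,...,X4) > x) for a
# heavy-tailed distribution. The single big jump principle says
# these probabilities are asymptotically equal as x grows.
x = 1e6                    # an arbitrary large threshold
data = pareto(0.5).rvs(size=(100_000, 4), random_state=1)

p_sum = np.mean(data.sum(axis=1) > x)
p_max = np.mean(data.max(axis=1) > x)
print(p_sum, p_max)        # the two estimates are nearly equal
```

The sum of four positive samples always exceeds their max, so `p_sum` is at least `p_max`; the point is how little the other three samples add.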
The Duffing equation is an ordinary differential equation describing a nonlinear damped driven oscillator.
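With the damping and forcing values used in the code later in this post, the equation can be written

```latex
\ddot{x} + 0.05\,\dot{x} + x + \mu x^3 = 7.5 \cos t
```

where μ is the coefficient of the cubic term.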
If the parameter μ were zero, this would be a damped driven linear oscillator. It's the _x_³ term that makes the equation nonlinear and interesting.
Using an analog computer in 1961, Yoshisuke Ueda discovered that this system was chaotic. It was the first discovery of chaos in a second-order non-autonomous periodic system. Lorenz discovered his famous Lorenz attractor around the same time, though his system is third order.
Ueda's system has been called "Ueda's strange attractor" or the "Japanese attractor."
In the linear case, when μ = 0, the phase portrait simply spirals outward from the origin towards its steady state.
But things get much more interesting when μ = 1. Here's Mathematica code to numerically solve the differential equation and plot the phase portrait.
duff[t_] = NDSolve[
    {x''[t] + 0.05 x'[t] + x[t] + x[t]^3 == 7.5 Cos[t],
     x[0] == 0, x'[0] == 1},
    x[t], {t, 0, 100}][[1, 1, 2]]

ParametricPlot[{duff[t], duff'[t]}, {t, 0, 50}, AspectRatio -> 1]
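For readers without Mathematica, here's a rough Python equivalent as a sketch, using SciPy's `solve_ivp` with the same parameters and initial conditions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# State y = (x, x'); the equation x'' + 0.05 x' + x + x^3 = 7.5 cos t
# rewritten as a first-order system.
def duffing(t, y):
    x, v = y
    return [v, 7.5*np.cos(t) - 0.05*v - x - x**3]

sol = solve_ivp(duffing, (0, 50), [0, 1],
                dense_output=True, rtol=1e-8, atol=1e-8)
t = np.linspace(0, 50, 5000)
x, v = sol.sol(t)

# The phase portrait is the plot of x' against x:
# import matplotlib.pyplot as plt; plt.plot(x, v); plt.show()
```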
Here's the same system integrated out further, to _t_ = 200 rather than 50.
Source: Wanda Szemplińska-Stupnicka. Chaos, Bifurcations and Fractals Around Us.

The first post in this series looked at a possible FORMULA for the shape of an egg, HOW TO FIT the parameters of the formula, and the CURVATURE of the shape at each end of the egg.
The second post looked at the VOLUME. This post looks at the SURFACE AREA.
If you rotate the graph of a function _f_(_x_) around the _x_-axis between _c_ and _d_, the area of the resulting surface is

2π ∫_{c}^{d} _f_(_x_) √(1 + _f_′(_x_)²) _dx_.
This integral cannot be computed in closed form for the function _f_ describing an egg. (At least I can't find a closed form, and neither can Mathematica.) But we can do it with numerical integration.
Here's the Mathematica code.
f[x_, a_, b_, k_] := b Sqrt[(1 - x^2/a^2) / (1 + x k)]

area[a_, b_, k_] := 2 Pi * NIntegrate[
    f[x, a, b, k] Sqrt[1 + D[f[x, a, b, k], x]^2],
    {x, -a, a}
]
As a sanity check, let's verify that if our egg were spherical we would get back the area of that sphere.

`area[3, 3, 0]` returns 113.097 and `N[36 Pi]` also returns 113.097, so that's a good sign.
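Here's the same sanity check in Python, as an independent cross-check. This is my own sketch, not code from the post; it writes the integrand in terms of _g_ = _f_², which is smooth at _x_ = ±_a_, to avoid the infinite derivative of _f_ at the endpoints.

```python
import numpy as np
from scipy.integrate import quad

# Egg profile squared: g = f^2 = b^2 (1 - x^2/a^2) / (1 + k x).
# Since f sqrt(1 + f'^2) = sqrt(f^2 + (f f')^2) = sqrt(g + (g'/2)^2),
# the integrand below is smooth on all of [-a, a].
def g(x, a, b, k):
    return b**2 * (1 - x**2 / a**2) / (1 + k * x)

def area(a, b, k, h=1e-6):
    def integrand(x):
        gp = (g(x + h, a, b, k) - g(x - h, a, b, k)) / (2 * h)  # g'(x)
        return np.sqrt(g(x, a, b, k) + (gp / 2)**2)
    return 2 * np.pi * quad(integrand, -a, a)[0]

print(area(3, 3, 0))  # about 113.097, i.e. 36 pi
```

For a = 4, b = 2, k = 0 this gives about 85.9, matching where the _y_-axis of the plot below starts.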
Now let's plot the surface area as a function of the parameter _k_.
Plot[area[4, 2, k], {k, -0.2, 0.2}]
The _y_-axis starts at 85.9, so the plot above exaggerates the effect of _k_. Here's another plot with the _y_-axis starting at zero.
Plot[area[4, 2, k], {k, -0.2, 0.2}, PlotRange -> {0, 100}]
As with volume, the difference between an egg and an ellipsoid is approximately a quadratic function of the parameter _k_.