
- Runge phenomena
- Twenty questions and conditional probability
- Handedness, introversion, height, blood type, and PII
- Pareto distribution and Benford’s law
- Random number generation posts
- Quantifying information gain in beta-binomial Bayesian model
- Big aggregate queries can still violate privacy
- Visualizing complex functions
- Why is Kullback-Leibler divergence not a distance?
- Wheels about to be reinvented
- Chebyshev interpolation
- Fourier-Bessel series and Gibbs phenomena
- Animated exponential sum
- Database anonymization for testing
- Recent exponential sums
- Yogi Berra meets Pafnuty Chebyshev
- The most disliked programming language
- Poisson distribution and prime numbers
- US Counties and power laws
- Paul Klee meets Perry the Platypus
- Differential equations and recurrence relations
- Finding numbers in pi
- Empirically testing the Chowla conjecture
- Time series analysis vs DSP terminology
- How to eliminate the first order term from a second order ODE
- Common words that have a technical meaning in math
- A uniformly distributed sequence
- Applying probability to non-random things
- Misplacing a continent
- Quaint supercomputers

I've mentioned the Runge phenomenon in a couple posts before. Here I'm going to go into a little more detail.
First of all, the "Runge" here is Carl David Tolmé Runge, better known for the Runge-Kutta algorithm for numerically solving differential equations. His name rhymes with _cowabunga_, not with _sponge_.
Runge showed that polynomial interpolation at evenly-spaced points can fail spectacularly to converge. His example is the function _f_(_x_) = 1/(1 + _x_²) on the interval [-5, 5], or equivalently, and more convenient here, the function _f_(_x_) = 1/(1 + 25_x_²) on the interval [-1, 1]. Here's an example with 16 interpolation nodes.
Runge found that in order for interpolation at evenly spaced nodes in [-1, 1] to converge, the function being interpolated needs to be analytic inside a football-shaped [1] region of the complex plane with major axis [-1, 1] on the real axis and minor axis approximately [-0.5255, 0.5255] on the imaginary axis. For more details, see [2].
The function in Runge's example has a singularity at 0.2_i_, which is inside the football. Interpolation at evenly spaced points in [-1, 1] would converge for the function _f_(_x_) = 1/(1 + _x_²) since the singularity at _i_ is outside the football.
For another example, consider the function _f_(_x_) = exp(- 1/_x_²) , defined to be 0 at 0. This function is infinitely differentiable but it is not analytic at the origin. With only 16 interpolation points as above, there's a small indication of trouble at the ends.
With 28 interpolation points in the plot below, the lack of convergence is clear.
The problem is not polynomial interpolation _per se_ but polynomial interpolation at evenly-spaced nodes. Interpolation at Chebyshev points converges for the examples here. The location of singularities affects the _rate_ of convergence but not whether the interpolants converge.
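Here's a short numerical sketch of the contrast. The node count and the dense error grid are my choices for illustration, not from the post.

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

def runge(x):
    return 1 / (1 + 25 * x**2)

n = 16
xs = np.linspace(-1, 1, 1001)  # dense grid for measuring the error

# Interpolate at evenly spaced nodes: large error near the endpoints
even = np.linspace(-1, 1, n)
err_even = np.max(np.abs(BarycentricInterpolator(even, runge(even))(xs) - runge(xs)))

# Interpolate at Chebyshev nodes: the error stays modest and shrinks as n grows
cheb = np.cos(np.pi * (2 * np.arange(1, n + 1) - 1) / (2 * n))
err_cheb = np.max(np.abs(BarycentricInterpolator(cheb, runge(cheb))(xs) - runge(xs)))

print(err_even, err_cheb)
```

The maximum error over the interval is orders of magnitude smaller at Chebyshev nodes, even though both interpolants use the same number of points.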
RELATED: Help with interpolation
***
[1] American football, that is. The region is like an ellipse but pointy at -1 and 1.
[2] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

The previous post compared bits of information to answers in a game of Twenty Questions.
The optimal strategy for playing Twenty Questions is for each question to split the remaining possibilities in half. There are a couple of ways to justify this strategy: minmax and average.
The MINMAX approach is to minimize the worst thing that could happen. The worst thing that could happen after asking a question is for your subject to be in the larger half of possibilities. For example, asking whether someone is left-handed is not a good strategy: the worst case scenario is "no," in which case you've only narrowed your possibilities by 10%. The best minmax approach is to split the population in half with each question.
The best AVERAGE approach is also to split the possibilities in half each time. With the handedness example, you learn more if the answer is "yes," but there's only a 10% chance that the answer is yes. There's a 10% chance of gaining 3.322 bits of information, but a 90% chance of only gaining 0.152 bits. So the expected number of bits, the entropy, is 0.469 bits. ENTROPY IS MAXIMIZED WHEN ALL OUTCOMES ARE EQUALLY LIKELY. That means you can't learn more than 1 bit of information on average from a yes/no question, and you learn the most when both possibilities are equally likely.
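Here's a quick check of those numbers, using only the standard library. The 3.322 and 0.152 figures above are -log₂(0.1) and -log₂(0.9) respectively.

```python
from math import log2

def entropy(p):
    """Expected bits from a yes/no question whose answer is yes with probability p."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(entropy(0.1))  # the handedness question: about 0.469 bits
print(entropy(0.5))  # an even split: exactly 1 bit, the maximum
```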
Now suppose you want to ask about height and sex. As in the previous post, we assume men and women's heights are normally distributed with means 70 and 64 inches respectively, and both have standard deviation 3 inches.
If you ask whether a person is taller than 67 inches, you split a mixed population of men and women in half. You will learn 1 bit of information from this question, but you've put yourself in a suboptimal position for the next question. A height of at least 67 inches selects half of the adult population in general, but a majority of men and a minority of women. And as we discussed above, uneven splits are suboptimal, in the worst case and on average.
If you're going to ask about height and sex, ask about sex first. If the person is female, ask next whether her height is above 64 inches. But if the person is male, ask whether his height is above 70 inches. That is, you want to split the population evenly at each step CONDITIONING ON YOUR PREVIOUS ANSWER. A cutoff of 67 inches is optimal unconditionally, but suboptimal if you condition on sex.
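Here's a sketch of the arithmetic under the distributional assumptions above. SciPy's `sf` method is the survival function, P(_X_ > _x_).

```python
from scipy.stats import norm

men, women = norm(70, 3), norm(64, 3)

# A 67" cutoff splits a 50/50 mixture of men and women in half...
p_mix = 0.5 * men.sf(67) + 0.5 * women.sf(67)
print(p_mix)  # 0.5

# ...but it splits each sex unevenly
print(men.sf(67), women.sf(67))  # about 0.84 and 0.16

# Conditional on sex, the even split is at that sex's mean height
print(men.sf(70), women.sf(64))  # both 0.5
```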
The optimal strategy for Twenty Questions is to ask a question with probability of 1/2 being true, _conditional on all previous data_. You might get lucky with uneven conditional splits, but on average, and in the worst case, you won't do as well.

I've had data privacy on my mind a lot lately because I've been doing some consulting projects in that arena.
When I saw a tweet from Tim Hopper a little while ago, my first thought was "How many bits of PII is that?". [1]

π Things Only Left Handed Introverts Over 6′ 5″ with O+ Blood Type Will Appreciate — Tim Hopper (@tdhopper) November 16, 2014

Let's see. There's some small correlation between these characteristics, but let's say they're independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left-handed. But we'll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness carries -log_{2}(0.1) = 3.3 bits of information.

I don't know how many people identify as introverts. I believe I'm a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick "introvert" or "extrovert," maybe half each. So we've got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the rarest, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we're headed that way, but as noted above, this is just a quick and dirty calculation. Heights in general don't follow a normal distribution, but heights for men and women separately do. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.001 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.0005, corresponding to about 11 bits of information.

All told, there are 16.7 bits of information in the tweet above, as much information as you'd get after 16 or 17 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered affirmatively.

***

[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″ tall. I can think of at least one woman I know who may be that tall. So the probability shouldn't be less than 1 in 7 billion. But the normal approximation gives a probability of 8.8 × 10^{-15}. This is an example of where the normal distribution assumption breaks down in the extremes.
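The tally above is easy to reproduce. This sketch just plugs in the probabilities estimated in the post:

```python
from math import log2

def bits(p):
    """Information, in bits, in learning that an event of probability p occurred."""
    return -log2(p)

# Probabilities as estimated above
traits = {
    "left-handed": 0.10,
    "introvert": 0.50,
    "O+ blood type": 0.37,
    "over 6'5\"": 0.0005,
}

for trait, p in traits.items():
    print(trait, round(bits(p), 1))

print(round(sum(bits(p) for p in traits.values()), 1))  # 16.7 bits
```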
The Pareto probability distribution has density

_f_(_x_) = _a_ / _x_^{_a_+1}

for _x_ ≥ 1 where _a_ > 0 is a shape parameter. The Pareto distribution and the Pareto principle (i.e. "80-20" rule) are named after the same person, the Italian economist Vilfredo Pareto.
Samples from a Pareto distribution obey Benford's law in the limit as the parameter _a_ goes to zero. That is, the smaller the parameter _a_, the more closely the distribution of the first digits of the samples come to following the distribution known as Benford's law.
Here's an illustration of this comparing the distribution of 1,000 random samples from a Pareto distribution with shape _a_ = 1 and shape _a_ = 0.2 with the counts expected under Benford's law.
Note that this has nothing to do with base 10 per se. If we look at the leading digits as expressed in any other base, such as base 16 below, we see the same pattern.
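Here's a sketch of the kind of simulation behind such an illustration. The sample size and seed are my choices, and note that NumPy's `pareto` sampler draws from the Lomax (shifted) form, so adding 1 gives the distribution above with support _x_ ≥ 1.

```python
import numpy as np

rng = np.random.default_rng(20171104)

def first_digit(x):
    """Leading base-10 digit of a positive number."""
    return int(x / 10**np.floor(np.log10(x)))

def digit_freqs(samples):
    digits = [first_digit(x) for x in samples]
    return np.array([digits.count(d) for d in range(1, 10)]) / len(digits)

benford = np.log10(1 + 1 / np.arange(1, 10))

# The smaller the shape parameter, the closer the fit to Benford's law
deviation = {}
for a in [1.0, 0.2]:
    samples = 1 + rng.pareto(a, 10_000)
    deviation[a] = np.max(np.abs(digit_freqs(samples) - benford))
    print(a, deviation[a])
```

The maximum deviation from the Benford frequencies is much smaller for _a_ = 0.2 than for _a_ = 1.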
MORE POSTS ON BENFORD'S LAW
* Weibull distribution and Benford's law
* Benford's law, chi-square, and factorials
* Benford's law and SciPy constants

MORE POSTS ON PARETO

* Why we don't follow the 80-20 rule
* The Lindy Effect
Random number generation is typically a two step process: first generate a uniformly distributed value, then transform that value to have the desired distribution. The former is the hard part, but also the part more likely to have been done for you in a library. The latter is relatively easy in principle, though some distributions are hard to (efficiently) sample from.
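As a sketch of the transformation step, here's inverse transform sampling, which turns a uniform sample into a sample from any distribution with an invertible CDF. The exponential distribution is the classic example.

```python
import random
from math import log

def sample_exponential(rate):
    """Transform a uniform sample into an exponential one via the inverse CDF.
    CDF: F(x) = 1 - exp(-rate*x), so F^{-1}(u) = -log(1 - u)/rate."""
    u = random.random()          # step 1: uniform value (the hard part, done by the library)
    return -log(1 - u) / rate    # step 2: transform to the desired distribution

random.seed(2017)
samples = [sample_exponential(2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to the mean, 1/rate = 0.5
```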
Here are some posts on testing a uniform RNG.
* Testing RNGs with PractRand
* Testing PCG
* Simple RNG and NIST
Here's a book chapter I wrote on testing the transformation of a uniform RNG into some other distribution.
* Testing a random number generator
A few posts on manipulating a random number generator.
* Manipulating a random number generator
* Reverse engineering the seed of an LCG
* Predicting when an RNG will output a given value
And finally, a post on a cryptographically secure random number generator.
* Blum Blum Shub

The beta-binomial model is the "hello world" example of Bayesian statistics. I would call it a toy model, except it is actually useful. It's not nearly as complicated as most models used in application, but it illustrates the basics of Bayesian inference. Because it's a conjugate model, the calculations work out trivially.
For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.
I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.
Here's a little Python code for computing this. Enter the _a_ and _b_ parameters of the prior and the posterior to compute how much information was gained.
```python
from scipy.integrate import quad
from scipy.stats import beta
from numpy import log2

def infogain(post_a, post_b, prior_a, prior_b):
    p = beta(post_a, post_b).pdf
    q = beta(prior_a, prior_b).pdf
    (info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
    return info
```
This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine `quad` needs some help when the beta distributions become more concentrated.
You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so you're expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.
```python
print( infogain(4, 7, 3, 7) )
print( infogain(3, 8, 3, 7) )
```
The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.
Suppose you want to prevent your data science team from being able to find out information on individual customers, but you do want them to be able to get overall statistics. So you implement two policies.
* Data scientists can only query aggregate statistics, such as counts and averages.
* These aggregate statistics must be based on results that return at least 1,000 database rows.
This sounds good, but it's naive. It's not enough to protect customer privacy.
Someone wants to know how much money customer 123456789 makes. If he asks for this person's income, the query would be blocked by the first rule. If he asks for the _average_ income of customers with ID 123456789 then he gets past the first rule but not the second.
He decides to test whether the customer in question makes a six-figure income. So first he queries for the number of customers with income over $100,000. This is a count, so it gets past the first rule. The result turns out to be 14,254, so it gets past the second rule as well. Now he asks how many customers with ID _not equal to_ 123456789 have income over $100,000. This is a valid query as well, and returns 14,253. So by executing only queries that return aggregate statistics on thousands of rows, he found out that customer 123456789 has at least a six-figure income.
Now he goes back and asks for the average income of customers with income over $100,000. Then he asks for the average income of customers with income over $100,000 and with ID not equal to 123456789. With a little algebra, he's able to find customer 123456789's exact income.
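The algebra in that last step is just the difference of the two group totals. Here's a sketch with made-up query results: the counts are from the example above, but the averages are hypothetical numbers chosen for illustration.

```python
# Aggregate query results; each is based on thousands of rows.
# The averages here are made up for illustration.
n_all, avg_all = 14_254, 153_200.00    # customers with income > $100k
n_rest, avg_rest = 14_253, 153_195.00  # same query, but with ID != 123456789

# Subtracting the two group totals isolates one customer's income
target_income = n_all * avg_all - n_rest * avg_rest
print(target_income)
```

Two innocuous-looking averages over 14,000+ rows pin down a single customer's income exactly.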
You might object that it's cheating to have a clause such as "ID not equal 123456789" in a query. Of course it's cheating. It clearly violates the spirit of the law, but not the letter. You might try to patch the rules by saying you cannot ask questions about a small set, nor about the complement of a small set. [1]
That doesn't work either. Someone could run queries on customers with ID less than or equal to 123456789 and on customers with ID greater than or equal to 123456789. Both these sets and their complements may be large but they let you find out information on an individual.
You may be asking why you would let a data scientist have access to customer IDs at all. Obviously you wouldn't do that if you wanted to protect customer privacy. The point of this example is that limiting queries to aggregates of large sets is not enough. You can find out information on small sets from information on large sets. This could still be a problem even with obvious identifiers removed.
RELATED:
* Privacy consulting: Anonymized data, statistical databases
* HIPAA deidentification
***
[1] Readers familiar with measure theory might sense a σ-algebra lurking in the background.

It's easy to visualize a function from two real variables to one real variable: use the function value as the height of a surface over its input value. But what if you have one more dimension in the output? A complex function of a complex variable is equivalent to a function from two real variables to two real variables. You can use one of the output variables as height, but what do you do about the other one?
ANNOTATING PHASE ON 3D PLOTS
You can start by expressing the function output _f_(_z_) in polar form
_f_(_z_) = _r_ _e_^{_i_θ}

Then you could plot the magnitude _r_ as height and write the phase angle θ on the graph. Here's a famous example of that approach from the cover of Abramowitz and Stegun. You can find more on that book and the function it plots here. This approach makes it easy to visualize the magnitude, but the phase is textual rather than visual. A more visual approach is to use _color_ to represent the phase, such as hue in HSV scale.

HUE-ONLY PHASE PLOTS

You can combine color with height, but sometimes you'll just use color alone. That is the approach taken in the book Visual Complex Functions. This is more useful than it may seem at first because the magnitude and phase are tightly coupled for an analytic function. The phase alone uniquely determines the function, up to a constant multiple.

The image I posted the other day, the one I thought looked like Paul Klee meets Perry the Platypus, is an example of a phase-only plot. If I'd added more contour lines the plot would be more informative, but it would look less like a Paul Klee painting and less like a cartoon platypus, so I stuck with the defaults.

Mathematica has dozens of color schemes for phase plots. I stumbled on "StarryNightColors" and liked it. I imagine I wouldn't have seen the connection to Perry the Platypus in a different color scheme.

USING HUE AND BRIGHTNESS

You can visualize magnitude as well as phase if you add another dimension to color. That's what Daniel Velleman did in a paper that I read recently [1]. He uses hue to represent angle and brightness to represent magnitude. I liked his approach partly on aesthetic grounds. Phase plots using hue only tend to look psychedelic and garish. The varying brightness makes the plots more attractive.

I'll give some examples then include Velleman's code. First, here's a plot of Jacobi's sn function [2], an elliptic function analogous to sine. I picked it because it has a zero and a pole. Zeros show up as black and poles as white.

(The function is elliptic, so it's periodic horizontally and vertically. Functions like sine are periodic horizontally but not vertically.) You can see the poles on the left and right and the zero in the middle. Note that the hues swirl in opposite directions around zeros and poles: ROYGBIV counterclockwise around a zero and clockwise around a pole.

Next, here's a plot of the 5th Chebyshev polynomial. I chose this because I've been thinking about Chebyshev polynomials lately, and because polynomials make plots that fade to white in every direction. (That is, |_p_(_z_)| → ∞ as |_z_| → ∞ for all polynomials.)

Finally, here's a plot of the gamma function. I included this example because you can see the poles as little white dots on the left, and the plot has a nice dark-to-light overall pattern.

MATHEMATICA CODE

Here's the Mathematica code from Velleman's paper. Note that in the HSV scale, he uses brightness to change both the saturation (S) and value (V). Unfortunately the function _f_ being plotted is a global rather than being passed in as an argument to the plotting function.

```mathematica
brightness[x_] := If[x <= 1, Sqrt[x]/2, 1 - 8/((x + 3)^2)]

ComplexColor[z_] := If[z == 0, Hue[0, 1, 0],
  Hue[Arg[z]/(2 Pi), (1 - (b = brightness[Abs[z]])^4)^0.25, b]]

ComplexPlot[xmin_, ymin_, xmax_, ymax_, xsteps_, ysteps_] := Block[
  {deltax = N[(xmax - xmin)/xsteps], deltay = N[(ymax - ymin)/ysteps]},
  Graphics[
    Raster[
      Table[
        f[x + I y],
        {y, ymin + deltay/2, ymax, deltay},
        {x, xmin + deltax/2, xmax, deltax}],
      {{xmin, ymin}, {xmax, ymax}},
      ColorFunction -> ComplexColor
    ]
  ]
]
```

RELATED POSTS

* Fractal-like phase plots
* Applied complex analysis

FOOTNOTES

[1] Daniel J. Velleman. The Fundamental Theorem of Algebra: A Visual Approach. Mathematical Intelligencer, Volume 37, Number 4, 2015.

[2] I used f[z] = 10 JacobiSN[I z, 1.5]. I multiplied the argument by _i_ because I wanted to rotate the picture 90 degrees. And I multiplied the output by 10 to get a less saturated image.
The Kullback-Leibler divergence between two probability distributions is a measure of how different the two distributions are. It is sometimes called a _distance_, but it's not a distance in the usual sense because it's not symmetric. At first this asymmetry may seem like a bug, but it's a feature. We'll explain why it's useful to measure the difference between two probability distributions in an asymmetric way.
The Kullback-Leibler divergence between two random variables _X_ and _Y_ is defined as

KL(_X_ || _Y_) = ∫ _f__{_X_}(_x_) log( _f__{_X_}(_x_) / _f__{_Y_}(_x_) ) _dx_

where _f__{_X_} and _f__{_Y_} are the density functions of _X_ and _Y_ respectively.
This is pronounced/interpreted several ways:
* The divergence from _Y_ to _X_
* The relative entropy of _X_ with respect to _Y_
* How well _Y_ approximates _X_
* The information gain going from the prior _Y_ to the posterior _X_
* The average surprise in seeing _Y_ when you expected _X_
A theorem of Gibbs proves that K-L divergence is non-negative. It's clearly zero if _X_ and _Y_ have the same distribution.
The K-L divergence of two random variables is an expected value, and so it matters which distribution you're taking the expectation with respect to. That's why it's asymmetric.
As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.
The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.
Here's some Python code to compute the divergences.
```python
from scipy.integrate import quad
from scipy.stats import expon, gamma
from numpy import inf

def KL(X, Y):
    f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
    return quad(f, 0, inf)

e = expon
g = gamma(a = 2)

print( KL(e, g) )
print( KL(g, e) )
```
This returns
```
(0.5772156649008394, 1.3799968612282498e-08)
(0.4227843350984687, 2.7366807708872898e-09)
```
The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it's more surprising to see a gamma when expecting an exponential than vice versa.
Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose _X_ and _Y_ are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the `KL` function to integrate over the whole real line
```python
def KL(X, Y):
    f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
    return quad(f, -inf, inf)
```
and trying an example.
```python
from scipy.stats import norm

n1 = norm(1, 1)
n2 = norm(2, 1)

print( KL(n1, n2) )
print( KL(n2, n1) )
```
This returns
```
(0.4999999999999981, 1.2012834963423225e-08)
(0.5000000000000001, 8.106890774205374e-09)
```
and so both integrals are equal to within the error in the numerical integration.
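The value 1/2 matches the closed form for this special case. For normals with means μ₁, μ₂ and a common variance σ², the K-L divergence in either direction works out to (μ₁ - μ₂)²/2σ², which is symmetric in the two means:

```python
def kl_normal_common_variance(mu1, mu2, sigma):
    """K-L divergence (in nats) between normals with equal variance; symmetric in the means."""
    return (mu1 - mu2)**2 / (2 * sigma**2)

print(kl_normal_common_variance(1, 2, 1))  # 0.5, matching the integrals above
print(kl_normal_common_variance(2, 1, 1))  # the same in the other direction
```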
As companies get into data analysis for the first time, many of them are going to start by making the same mistakes that were common a century ago, then gradually recapitulate the development of modern statistics.

Fitting a polynomial to a function at more points might not produce a better approximation. This is Faber's theorem, something I wrote about the other day.
If the function you're interpolating is smooth, then interpolating at more points may or may not improve the fit of the interpolation, depending on where you put the points. The famous example of Runge shows that interpolating
_f_(_x_) = 1 / (1 + _x_²)
at more points can make the fit worse. When interpolating at 16 evenly spaced points, the behavior is wild at the ends of the interval.
Here's the Python code that produced the plot.
```python
import matplotlib.pyplot as plt
from scipy import interpolate
from numpy import linspace

def cauchy(x):
    return (1 + x**2)**-1

n = 16
x = linspace(-5, 5, n) # points to interpolate at
y = cauchy(x)

f = interpolate.BarycentricInterpolator(x, y)

xnew = linspace(-5, 5, 200) # points to plot at
ynew = f(xnew)

plt.plot(x, y, 'o', xnew, ynew, '-')
plt.show()
```
However, for smooth functions interpolating at more points _does_ improve the fit if you interpolate at the roots of Chebyshev polynomials. As you interpolate at the roots of higher and higher degree Chebyshev polynomials, the interpolants converge to the function being interpolated. The plot below shows how interpolating at the roots of _T__{16}, the 16th Chebyshev polynomial, eliminates the bad behavior at the ends.

To make this plot, we replaced `x` above with the roots of _T__{16}, rescaled from the interval [-1, 1] to the interval [-5, 5] to match the example above.

```python
from numpy import cos, pi, array

x = [cos(pi*(2*k-1)/(2*n)) for k in range(1, n+1)]
x = 5*array(x)
```

What if the function we're interpolating isn't smooth? If the function has a step discontinuity, we can see Gibbs phenomena, similar to what we saw in the previous post. Here's the result of interpolating the indicator function of the interval [-1, 1] at 100 Chebyshev points. We get the same "bat ears" as before.

RELATED: Help with interpolation
Fourier-Bessel series are analogous to Fourier series. And like Fourier series, they converge pointwise near a discontinuity with the same kind of overshoot and undershoot known as the Gibbs phenomenon.
FOURIER-BESSEL SERIES
Bessel functions come up naturally when working in polar coordinates, just as sines and cosines come up naturally when working in rectangular coordinates. You can think of Bessel functions as a sort of variation on sine waves. Or even more accurately, a variation on sinc functions, where sinc(_z_) = sin(_z_)/_z_. [1]
A Fourier series represents a function as a sum of sines and cosines of different frequencies. To make things a little simpler here, I'll only consider Fourier sine series so I don't have to repeatedly say "and cosine."
A Fourier-Bessel series does something similar. It represents a function as a sum of rescaled versions of a particular Bessel function. We'll use the Bessel function _J__{0} here, but you could pick some other _J__{ν}.

Fourier series scale the sine and cosine functions by π times integers, i.e. sin(π_z_), sin(2π_z_), sin(3π_z_), etc. Fourier-Bessel series scale by the zeros of the Bessel function: _J__{0}(λ_{1}_z_), _J__{0}(λ_{2}_z_), _J__{0}(λ_{3}_z_), etc. where λ_{_n_} are the zeros of _J__{0}. This is analogous to scaling sin(π_z_) by its roots: π, 2π, 3π, etc. So a Fourier-Bessel series for a function _f_ looks like

_f_(_z_) = Σ_{_n_=1}^{∞} _c__{_n_} _J__{0}(λ_{_n_}_z_)

The coefficients _c__{_n_} for Fourier-Bessel series can be computed analogously to Fourier coefficients, but with a couple minor complications. First, the basis functions of a Fourier series are orthogonal over [0, 1] without any explicit weight, i.e. with weight 1. And second, the inner product of a basis function with itself doesn't depend on the frequency. In detail,

∫_{0}^{1} sin(_m_π_z_) sin(_n_π_z_) _dz_ = δ_{_mn_}/2

Here δ_{_mn_} equals 1 if _m_ = _n_ and 0 otherwise.

Fourier-Bessel basis functions are orthogonal with a weight _z_, and the inner product of a basis function with itself depends on the frequency. In detail,

∫_{0}^{1} _J__{0}(λ_{_m_}_z_) _J__{0}(λ_{_n_}_z_) _z_ _dz_ = δ_{_mn_} _J__{1}(λ_{_n_})²/2

So whereas the coefficients for a Fourier sine series are given by

_c__{_n_} = 2 ∫_{0}^{1} _f_(_z_) sin(_n_π_z_) _dz_

the coefficients for a Fourier-Bessel series are given by

_c__{_n_} = (2/_J__{1}(λ_{_n_})²) ∫_{0}^{1} _f_(_z_) _J__{0}(λ_{_n_}_z_) _z_ _dz_

GIBBS PHENOMENON

Fourier and Fourier-Bessel series are examples of orthogonal series, and so by construction they converge in the norm given by their associated inner product. That means that if _S__{_N_} is the _N_th partial sum of a Fourier series,

∫_{0}^{1} (_f_(_z_) - _S__{_N_}(_z_))² _dz_ → 0 as _N_ → ∞

and the analogous statement for a Fourier-Bessel series is

∫_{0}^{1} (_f_(_z_) - _S__{_N_}(_z_))² _z_ _dz_ → 0 as _N_ → ∞

In short, the series converge in a (weighted) _L_² norm. But how do the series converge pointwise? A lot of harmonic analysis is devoted to answering this question, i.e. what conditions on the function _f_ guarantee what kind of behavior of the partial sums of the series expansion.

If we look at the Fourier series for a step function, the partial sums converge pointwise everywhere except at the step discontinuity. But the way they converge is interesting. You get a sort of "bat ear" phenomenon where the partial sums overshoot the step function at the discontinuity. This is called the Gibbs phenomenon after Josiah Willard Gibbs, who observed the effect in 1899. (Henry Wilbraham observed the same thing earlier, but Gibbs didn't know that.)

The Gibbs phenomenon is well known for Fourier series. It's not as well known that the same phenomenon occurs for other orthogonal series, such as Fourier-Bessel series. I'll give an example of Gibbs phenomenon for Fourier-Bessel series taken from [2] and give Python code to visualize it.

We take our function _f_(_z_) to be 1 on [0, 1/2] and 0 on (1/2, 1]. It works out that

_c__{_n_} = _J__{1}(λ_{_n_}/2) / (λ_{_n_} _J__{1}(λ_{_n_})²)

PYTHON CODE AND PLOT

Here's the plot with 100 terms. Notice how the partial sums overshoot the mark to the left of 1/2 and undershoot to the right of 1/2. Here's the same plot with 1,000 terms.

Here's the Python code that produced the plot.

```python
import matplotlib.pyplot as plt
from scipy.special import j0, j1, jn_zeros
from numpy import linspace

N = 100 # number of terms in series

roots = jn_zeros(0, N)
coeff = [j1(r/2) / (r*j1(r)**2) for r in roots]
z = linspace(0, 1, 200)

def partial_sum(z):
    return sum([coeff[i]*j0(roots[i]*z) for i in range(N)])

plt.plot(z, partial_sum(z))
plt.xlabel("z")
plt.ylabel("{}th partial sum".format(N))
plt.show()
```

FOOTNOTES

[1] To be precise, as _z_ goes to infinity the Bessel functions are asymptotically proportional to sin(_z_ - φ)/√_z_ for some phase shift φ.

[2] The Gibbs' phenomenon for Fourier-Bessel Series. Temple H. Fay and P. Kendrik Kloppers. International Journal of Mathematical Education in Science and Technology. 2003. vol. 323, no. 2, 199-217.
I'm experimenting with making animated versions of the kinds of images I wrote about in my previous post. Here's an animated version of the exponential sum of the day for 12/4/17.
Why that date? I wanted to start with something with a fairly small period, and that one looked interesting. I'll have to do something different for the images that have a much longer period.
Image made in collaboration with Go 3D Now.

How do you create a database for testing that is like your production database? It depends on how you want the test database to be "like" the production one.
REPLACING SENSITIVE DATA
Companies often use an old version of their production database for testing. But what if the production database has SENSITIVE INFORMATION that software developers and testers should not have access to?
You can't completely remove customer phone numbers from the database, for example, if your software handles customer phone numbers. You have to replace the sensitive data with modified data. The question becomes how to modify it. Three approaches would be
* Use the original data.
* Generate completely new artificial data.
* Use the real data as a guide to generating new data.
We'll assume the first option is off the table and consider the pros and cons of the other two options.
For example, suppose you collect customer ages. You could replace customer age with a random two-digit number. That's fine as far as making sure that forms can display two-digit numbers. But maybe the age values matter. Maybe you want your fictional customers in the test database to have the same age distribution as your real customers. Or maybe you want your fictional customer ages to be correlated with other attributes so that you don't have 11 year-old retirees or 98 year-old clients who can't legally purchase alcohol.
RANDOM VS REALISTIC
There are pros and cons to having a realistic test database. A database filled with randomly generated data is likely to find MORE BUGS, but a realistic database is likely to find MORE IMPORTANT BUGS.
Randomly generated data may contain combinations that have yet to occur in the production data, combinations that will cause an error when they do come up in production. Maybe you've never sold your product to someone in Louisiana, and there's a latent bug that will show up the first time someone from Louisiana places an order. (For example, Louisiana retains vestiges of French law that make it different from all other states.)
On the other hand, randomly generated data may not find the bugs that affect the most customers. You might want the values in your test database to be distributed similarly to the values in real data so that bugs come up in testing with roughly the same frequency as in production. In that case, you probably want the _joint_ distributions to match and not just the _unconditional_ distributions. If you just match the latter, you could run into oddities such as a large number of teenage retirees as mentioned above.
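Here's a sketch of that difference using made-up data: sampling each column independently preserves the marginal distributions but produces impossible combinations, while resampling whole rows preserves the joint distribution. (Row resampling reuses real values, so on its own it doesn't address privacy; it only illustrates the statistical point.)

```python
import random

random.seed(2017)

# Made-up production rows: (age, employment status)
production = [(17, "student")] * 300 + [(45, "employed")] * 500 + [(70, "retired")] * 200

# Option 1: sample each column independently -- the marginals match,
# but you can get teenage retirees
ages = [row[0] for row in production]
statuses = [row[1] for row in production]
independent = [(random.choice(ages), random.choice(statuses)) for _ in range(1000)]

# Option 2: resample whole rows -- the joint distribution is preserved
rowwise = [random.choice(production) for _ in range(1000)]

print(sum(1 for a, s in independent if a < 20 and s == "retired"))  # oddities appear
print(sum(1 for a, s in rowwise if a < 20 and s == "retired"))      # none
```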
So do you want a random test database or a realistic test database? Maybe both. It depends on your purposes and priorities. You might want to start by testing against a realistic database so that you first find the bugs that are likely to affect the greatest number of customers. Then maybe you switch to a randomized database that is more effective at flushing out problems with edge cases.
HOW TO MAKE A REALISTIC TEST DATABASE
So how would you go about creating a realistic test database that protects customer privacy? The answer depends on several factors. First of all, it depends on what aspects of the real data you want to preserve. Maybe verisimilitude is more important for some fields than others. Once you decide what aspects you want your test database to approximate, how well do you need to approximate them? If you want to do valid statistical analysis on the test database, you may need something sophisticated like DIFFERENTIAL PRIVACY. But if you just want moderately realistic test cases, you can do something much simpler.
Finally, you have to address your PRIVACY-UTILITY TRADE-OFF. What kinds of privacy protection are you ethically and legally obligated to provide? For example, is your data considered PHI under HIPAA regulation? Once your privacy obligations are clear, you look for ways to maximize your utility subject to these privacy constraints.
If you'd like help with this process, let's talk. I can help you determine what your obligations are and how best to meet them while meeting your business objectives.

The exponential sum of the day draws a line between consecutive partial sums of the sum whose terms are

exp( 2π_i_ ( _n_/_m_ + _n_²/_d_ + _n_³/_y_ ) )

where _m_, _d_, and _y_ are the current month, day, and two-digit year. The four most recent images show how different these plots can be.
These images are from 10/30/17, 10/31/17, 11/1/17, and 11/2/17.
Consecutive dates often produce very different images for a couple reasons. First, consecutive integers are relatively prime. From a number theoretic perspective, 30 and 31 are very different, for example. (This touches on the motivation for _p_-adic numbers: put a different metric on integers, one based on their prime factorization.)
The other reason consecutive dates produce qualitatively different images is that you might roll over a month or year, such as going from October (10) to November (11). You'll see a big change when we roll over from 2017 to 2018.
The partial sums are periodic [1] with period lcm(_m_, _d_, _y_). The image for 10/31/17 above has the most points because 10, 31, and 17 are relatively prime. It's also true that 11, 2, and 17 are relatively prime, but these are smaller numbers.
You could think of the month, day, and year components of the sum as three different gears. The sums repeat when all three gears return to their initial positions. In the image for yesterday, 11/1/17, the middle gear is effectively not there.
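A short script can reproduce these images. It assumes the terms of the sum have the form exp(2π_i_(_n_/_m_ + _n_²/_d_ + _n_³/_y_)), which appears to be the form the exponential sum pages use; the date below is one of the examples mentioned.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

m, d, y = 11, 2, 17   # month, day, two-digit year

# One full period of the terms; see the footnote on periodicity.
period = int(np.lcm.reduce([m, d, y]))
n = np.arange(period + 1)
terms = np.exp(2j*np.pi*(n/m + n**2/d + n**3/y))
z = np.cumsum(terms)  # partial sums traced in the complex plane

plt.plot(z.real, z.imag)
plt.axis("equal")
plt.savefig("expsum.png")
```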
[1] The _terms_ of the sums are periodic. The partial sums are periodic if the full sum is zero. So _if_ the partial sums are periodic, the lcm is a period.

I just got an evaluation copy of The Best Writing on Mathematics 2017. My favorite chapter was _Inverse Yogiisms_ by Lloyd N. Trefethen.
Trefethen gives several famous Yogi Berra quotes and concludes that

Yogiisms are statements that, if taken literally, are meaningless or contradictory or nonsensical or tautological—yet nevertheless convey something true.

An inverse yogiism is the opposite,

[a] statement that is literally true, yet conveys something false.

What a great way to frame a chapter! Now that I've heard the phrase, I'm trying to think of inverse yogiisms. Nothing particular has come to mind yet, but I feel like there must be lots of things that fit that description.

Trefethen comes up with three inverse yogiisms, and my favorite is the middle one: Faber's theorem on polynomial interpolation. Faber's theorem is a non-convergence result for interpolants of continuous functions. Trefethen quotes several numerical analysis textbooks that comment on Faber's theorem in a way that implies an overly pessimistic interpretation. Faber's theorem is true for continuous functions in general, but if the function _f_ being interpolated is smooth, or even just Lipschitz continuous, the theorem doesn't hold. In particular, Chebyshev interpolation produces a sequence of polynomials converging to _f_.

A few years ago I wrote a blog post that shows a famous example due to Carl Runge that if you interpolate _f_(_x_) = 1/(1 + _x_²) over [-5, 5] with evenly spaced nodes, the sequence of interpolating polynomials diverges. In other words, adding more interpolation points makes the fit _worse_. Here's the result of fitting a 16th degree polynomial to _f_ at evenly spaced nodes. The error near the ends is terrible, though the fit does improve in the middle.

If instead of using evenly spaced nodes you use the roots of Chebyshev polynomials, the interpolating polynomials do in fact converge, and converge quickly. If the _k_th derivative of _f_ has bounded variation, then the error in interpolating _f_ at _n_ points is _O_(_n_^{-_k_}).
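The contrast between evenly spaced and Chebyshev nodes is easy to check numerically. Here's a sketch using NumPy's polynomial fitting; the node count and the evaluation grid are arbitrary choices. (NumPy may warn about conditioning at this degree, which is harmless here.)

```python
import numpy as np

def interp_error(nodes):
    """Max error interpolating Runge's function at the given nodes,
    measured on a fine grid over [-5, 5]."""
    f = lambda x: 1/(1 + x**2)
    coeffs = np.polyfit(nodes, f(nodes), len(nodes) - 1)
    grid = np.linspace(-5, 5, 1000)
    return np.max(np.abs(np.polyval(coeffs, grid) - f(grid)))

n = 17  # 17 nodes -> 16th degree polynomial
equispaced = np.linspace(-5, 5, n)
chebyshev = 5 * np.cos((2*np.arange(n) + 1) * np.pi / (2*n))

print(interp_error(equispaced))  # large error near the ends
print(interp_error(chebyshev))   # much smaller error
```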
According to this post from Stack Overflow, Perl is the most disliked programming language.
I have fond memories of writing Perl, though it's been a long time since I used it. I mostly wrote scripts for file munging, the task it does best, and never had to maintain someone else's Perl code. Under different circumstances I probably would have had less favorable memories.
Perl is a very large, expressive language. That's a positive if you're working alone but a negative if working with others. Individuals can carve out their favorite subsets of Perl and ignore the rest, but two people may carve out different subsets. You may personally avoid some feature, but you have to learn it anyway if your colleague uses it. Also, in a large language there's greater chance that you'll accidentally use a feature you didn't intend to. For example, in Perl you might use an array in a scalar context. This works, but not as you'd expect if you didn't intend to do it.
I suspect that people who like large languages like C++ and Common Lisp are more inclined to like Perl, while people who prefer small languages like C and Scheme have opposite inclinations.
RELATED POSTS:
* Extreme syntax
* Perl as a better …
* Comparing Perl and Python

Let ω(_n_) be the number of distinct prime factors of _n_. A theorem of Landau says that for large _N_, if you select a positive integer less than _N_ at random, then ω(_n_) - 1 approximately has a Poisson(log log _N_) distribution. This statement becomes exact in the limit as _N_ goes to infinity.
Apparently _N_ has to be extremely large before the result approximately holds. I ran an experiment for _N_ = 10,000,000 and the fit is not great.
Here's the data:
|----+----------+-----------|
|    | Observed | Predicted |
|----+----------+-----------|
|  1 |   665134 |  620420.6 |
|  2 |  2536837 | 1724733.8 |
|  3 |  3642766 | 2397330.6 |
|  4 |  2389433 | 2221480.4 |
|  5 |   691209 | 1543897.1 |
|  6 |    72902 |  858389.0 |
|  7 |     1716 |  397712.0 |
|  8 |        1 |  157945.2 |
|  9 |        0 |   54884.8 |
| 10 |        0 |   16952.9 |
|----+----------+-----------|
And here's a plot:
I found the above theorem in these notes. It's possible I've misunderstood something or there's an error in the notes. I haven't found a more formal reference on the theorem yet.
UPDATE: According to the results in the comments below, it seems that the notes I cited may be wrong. The notes say "distinct prime factors", i.e. ω(_n_), while numerical results suggest they meant to say Ω(_n_), the number of prime factors counted with multiplicity. I verified the numbers given below. Here's a plot.
Here's the Python code I used to generate the table. (To match the code for the revised graph, replace the function `omega`, which calculates ω(_n_), with the SymPy function `primeomega`, which calculates Ω(_n_).)
```python
from sympy import factorint
from numpy import empty, log, log2
from scipy.stats import poisson

N = 10000000

def omega(n):
    return len(factorint(n))

table = empty(N, int)
table[0] = table[1] = 0
for n in range(2, N):
    table[n] = omega(n)

# upper bound on the largest value of omega(n) for n < N
u = int(log2(N))

for n in range(1, u+1):
    print(n, len(table[table==n]), poisson(log(log(N))).pmf(n-1)*N)
```
RELATED POSTS
* Chowla conjecture
* Generating Poisson random values
* Prime factors, phone numbers, and the normal distribution
Yesterday I heard that the county I live in, Harris County, is the 3rd largest in the United States. (In population. It's nowhere near the largest in area.) Somehow I've lived here a couple decades without knowing that. Houston is the 4th largest city in the US, so it's no shock that Harris County is the 3rd largest county, but I hadn't thought about it.
I knew that city populations followed a power law, so I wanted to see if county populations do too. I thought they would, but counties are a little different than cities. For example, cities grow and shrink over time but counties typically do not.
To cut to the chase, county populations do indeed follow a power law. They have the telltale straight line graph on a log-log plot, at least for the largest counties. The line starts to curve down at some point and later drops precipitously. That's typical. When you hear that something "follows a power law" that means it approximately has a power law distribution over some range. Nothing has _exactly_ a power law distribution, or any other ideal distribution for that matter. But some things follow a power law distribution more closely and over a longer range than others.
Even though Los Angeles County (10.1 million) is the largest by far, it doesn't stick out on a log scale. Its population compared to Cook County (5.2 million) and Harris County (4.6 million) is unremarkable for a power law.
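The telltale straight line is easy to see in a rank-size plot. This sketch uses synthetic Pareto samples as a stand-in for the county data (which isn't reproduced here), just to show what the log-log signature looks like.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in data: a Pareto sample shows the telltale
# straight line when size is plotted against rank on log-log axes.
pop = np.sort(1e4 * (1 + rng.pareto(1.0, 3000)))[::-1]
rank = np.arange(1, len(pop) + 1)

plt.loglog(rank, pop)
plt.xlabel("rank")
plt.ylabel("population (synthetic)")
plt.savefig("county_power_law.png")
```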

I was playing around with something in Mathematica and one of the images that came out of it surprised me.
It's a contour plot for the system function of a low pass filter.
```mathematica
H[z_] := 0.05634*(1 + 1/z)*(1 - 1.0166/z + 1/z^2) /
         ((1 - 0.683/z)*(1 - 1.4461/z + 0.7957/z^2))

ContourPlot[
    Arg[H[Exp[I (x + I y)]]],
    {x, -1, 1}, {y, -1, 1},
    ColorFunction -> "StarryNightColors"]
```
It looks sorta like a cross between Paul Klee's painting Senecio
and Perry the Platypus from Phineas and Ferb.

Series solutions to differential equations can be grubby or elegant, depending on your perspective.
POWER SERIES SOLUTIONS
At one level, there's nothing profound going on. To find a series solution to a differential equation, assume the solution has a power series, stick the series into the equation, and solve for the coefficients. The mechanics are tedious, but you just follow your nose.
But why should this work? Why should the solution have a power series? And why should it be possible to find the coefficients? After all, there are infinitely many of them!
As for why the solution should have a power series, sometimes it doesn't. But there is a complete theory to tell you when the solution should have a power series or not. And when it doesn't have a power series, it may have more general series solution.
RECURRENCE RELATIONS
The tedious steps in the middle of the solution process may obscure what's happening between the first and last step; the hard part in the middle grabs our attention. But the big picture is that the series solution process transforms a DIFFERENTIAL EQUATION for a function _y_(_x_) into a RECURRENCE RELATION for the coefficients in the power series for _y_. We turn a continuous problem into a discrete problem, a hard problem into an easy problem.
To look at an example, consider Airy's equation:
_y_" - _xy_ = 0.
I won't go into all the details of the solution process—my claim above is that these details obscure the transformation I want to emphasize—but will skip from the first to the last step.
If you assume _y_ has a power series at 0 with coefficients _a__{_n_}, then we find that the coefficients satisfy the recurrence relation

(_n_ + 2)(_n_ + 1) _a__{_n_+2} = _a__{_n_-1}

for _n_ ≥ 1. Initial conditions determine _a__{0} and _a__{1}, and it turns out _a__{2} must be 0. The recurrence relation shows how these three coefficients determine all the other coefficients.

HOLONOMIC FUNCTIONS AND SEQUENCES

There's something interesting going on here, a sort of functor mapping differential equations to recurrence relations. If we restrict ourselves to differential equations and recurrence relations with polynomial coefficients, we get into the realm of holonomic functions.

A HOLONOMIC FUNCTION is one that satisfies a linear differential equation with polynomial coefficients. A HOLONOMIC SEQUENCE is a sequence that satisfies a linear recurrence relation with polynomial coefficients. Holonomic functions are the generating functions for holonomic sequences.

Holonomic functions are interesting for a variety of reasons. They're a large enough class of functions to include many functions of interest, like most special functions from analysis and a lot of generating functions from combinatorics. And yet they're a small enough class to have nice properties, such as closure under addition and multiplication.

Holonomic functions are determined by a finite list of numbers, the coefficients of the polynomials in their defining differential equation. This means they're really useful in computer algebra because common operations ultimately map one finite set of numbers to another.
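As a sanity check on the Airy recurrence, one can sum the series numerically and compare with SciPy's Airy function. The initial conditions below pick out the solution Ai; the values of Ai(0) and Ai′(0) are standard.

```python
from math import gamma
from scipy.special import airy

# Coefficients of the power series for Ai at 0 via the recurrence
# (n+2)(n+1) a_{n+2} = a_{n-1}, with a_2 = 0.
a = [0.0] * 30
a[0] = 1 / (3**(2/3) * gamma(2/3))   # Ai(0)
a[1] = -1 / (3**(1/3) * gamma(1/3))  # Ai'(0)
a[2] = 0.0
for n in range(1, 28):
    a[n+2] = a[n-1] / ((n + 2) * (n + 1))

x = 0.5
series = sum(c * x**k for k, c in enumerate(a))

print(series)       # partial sum of the power series at x = 0.5
print(airy(x)[0])   # Ai(0.5) from SciPy, for comparison
```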
You can find any integer you want as a substring of the digits in π. (Probably. See footnote for details.) So you could encode a number by reporting where it appears.
If you want to encode a single digit, the best you can do is break even: it takes at least one digit to specify the location of another digit. For example, 9 is the 5th digit of π, counting digits after the decimal point. But 7 doesn't appear until the 13th digit. In that case it would take 2 digits to encode 1 digit.
What if we work in another base _b_? Nothing changes as long as we also describe positions in base _b_. But there's a _possibility_ of compression if we look for digits of base _b_ but describe their position using base _p_ numbers where _p_ < _b_. For example, 15 is the 2nd base-100 "digit" in π.
BLOCKS OF DIGITS
For the rest of this post we'll describe blocks of _k_ digits by their position as a base 10 number. That is, we'll use _p_ = 10 and _b_ = 10^{_k_} in the notation above.

There are ten 2-digit numbers that can be described by a 1-digit position in π: 14, 15, 32, …, 92. There are 57 more 2-digit numbers that can be described by a 2-digit position. The other 33 2-digit numbers require 3 digits to describe their position. So if we encode a 2-digit number by its position in pairs of digits in π, there's a 1/10 chance that we'll only need 1 digit, i.e. a 10% chance of compression. The chance that we'll break even by needing two digits is 57/100, and the chance that we'll need three digits, i.e. more digits than what we're encoding, is 33/100. On average, we expect to use 2.23 digits to encode 2 digits.

If we look at 3-digit numbers, there are 10 that can be described by a 1-digit position, 83 that require 2 digits, 543 that require 3 digits, and 364 that require 4 digits. So on average we'd need 3.352 digits to encode 3 digits. Our "compression" scheme makes things about 8% larger on average.

With 4-digit numbers, we need up to 5 digits to describe the position of each, and on average we need 4.2611 digits.

With 5-digit numbers, we need up to 7 digits to describe each position. As I write this, my program is still running. The positions of 99,994 five-digit numbers can be described by numbers with at most 6 digits. The remaining 6 need at least 7 digits. Assuming they need exactly seven digits, we need an average of 5.2623 digits to encode 5-digit numbers by their position in π.

COMPRESSION SCHEME EFFICIENCY

If we assume the numbers we wish to encode are uniformly distributed, the representation as a location in π will take more digits than the number itself, on average. But ALL COMPRESSION ALGORITHMS EXPAND THEIR CONTENTS ON AVERAGE if by "average" you mean picking all possible inputs with the same probability. Uniform randomness is not compressible. It takes _n_ bits to describe _n_ random bits.

Compression methods are practical because their inputs are _not_ completely random. For example, English prose compresses nicely because it's not random. If we had some reason to suspect that a number came from a block of digits in π, and one not too far out, then the scheme here could be a form of compression. The possibility of efficient compression comes from an assumption about our input.

EXTENSIONS

You could extend the ideas in this post theoretically or empirically, i.e. you could write a theorem or a program. Suppose you look at blocks of _k_ base-_b_ digits whose position is described as a base _b_ number. The case _b_ = 2 seems particularly interesting. What is the compression efficiency of the method for particular _k_, and what is it in the limit as _k_ goes to infinity? You could look at this empirically for digits of π, or some other number, or you could try to prove it theoretically for digits of a normal number.

FOOTNOTE ON NORMAL NUMBERS

It might not be true that you can find any integer in the digits of π, though we know it's true for numbers of the size we're considering here. You can find any number in the digits of π if π is a "normal number." Roughly speaking, a number is normal if you can find any pattern in its digits with the probability you'd expect. That is, 1/10 of the digits in its decimal expansion will be 7's, for example. Or if you group the digits in pairs, thinking of the number as being in base 100, 1/100 of the pairs will be 42's, etc.

Almost all numbers are normal, so in a sense, the probability that π is normal is 1. Even though almost all numbers are normal, nobody has been able to prove that any familiar number is normal: π, _e_, √2, etc.

It's possible that every integer appears in the decimal expansion of π, but not necessarily with the expected probability. This would be a weaker condition than normality. Maybe this has been proven for π, but I don't know. It would be strange if every number appeared in the digits of π but with some bias.
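The position-finding step is easy to sketch with an arbitrary-precision library. This uses mpmath and counts positions from the first digit after the decimal point, matching the counting convention used earlier in the post.

```python
import mpmath

# Compute digits of pi to high precision; 1000 digits is enough
# for the small examples here.
mpmath.mp.dps = 1010
digits = mpmath.nstr(+mpmath.pi, 1000)[2:]  # drop the leading "3."

def position(n):
    """1-based position of the first occurrence of n's digits in pi,
    counting from the first digit after the decimal point."""
    i = digits.find(str(n))
    return i + 1 if i >= 0 else None

print(position(9))   # 9 is the 5th digit after the decimal point
print(position(7))   # 7 first appears at position 13
```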
Terry Tao's most recent blog post looks at the Chowla conjecture theoretically. This post looks at the same conjecture empirically using Python. (Which is much easier!)
The Liouville function λ(_n_) is (-1)^{Ω(_n_)} where Ω(_n_) is the number of prime factors of _n_ counted _with multiplicity_. So, for example, Ω(9) = 2 because even though 9 has only one distinct prime factor, that factor counts twice.

Given some set of _k_ strides _h__{1}, _h__{2}, …, _h__{_k_}, define

_f_(_n_) = λ(_n_ + _h__{1}) λ(_n_ + _h__{2}) … λ(_n_ + _h__{_k_}).

The Chowla conjecture says that the average of the first _N_ values of _f_(_n_) tends to zero as _N_ goes to infinity. This has been proven for _k_ = 1 but not for larger _k_.

If _f_(_n_) acts like a Bernoulli random variable, our experiments might increase our confidence in the conjecture, but they wouldn't prove anything. Unexpected results wouldn't prove anything either, but a departure from random behavior might help find a counterexample if the conjecture is false.

We're going to be evaluating the Liouville function repeatedly at the same arguments, so it will save some compute time if we tabulate its values. This will also let us use some compact Python notation to average _f_. We'll need to tabulate the function up to _N_ + _h__{_k_}. In the code below, `maxstride` is an upper bound on the strides _h__{_k_} we may look at.

SymPy has a function `primeomega` that calculates Ω(_n_), so we might as well use it. If you wanted to use a very large value of _N_, you might want to fill the array of Liouville function values using a more direct approach that avoids all the factoring implicit in calls to `primeomega`.

```python
from sympy import primeomega
from numpy import empty

N = 10000
maxstride = 100

liouville = empty(N + maxstride)
liouville[0] = 1
for n in range(1, len(liouville)):
    liouville[n] = (-1)**primeomega(n)
```

The following code looks at the Chowla conjecture for _h__{1} = 0 and _h__{2} ranging over 1, 2, …, 99.

```python
average = empty(maxstride-1)
for s in range(1, maxstride):
    average[s-1] = (liouville[0:N] * liouville[s:N+s]).sum()/N
```

If the Liouville function is distributed like a random variable taking on -1 and 1 with equal probability, we'd expect the averages to vary like samples from a normal distribution with mean 0 and variance 1/(2_N_).

```python
print(average.mean())
print(average.std())
print( (2*N)**-0.5 )
```

This returns

```
0.00141
0.00851
0.00707
```

and so the means are indeed near zero, and the standard deviation of the samples is about what we'd expect.

What if we replace Ω(_n_), the number of prime factors _with multiplicity_, with ω(_n_), the number of _distinct_ prime factors? The averages above are still small, but the sample variance is about twice as big as before. I've seen this pattern with several different large values of _N_, so I suspect there's something going on here. (I didn't see a function in SymPy corresponding to ω(_n_), but you can create your own with `len(factorint(n))`.)
Time series analysis and digital signal processing are closely related. Unfortunately, the two fields use different terms to refer to the same things.
Suppose you have a sequence of inputs _x_[_n_] and a sequence of outputs _y_[_n_] for integers _n_.
MOVING AVERAGE / FIR
If each output depends on a linear combination of a finite number of previous _inputs_
_y_[_n_] = _b__{0}_x_[_n_] + _b__{1}_x_[_n_-1] + … + _b__{_q_}_x_[_n_ - _q_]

then time series analysis would call this a MOVING AVERAGE (MA) model of order _q_, provided _b__{0} = 1. Note that this might not really be an average, i.e. the _b_'s are not necessarily positive and don't necessarily sum to 1. Digital signal processing would call this a FINITE IMPULSE RESPONSE (FIR) filter of order _q_.

AUTOREGRESSIVE / IIR

If each output depends on a linear combination of a finite number of previous _outputs_

_y_[_n_] = _a__{1}_y_[_n_-1] + … + _a__{_p_}_y_[_n_ - _p_]

then time series analysis would call this an AUTOREGRESSIVE (AR) model of order _p_. Digital signal processing would call this an INFINITE IMPULSE RESPONSE (IIR) filter of order _p_. Sometimes you'll see the opposite sign convention on the _a_'s.

ARMA / IIR

If each output depends on a linear combination of a finite number of previous inputs _and_ outputs

_y_[_n_] = _b__{0}_x_[_n_] + _b__{1}_x_[_n_-1] + … + _b__{_q_}_x_[_n_ - _q_] + _a__{1}_y_[_n_-1] + … + _a__{_p_}_y_[_n_ - _p_]

then time series analysis would call this an AUTOREGRESSIVE MOVING AVERAGE (ARMA) model of order (_p_, _q_), i.e. _p_ AR terms and _q_ MA terms. Digital signal processing would call this an INFINITE IMPULSE RESPONSE (IIR) filter with _q_ feedforward coefficients and _p_ feedback coefficients. Also, as above, you may see the opposite sign convention on the _a_'s.

ARMA NOTATION

Box and Jenkins use _a_'s for input and _z_'s for output. We'll stick with _x_'s and _y_'s to make the comparison to DSP easier. Using the backward shift operator _B_ that takes a sample at _n_ to the sample at _n_-1, the ARMA system can be written

φ(_B_) _y_[_n_] = θ(_B_) _x_[_n_]

where φ and θ are polynomials

φ(_B_) = 1 - φ_{1}_B_ - φ_{2}_B_² - … - φ_{_p_}_B_^{_p_}

and

θ(_B_) = 1 - θ_{1}_B_ - θ_{2}_B_² - … - θ_{_q_}_B_^{_q_}.

SYSTEM FUNCTION NOTATION

In DSP, filters are described by their SYSTEM FUNCTION, the _z_-transform of the impulse response. In this notation (as in Oppenheim and Schafer, for example) we have

_H_(_z_) = (_b__{0} + _b__{1}_z_^{-1} + … + _b__{_q_}_z_^{-_q_}) / (1 - _a__{1}_z_^{-1} - … - _a__{_p_}_z_^{-_p_}).

The φ_{_k_} in Box and Jenkins correspond to the _a__{_k_} in Oppenheim and Schafer. The θ_{_k_} correspond to the (negative) _b__{_k_}. The system function _H_(_z_) corresponds to θ(1/_z_) / φ(1/_z_).

RELATED: DSP and time series consulting
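The FIR/IIR distinction can be seen concretely with SciPy's `lfilter`, which uses the DSP `b`/`a` convention described above (note that `lfilter` subtracts the `a` terms, one of the two sign conventions mentioned).

```python
import numpy as np
from scipy.signal import lfilter

# scipy.signal.lfilter uses the DSP convention: b holds the feedforward
# (FIR / moving average) coefficients, a the feedback (IIR /
# autoregressive) coefficients, with a[0] = 1.
x = np.array([1.0, 0.0, 0.0, 0.0])   # unit impulse

# FIR / MA: y[n] = 0.5 x[n] + 0.5 x[n-1]
y_fir = lfilter([0.5, 0.5], [1.0], x)
print(y_fir)    # finite impulse response: [0.5, 0.5, 0, 0]

# IIR / AR: y[n] = x[n] + 0.5 y[n-1]
y_iir = lfilter([1.0], [1.0, -0.5], x)
print(y_iir)    # infinite impulse response: [1, 0.5, 0.25, 0.125]
```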
Authors will often say that "without loss of generality" they will assume that a differential equation has no first order derivative term. They'll explain that there's no need to consider

_y_" + _a_(_x_) _y_' + _b_(_x_) _y_ = 0

because a change of variables can turn the above equation into one of the form

_u_" + _ru_ = 0.

While this is true, the change of variables is seldom spelled out. I'll give the change of variables explicitly here in case this is helpful to someone in the future. Define _u_(_x_) and _r_(_x_) by

_u_(_x_) = exp( (1/2) ∫ _a_(_x_) _dx_ ) _y_(_x_)

and

_r_(_x_) = _b_(_x_) - _a_(_x_)²/4 - _a_'(_x_)/2.

With this change of variables

_u_" + _ru_ = 0.
Proof: Calculate _u_" + _ru_ and use the fact that _y_ satisfies the original differential equation. The calculation is tedious but routine.
Example: Suppose we start with

_x y_" + _y_' + _x y_ = 0.

Then dividing by _x_ we get

_y_" + (1/_x_) _y_' + _y_ = 0.

Now applying the change of variables above gives

_u_" + (1 + 1/(4_x_²)) _u_ = 0

and our original _y_ is _u_ / √ _x_.
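The tedious-but-routine calculation can be delegated to SymPy. This is a sketch of the verification for the example above, using the change of variables _u_ = _y_ exp(½∫_a_ _dx_) and _r_ = _b_ - _a_²/4 - _a_'/2.

```python
import sympy as sp

x = sp.symbols('x', positive=True)
a = 1/x          # coefficient of y' after dividing by x
b = 1            # coefficient of y

# r = b - a^2/4 - a'/2; should simplify to 1 + 1/(4x^2)
r = b - a**2/4 - sp.diff(a, x)/2
print(sp.simplify(r))

# Check: if u satisfies u'' + r u = 0, then y = u exp(-(1/2)∫a dx)
# satisfies y'' + a y' + b y = 0.
u = sp.Function('u')
y = u(x) * sp.exp(-sp.integrate(a, x)/2)
expr = sp.diff(y, x, 2) + a*sp.diff(y, x) + b*y
expr = expr.subs(sp.diff(u(x), x, 2), -r*u(x))  # use u'' = -r u
print(sp.simplify(expr))  # 0
```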

Mathematical writing is the opposite of business writing in at least one respect. Math uses common words as technical terms, whereas business coins technical terms to refer to common ideas.
There are a few math terms I use fairly often and implicitly assume readers understand. Perhaps the most surprising is ALMOST as in "almost everywhere." My previous post, for example, talks about something being true for "almost all _x_."
The term "almost" sounds vague but it actually has a precise technical meaning. A statement is true _almost everywhere_, or holds for _almost all_ _x_, if the set of points where it doesn't hold has measure zero.
For example, almost all real numbers are irrational. There are infinitely many rational numbers, and so there are a lot of exceptions to the statement "all real numbers are irrational," but the set of exceptions has measure zero [1].
In common parlance, you might use BALL and SPHERE interchangeably, but in math they're different. In a normed vector space, the set of all points of norm no more than _r_ is the _ball_ of radius _r_. The set of all points with norm exactly _r_ is the _sphere_ of radius _r_. A sphere is the surface of a ball.
The word SMOOTH typically means "infinitely differentiable," or depending on context, differentiable as many times as you need. Often there's no practical loss of generality in assuming something is infinitely differentiable when you only need, say, three derivatives [2]. For example, a manifold whose charts are once differentiable can always be altered slightly to be infinitely differentiable.
The words REGULAR and NORMAL are used throughout mathematics as technical terms, and their meaning changes completely depending on context. For example, in topology _regular_ and _normal_ are two kinds of separation axioms. They tell you whether a topology has enough open sets to separate a point from a closed set or separate two closed sets from each other.
When I use _normal_ I'm most often talking about a normal (i.e. Gaussian) probability distribution. I don't think I use _regular_ as a technical term that often, but when I do it probably means something like _smooth_, but more precise. A regularity result in differential equations, for example, tells you what sort of desirable properties a solution has: whether it's a classical solution or only a weak solution, whether it's continuous or differentiable, etc.
While I'm giving a sort of reader's guide to my terminology, LOG always refers to natural log and TRIG FUNCTIONS are always in radians unless noted otherwise. More on that here.
* * *
The footnotes below are much more technical than the text above.
[1] Here's a proof that any countable set of points has measure zero. Pick any ε > 0. Put an open interval of width ε/2 around the first point, an interval of width ε/4 around the second point, an interval of width ε/8 around the third point, etc. This covers the countable set of points with a cover of measure at most ε, and since ε was arbitrary, the set of points must have measure 0.
The irrational numbers are uncountable, but that's not why they have positive measure. A countable set has measure zero, but a set of measure zero may be uncountable. For example, the Cantor set is uncountable but has measure zero. Or to be more precise, I should say the _standard_ Cantor set has measure zero. There are other Cantor sets, i.e. sets homeomorphic to the standard Cantor set, that have positive measure. This shows that "measure zero" is not a topological property.
[2] I said above that often it doesn't matter how many times you can differentiate a function, but partial differential equations are an exception to that rule. There you'll often care exactly how many (generalized) derivatives a solution has. And you'll obsess over exactly which powers of the function or its derivatives are integrable. The reason is that a large part of the theory revolves around embedding theorems, whether this function space embeds in that function space. The number of derivatives a function has and the precise exponents _p_ for the Lebesgue spaces they live in matter a great deal. Existence and uniqueness of solutions can hang on such fine details.

If you take the fractional parts of the set of numbers {_n_ cos _nx_ : integer _n_ > 0} the result is uniformly distributed for almost all _x_. That is, in the limit, the number of times the sequence visits a subinterval of [0, 1] is proportional to the length of the interval. (Clearly it's not true for _all_ _x_: take _x_ = 0, for instance. Or any rational number times π.)
The proof requires some delicate work with Fourier analysis that I'll not repeat here. If you're interested in the proof, see Theorem 4.4 of Uniform Distribution of Sequences.
This is a surprising result: why should such a strange sequence be uniformly distributed? Let's look at a histogram to see whether the theorem is plausible.
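The histogram is easy to regenerate. A minimal sketch, with an arbitrary choice of _x_ standing in for a "typical" value:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

x = np.sqrt(2)                    # an arbitrary "typical" x
n = np.arange(1, 100001)
frac = (n * np.cos(n * x)) % 1    # fractional parts, in [0, 1)

plt.hist(frac, bins=50, density=True)
plt.savefig("uniform_hist.png")
```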
OK. Looks plausible.
But there's a generalization that's even more surprising. Let _a__{_n_} be _any_ increasing sequence of integers. Then the fractional parts of _a__{_n_} cos _a__{_n_}_x_ are uniformly distributed for almost all _x_.

_Any_ increasing sequence of integers? Like the prime numbers, for example? Apparently so. The result produces a very similar histogram.

But this can't hold for just any sequence. Surely you could pick an integer sequence to thwart the theorem. Pick an _x_, then let _a__{_n_} be the subset of the integers for which the fractional part of _n_ cos _nx_ is less than 0.5. Then _a__{_n_} cos _a__{_n_}_x_ isn't uniformly distributed because its fractional parts never visit the right half of the unit interval!

Where's the contradiction? The theorem doesn't start by saying "For almost all _x_ …." It starts by saying "For any increasing sequence _a__{_n_} …." That is, you don't get to pick _x_ first. You pick the sequence first, then you get a statement that is true for almost all _x_. The theorem is true for every increasing sequence of integers, but the exceptional set of _x_'s may be different for each integer sequence.
Probability has surprising uses, including applications to things that absolutely are not random. I've touched on this a few times. For example, I've commented on how arguments about whether something is really random are often moot: Random is as random does.
This post will take non-random uses for probability in a different direction. We'll start with discussion of probability and end up proving a classical analysis result.
Let _p_ be the probability of success on some experiment and let _X_ₙ count the number of successes in _n_ independent trials. The expected value of _X_ₙ is _np_, and so the expected value of _X_ₙ/_n_ is _p_.

Now let _f_ be a continuous function on [0, 1] and think about _f_(_X_ₙ/_n_). As _n_ increases, the distribution of _X_ₙ/_n_ clusters more and more tightly around _p_, and so _f_(_X_ₙ/_n_) clusters more and more around _f_(_p_).

Let's illustrate this with a simulation. Let _p_ = 0.6, _n_ = 100, and let _f_(_x_) = cos(_x_). Here's a histogram of random samples from _X_ₙ/_n_. And here's a histogram of random samples from _f_(_X_ₙ/_n_). Note that it's centered near cos(_p_) ≈ 0.825.

As _n_ gets large, the expected value of _f_(_X_ₙ/_n_) converges to _f_(_p_). (You could make all this rigorous, but I'll leave out the details to focus on the intuitive idea.)

Now write out the expected value of _f_(_X_ₙ/_n_):

E[ _f_(_X_ₙ/_n_) ] = Σₖ₌₀ⁿ _f_(_k_/_n_) C(_n_, _k_) _p_ᵏ (1 − _p_)ⁿ⁻ᵏ

where C(_n_, _k_) is the binomial coefficient. Up to this point we've thought of _p_ as a fixed probability, and we've arrived at an approximation result guided by probabilistic thinking. But the expression above is approximately _f_(_p_) regardless of how we think of it. So now let's think of _p_ varying over the interval [0, 1]. Then E[ _f_(_X_ₙ/_n_) ] is an _n_th degree _polynomial_ in _p_ that is close to _f_(_p_). In fact, as _n_ goes to infinity, E[ _f_(_X_ₙ/_n_) ] converges uniformly to _f_(_p_).

So for an arbitrary continuous function _f_, we've constructed a sequence of polynomials that converge uniformly to _f_. In other words, we've proved the _Weierstrass approximation theorem_! The Weierstrass approximation theorem says that there exists a sequence of polynomials converging to any continuous function _f_ on a bounded interval, and we've explicitly constructed such a sequence. The polynomials E[ _f_(_X_ₙ/_n_) ] are known as the _Bernstein polynomials_ associated with _f_.

This is one of my favorite proofs. When I saw this in college, it felt like the professor was pulling a rabbit out of a hat. We're talking about probability when all of a sudden, out pops a result that has nothing to do with randomness.

Now let's see an example of the Bernstein polynomials approximating a continuous function. Again we'll let _f_(_x_) = cos(_x_). If we plot the Bernstein polynomials and cosine on the same plot, the approximation is good enough that it's hard to see what's what. So instead we'll plot the differences, cosine minus the Bernstein approximations. As the degree of the Bernstein polynomials increases, the error decreases. Also, note the vertical scale: cosine of zero is 1, and so the errors here are small relative to the size of the function being approximated.
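Both the simulation and the polynomial approximation can be sketched in a few lines of Python. The values _p_ = 0.6, _n_ = 100, and _f_ = cos come from the discussion above; the sample size, seed, and the particular degrees used in the error comparison are arbitrary illustrative choices.

```python
import numpy as np
from math import comb, cos

def bernstein(f, n, p):
    """Evaluate the degree-n Bernstein polynomial of f at p:
    B_n(f; p) = sum_k f(k/n) * C(n, k) * p**k * (1 - p)**(n - k),
    which equals E[f(X_n/n)] for X_n ~ Binomial(n, p)."""
    return sum(f(k / n) * comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1))

p, n = 0.6, 100
rng = np.random.default_rng(0)          # arbitrary seed
samples = rng.binomial(n, p, size=100_000) / n   # draws of X_n / n
print(np.cos(samples).mean())           # Monte Carlo estimate of E[cos(X_n/n)]
print(bernstein(cos, n, p))             # exact E[cos(X_n/n)], near cos(0.6)
print(cos(p))                           # the limit as n grows

# The uniform error max |cos(p) - B_n(cos; p)| shrinks as n grows.
grid = np.linspace(0, 1, 201)
for deg in (10, 50, 200):               # degrees chosen for illustration
    err = max(abs(cos(x) - bernstein(cos, deg, x)) for x in grid)
    print(deg, err)
```

The Monte Carlo average and the exact Bernstein value agree to a couple of decimal places, and the maximum error over [0, 1] drops roughly in proportion to 1/_n_, consistent with the uniform convergence claimed above.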
There are many conventions for describing points on a sphere. For example, does latitude zero refer to the North Pole or the equator? Mathematicians tend to prefer the former and geoscientists the latter. There are also varying conventions for longitude.
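A small conversion helper makes the clash concrete. This is a sketch under one common pair of conventions (geographic latitude measured from the equator versus mathematical colatitude measured from the North Pole); the function name and the choice of azimuth range are illustrative, not from any particular library.

```python
import numpy as np

def geo_to_colatitude(lat_deg, lon_deg):
    """Convert geographic coordinates (latitude zero at the equator,
    longitude in degrees east) to spherical angles common in math:
    colatitude theta measured down from the North Pole and azimuth phi,
    both in radians, with phi wrapped into [0, 2*pi)."""
    theta = np.radians(90.0 - lat_deg)
    phi = np.radians(lon_deg) % (2.0 * np.pi)
    return theta, phi

# The North Pole is latitude 90 to a geoscientist, but colatitude 0 here.
print(geo_to_colatitude(90.0, 0.0))
# A point on the equator sits at colatitude pi/2.
print(geo_to_colatitude(0.0, -90.0))
```

Feeding latitude where colatitude is expected flips a map across the equator, which is one way continents end up in the wrong hemisphere.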
Volker Michel describes this clash of conventions colorfully in his book on constructive approximation.

Many mathematicians have faced weird jigsaw puzzles with misplaced continents after using a data set from a geoscientist. If you ever get such figures, too, or if you are, for example, desperately searching South America in a data set but cannot find it, remember the remark you have just read to solve your problem.

Related posts:

- Duality in spherical trigonometry
- Latitude doesn't exactly mean what I thought

The latest episode of Star Trek Discovery (S1E4) uses the word "supercomputer" a few times. The word sounds jarring: it has become less common in contemporary usage, and it seems even more out of place in a work of fiction set more than two centuries in the future.
According to Google's Ngram Viewer, the term "supercomputer" peaked in 1990.
(The term "cluster" is far more common, but it is mostly a non-technical term. It's used in connection with grapes, for example, much more often than with computers.)
Years ago you'd hear someone say a problem would take a "big computer," but these days you're more likely to hear someone say a problem is going to take "a lot of computing power." Hearing that a problem is going to require a "big computer" sounds as dated as saying something would take "a big generator" rather than saying it would take a lot of electricity.
Like electricity, computing power has been commoditized. We tend to think in terms of the total amount of computing resources needed, measured in, say, CPU-hours or number of low-level operations. We don't think first about what configuration of machines would deliver these resources any more than we'd first think about what machines it would take to deliver a certain quantity of electrical power.
There are still supercomputers and problems that require them, though an increasing number of computationally intense projects do not require such specialized hardware.