- On your site you make the claim that: "Our thesis is that there is 100 years of physics and math research that has gone unnoticed by the CS/ML communities and we intend to rectify that."
Extraordinary claims require extraordinary evidence. Especially considering that a decent fraction of the CS/ML researchers that I know have solid physics and math backgrounds. Just off the top of my head, Marcus Hutter, David MacKay, Bernhard Scholkopf, Alex Smola, Max Welling, Christopher Bishop, etc. are/were prominent researchers with strong math and physics backgrounds. More recently, Jared Kaplan and Dario Amodei at Anthropic also have physics backgrounds, as do plenty of people at DeepMind.
To claim that you have noticed something in "100 years of physics and math research" that all of those people (and more) have missed but you have spotted is pure hubris.
- I also had to convince my parents to buy me books about fractals. My prized possession as a 15 year old was a copy of Mandelbrot's "Fractal Geometry of Nature". A lot of it went over my head but it had some gorgeous colour plates and interesting sections. I still have it at home some 35 years later.
That also inspired me to write IFS code for ferns, Sierpinski gaskets, and Menger sponges in 68k assembler (after realizing AmigaBASIC was too slow).
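For anyone curious, here's roughly what that looks like today — a minimal Python sketch of Barnsley's fern IFS via the "chaos game" (the coefficients are Barnsley's published ones, not whatever I had in my assembler version):

```python
import random

# Barnsley's fern IFS: four affine maps (a, b, c, d, e, f) chosen
# with the given probabilities p, applied repeatedly to a point.
MAPS = [
    (0.00,  0.00,  0.00, 0.16, 0.0, 0.00, 0.01),   # stem
    (0.85,  0.04, -0.04, 0.85, 0.0, 1.60, 0.85),   # successively smaller leaflets
    (0.20, -0.26,  0.23, 0.22, 0.0, 1.60, 0.07),   # largest left-hand leaflet
    (-0.15, 0.28,  0.26, 0.24, 0.0, 0.44, 0.07),   # largest right-hand leaflet
]

def fern(n=100_000):
    x, y = 0.0, 0.0
    points = []
    for _ in range(n):
        a, b, c, d, e, f, _p = random.choices(MAPS, weights=[m[6] for m in MAPS])[0]
        x, y = a * x + b * y + e, c * x + d * y + f
        points.append((x, y))
    return points
```

Plot the (x, y) pairs and the fern appears; the Sierpinski gasket is the same chaos game with three simpler maps.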
- Thanks, I hadn't seen that term before.
Just to be clear, I wasn't claiming that "communicating clearly" is a new idea in software engineering, I'm mainly commenting on how effective embracing it can be.
When doing math, pretty much every term is "load-bearing" in that arguments will make use of specific aspects of a concept and how it relates to other concepts.
If you look at most graduate-level math textbooks or papers, they typically start with a whole bunch of numbered definitions that reference each other, followed by some simple lemmas or propositions that establish simple relationships between them before diving into more complex theorems and proofs.
The best software projects I've seen follow a roughly similar pattern: there are several "core" functions or libraries with a streamlined API, good docs, and solid testing; on top of that there are more complex processes that treat these as black-boxes and can rely on their behavior being well-defined and consistent.
The common thread between math and programming is probably that both lean heavily on abstraction as a core principle.
- A lot of those skills come from thinking about development in a team as a system and asking: where do things frequently go wrong or take too long?
Practice clearly and concisely expressing what you understand the problem to be. This could be a problem with some code, some missing knowledge, or a bad process.
Check to see whether everyone understands and agrees. If not, try to target the root of the misunderstanding and try again. Sometimes you’ll need to write a short document to make things clear. Once there is a shared understanding then people can start talking about solutions. Once everyone agrees on a solution, someone can go implement it.
Like any skill, if you practice this loop often enough and take time to reflect on what worked and what didn’t, you slowly find that you develop a facility for it.
- Really nice presentation of divergences. I made a similar attempt (math + interactive JS visualizations) a long time ago for the Bregman divergences: https://mark.reid.name/blog/meet-the-bregman-divergences.htm...
These visualizations are much nicer than mine though.
Curious fact: the Bregman divergences are a different class of divergences from the f-divergences, and the two classes intersect at the KL divergence. That is, KL is (essentially) the only divergence that is both an f-divergence and a Bregman divergence. This is basically because log turns a ratio into a difference.
- These are definitely thought-provoking questions and there are branches of mathematical philosophy such as constructivism and mathematical intuitionism that explore these.
Even if computation is not directly part of the laws of physics, knowing that humans and our computers are limited to things that are finite and computable might place limits on how we can appreciate how the universe works.
This is kind of a digression but if you (or others) are interested in some examples of things that are right on the edge of these questions you should check out the [busy beaver function](https://www.quantamagazine.org/amateur-mathematicians-find-f...). This tells you the maximum number of steps an n-state Turing machine can take before halting (among those machines that do halt).
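To make that concrete, here's a tiny sketch (Python, just for illustration) simulating the known 2-state, 2-symbol champion machine, which gives BB(2) = 6:

```python
# The 2-state busy beaver champion: among all 2-state, 2-symbol Turing machines
# that halt on a blank tape, this one runs the longest (6 steps), so BB(2) = 6.
# Transitions are (state, symbol read) -> (symbol to write, head move, next state).
RULES = {
    ("A", 0): (1, +1, "B"),
    ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"),
    ("B", 1): (1, +1, "HALT"),
}

def run(rules, max_steps=1000):
    tape, head, state, steps = {}, 0, "A", 0
    while state != "HALT" and steps < max_steps:
        write, move, state = rules[(state, tape.get(head, 0))]
        tape[head] = write
        head += move
        steps += 1
    return steps, sum(tape.values())

print(run(RULES))   # (6, 4): halts after 6 steps with four 1s written
```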
- Sure, “arbitrary” is a suitable term here too. A random process is just one way to generate an arbitrary function.
My use of “random” here was referring to the coin-flipping process and was more for building intuition than precisely specifying all the other non-expressible functions. I was trying to allude to the fact that these other types of functions don’t have any easily expressible process behind them. When I’ve taught this stuff in the past I’ve found that people latch onto coin flipping more easily than imagining some arbitrary assignment of values.
For what it’s worth, I used to be a researcher in probability and information theory and have published papers on those topics so I am aware of the various technical definitions of randomness (Kolmogorov axioms, algorithmic probability theory, etc.)
I think you’re right about my comment being a little too lengthy for most people to find useful. I started explaining this stuff and got carried away.
- I think you are experiencing the same confusion I felt when I first started thinking about the difference between a program and a function.
The set of all possible mappings from all possible finite strings to booleans is definitely *not* countable.
What I (and the article) mean by a "function" `f : string -> boolean` here is any arbitrary assignment of a single boolean value to every possible string. Let's consider two examples:
1. Some "expressible" function like "f(s) returns True if the length of s is odd, and returns False otherwise".
2. Some "random" function where some magical process has listed out every possible string and then, for each string, flipped a coin and assigned that string True if the coin came up heads, and False if it came up tails and wrote down all the results in a infinite table and called that the function f.
The first type of "expressible" function is the type that we most commonly encounter and that we can implement as programs (i.e., a list of finitely many instructions to go from string to boolean).
Nearly all of the second type of function -- the "random" ones -- cannot be expressed by a program. The only way to capture such a function's behavior is to write down the infinite look-up table that assigns each string a boolean value.
Now you are probably asking, "How do you know that the infinite tables cannot be expressed using some program?" and the answer is because there are too many possible infinite tables.
To give you an intuition for this, consider all the ways we can assign boolean values to the strings "a", "b", and "c": there are 2^3 = 8 of them. For any finite set X of n strings there will be 2^n possible tables that assign a boolean value to each string in X. The critical thing here is that 2^n is strictly larger than n for all n >= 1.
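If it helps, here's that finite counting spelled out as a quick Python sketch (just enumerating all the tables over those three strings):

```python
from itertools import product

strings = ["a", "b", "c"]

# Every assignment of a boolean to each string is one "table"; there are 2**3 = 8.
tables = [dict(zip(strings, bits)) for bits in product([False, True], repeat=len(strings))]

print(len(tables))   # 8
print(tables[5])     # e.g. {'a': True, 'b': False, 'c': True}
```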
This fact -- that there are more tables mapping strings to booleans than there are strings -- still holds even when there are infinitely many strings. What exactly we mean by "more" here is what Cantor and others developed. They said that a set A has more things than a set B if, however you pair things from A with things from B, there will always be things in A left over (i.e., not paired with anything from B).
Cantor's diagonalization argument applied here is the following: let S be the set of all finite strings and F be the set of all functions/tables that assign a boolean to each element in S (this is sometimes written F = 2^S). Now suppose F were countable. By definition, that would mean there is a pairing that assigns each natural number to exactly one function from F with none left over. The set S of finite strings is also countable, so there is also a pairing from natural numbers to all elements of S. This means we can pair each element of S with exactly one element of F by looking up the natural number n assigned to s \in S and pairing s with the element f \in F that was also assigned to n. Crucially, assuming the countability of F means that if you give me a string s then there is always a single f_s that is paired with s. Conversely, if you give me an f \in F there must be exactly one string s_f that is paired with that f.
We are going to construct a new function f' that is not in F. The way we do this is by defining f'(s) = not f_s(s). That is, f'(s) takes the input string s, looks up the function f_s that is paired with s, calculates the value of f_s(s) then flips its output.
Now we can argue that f' cannot be in F since it is different from every function in F. Why? Well, suppose f' were actually some f \in F. Then, since F is countable, we know it must have some paired string s; that is, f' = f_s for some string s. Now if we look at the value of f'(s) it must be the same as f_s(s) since f' = f_s. But also f'(s) = not f_s(s) by the way we defined f', so we get that f_s(s) = not f_s(s), which is a contradiction. Therefore f' cannot be in F and thus our premise that F was countable must be wrong.
Another way to see this is that, by construction, f' differs from every function f in F, specifically on the input string s that is paired with f.
Thus, the set F of all functions from strings to booleans must be uncountable.
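If you prefer code to prose, here's the same diagonal construction run against a toy, finite "enumeration" in Python (the real argument is about infinite sets, of course, but the mechanics are identical):

```python
# A toy version of the diagonal argument: given any pairing of strings with
# functions, build a function f' that disagrees with every paired function
# on its own string, so f' cannot appear anywhere in the pairing.
paired = {
    "a":   lambda s: True,              # f_"a"
    "ab":  lambda s: len(s) % 2 == 0,   # f_"ab"
    "abc": lambda s: s.startswith("a"), # f_"abc"
}

def f_prime(s):
    return not paired[s](s)

for s in paired:
    assert f_prime(s) != paired[s](s)   # f' differs from f_s at s, so f' != f_s
```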
- I had exactly the same reaction, which prompted me to write this comment: https://www.hackerneue.com/item?id=44122045
- This is a really nice explanation of decidability. One extra thing it might be worth mentioning is that there are many more functions `f : string -> boolean` than there are programs that implement those functions.
When I first encountered this topic I had trouble intuitively understanding how an `IS_HALTING` function could fail to exist when it is, after all, just a function that takes in a string (representing a program plus its inputs) and outputs True or False depending on whether that program halts or not.
The argument in the article does a great job of showing that `IS_HALTING` cannot exist because it is in some sense "too powerful" but that means there is a mapping f : strings -> boolean that cannot be represented as a program, which seems weird if you've been programming for ages and every function you encounter is expressed as a program.
The result becomes less weird when you realize that almost all functions from string -> boolean are not expressible as programs. Why? Well, there are countably many programs, since there are only countably many finite-length strings and every program, by definition, is a finite-length string. However, there are uncountably many functions from string -> boolean, since these functions map one-to-one to sets of strings (just let the set be all inputs that map to True) and the set of all sets of strings has uncountable cardinality.
This is essentially due to Cantor's diagonalization argument which shows you cannot put all elements in a set X into a 1-1 correspondence with all the subsets of X, even when X is countably infinite. This fact is at the heart of a lot of these computability results since it shows there is a gap between all functions (= any arbitrary subset of finite strings) and a program (= a finite string).
- The program Racter (which, from what I understand, was a basic MDP) was generating poetry in the 1980s that was published, read, criticized, and presented in museums: https://www.101bananas.com/poems/racter.html
I remember this because one of its poems was used on the t-shirts of the computing social club that I was part of as a postgrad student:
More than iron
More than lead
More than gold I need electricity
I need it more than I need lamb or pork or lettuce or cucumber
I need it for my dreams

- I also wrote a short blog post about this topic back in 2009: https://mark.reid.name/blog/warning-high-dimensions.html
It's great to see these interesting math facts continue to be discussed and presented in new ways.
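For anyone who wants to poke at one of these facts directly, here's a quick Monte Carlo sketch in Python (my assumption here is the classic inscribed-ball example; it's just an illustration, not anything from the post itself) showing how little of a high-dimensional cube lies inside the ball it contains:

```python
import random

def inside_ball_fraction(dim, samples=100_000):
    """Monte Carlo estimate of how much of the cube [-1, 1]^dim
    lies inside the inscribed unit ball."""
    hits = sum(
        1 for _ in range(samples)
        if sum(random.uniform(-1, 1) ** 2 for _ in range(dim)) <= 1.0
    )
    return hits / samples

for d in (2, 3, 5, 10):
    print(d, inside_ball_fraction(d))
# The fraction collapses towards zero as the dimension grows:
# roughly 0.785, 0.524, 0.164, 0.0025 for d = 2, 3, 5, 10.
```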
- If you want to go a more analog route, this series of synth DIY videos by Moritz Klein is really great: https://www.youtube.com/@MoritzKlein0
- This likelihood-ratio approach highlights the fact that the KL divergence is a member of the family of Csiszár f-divergences. These are measures of "distance" between distributions of the form E_Q[ F(p(x)/q(x)) ], where F is any convex function with F(1) = 0. This is a kind of generalization of the log-likelihood ratio, where F "weights" how bad ratios different from 1 are. When F is -log you get the KL divergence.
Another curious fact about the KL divergence is that it is also a Bregman divergence: take a convex function H and define B_H(P, Q) = H(P) - H(Q) - <∇H(Q), P - Q>. These generalize squared Euclidean distance (which you recover with H(P) = ||P||^2), and KL is obtained when H is the negative entropy H(P) = \sum_x p(x) log p(x).
I spent a bunch of time studying divergences between distributions (e.g., see my blog post [1]), in particular these two classes, and the really neat fact about the KL divergence is that it is essentially the only divergence that is both an f-divergence and a Bregman divergence. This is basically due to the property of log that turns products into sums.
[1]: https://mark.reid.name/blog/meet-the-bregman-divergences.htm...
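Here's a small numerical sketch (Python, finite distributions) checking that the two constructions above both recover KL. Note I use F(t) = t log t, which gives KL(P||Q) in the same direction as the Bregman form; F = -log gives KL with the arguments swapped:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# Direct KL(P||Q)
kl = np.sum(p * np.log(p / q))

# As an f-divergence E_Q[F(p/q)] with F(t) = t log t
F = lambda t: t * np.log(t)
f_div = np.sum(q * F(p / q))

# As a Bregman divergence B_H(P, Q) = H(P) - H(Q) - <grad H(Q), P - Q>
# with H the negative entropy H(P) = sum_x p(x) log p(x).
H = lambda r: np.sum(r * np.log(r))
grad_H = lambda r: np.log(r) + 1
bregman = H(p) - H(q) - np.dot(grad_H(q), p - q)

print(kl, f_div, bregman)   # all three agree
```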
- "It is well known that stone can think, because the whole of electronics is based on that fact, but in some universes men spend ages looking for other intelligences in the sky without once looking under their feet. That is because they've got the time-span all wrong. From stone's point of view the universe is hardly created and mountain ranges are bouncing up and down like organ-stops while continents zip backward and forward in general high spirits, crashing into each other from the sheer joy of momentum and getting their rocks off. It is going to be quite some time before stone notices its disfiguring little skin disease and starts to scratch, which is just as well."
-- Terry Pratchett, "Equal Rites"
- It makes me glad to see that Aleph (and ILP more generally) are still being used. It's a really powerful and interpretable way of constructing code from examples.
I made a few contributions to Aleph as part of my PhD on doing transfer learning in ILP and really enjoyed working in Prolog.
- It might be easiest to give a sense of what "unprovable but true" means by way of an imagined example.
Goldbach's conjecture is that "every even number bigger than 2 is the sum of exactly two prime numbers", so 4 = 2 + 2, 6 = 3 + 3, 8 = 5 + 3, etc.
For this statement to be *true* just means that for every even number bigger than 2 there must exist two primes that add up to it. This is a statement about infinitely many integers.
A *proof* of Goldbach's conjecture consists of a finite number of formal reasoning steps that start with some axioms and end up at the statement of the result.
To this day, it seems as though Goldbach's conjecture is true. It holds for every number we've been able to test. However, no one has proved it is true or false yet (or proved that it is unprovable).
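If you want to play with it, a brute-force check for small even numbers is only a few lines (this is just a sketch, and of course checking finitely many cases proves nothing):

```python
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def goldbach_pair(n):
    """Return a pair of primes summing to the even number n, or None."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return p, n - p
    return None

# No counterexample has ever been found for the numbers we can check.
for n in range(4, 101, 2):
    assert goldbach_pair(n) is not None

print(goldbach_pair(100))   # e.g. (3, 97)
```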
The proof of Gödel's result involves very carefully formalizing what statements and proofs mean so that they can be encoded as statements about arithmetic. He then shows there is a statement with encoding G that says "The statement with encoding G cannot be proved" – if it is true, then it cannot be proved.
It's kind of confusing at first, but there is an easier way to get an intuition why there might be true statements that cannot be proved. Think of each statement about the natural numbers as a subset where each number in the subset makes the statement true. There are uncountably many subsets of the natural numbers (by Cantor's diagonalization argument). Proofs are finite chains of finite statements so there are only countably infinitely many of these. Therefore there must be subsets/statements that are true that do not have a matching proof.
The approach that Gödel's proof takes is not too different from the above argument – it is essentially a diagonalization argument – the complexity is in making the encoding of statements as numbers very precise.
- I learnt the strategy of "satisficing" from Herbert Simon's book "The Sciences of the Artificial". It suggests not bothering to split hairs amongst alternatives once options have met some threshold for acceptance.
- I'm not the parent, but I think they are referring to William Sethares' book "Tuning, Timbre, Spectrum, Scale": https://www.springer.com/gp/book/9781852337971
It's a fascinating take on how those concepts relate to each other. You can find some of the papers that were synthesized into that book on his web page: https://sethares.engr.wisc.edu/pubs.html
I always found this one a little poignant: