- The consequence of Noether's theorem is that if a system is symmetric under time translation, then energy is conserved. On a global scale, the universe isn't time-translation symmetric: it has a beginning and expands over time, so energy isn't conserved.
- RL systems from before LLMs can very much learn new behaviors. Take a look at AlphaGo for that. RL can also learn to drive in simulated environments. RL in LLMs is not learning in the same way, so it can't create its own behaviors.
- Oof, I think that you are right. The issue with Futurelock is a failure of liveness, where the Future holding the lock doesn't get polled. tokio::join! would keep it alive and therefore my suggestion would mistakenly panic.
Yeah, the true fix is probably some form of the fabled Linear types/Structured concurrency where you can guarantee liveness properties.
- Disclaimer: I'm not a Tokio dev, so what I say may be very wrong. Some definitions:
Future = a structure with a method poll(self: Pin<&mut Self>, ...) -> Poll<Self::Output>. Futures are often composed of other futures and need to poll them. Tokio task = a top-level future that is driven by the Tokio runtime. These are the only futures that get run without something else polling them.
My understanding is that Tokio async locks keep a queue of tasks waiting on the lock. When the lock is released, the runtime polls the task at the front of the queue. Futurelock happens when a task takes the lock and then attempts to lock it a second time: a sub-future of the top-level task already holds the lock, and the task then polls a different future that tries to take the same lock.
This situation should be detectable because Tokio tracks which task is holding an async lock. One improvement could be to panic when this deadlock is spotted. This would at least make the issue easier to debug.
But yes, I think you are right in that the async mutex would need to take the future by value if it has the capability of polling it.
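For concreteness, here is a minimal sketch of the scenario I have in mind, assuming the tokio crate with its usual rt/macros/sync/time features; the names (take_lock, fut1, etc.) are mine, not from the original report. The program is expected to hang:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::sleep;

async fn take_lock(lock: Arc<Mutex<()>>) {
    // Waits in the mutex's queue. Once the mutex assigns the lock to this
    // future, it won't be released until the future is polled again or dropped.
    let _guard = lock.lock().await;
}

#[tokio::main]
async fn main() {
    let lock = Arc::new(Mutex::new(()));

    // A background task holds the lock for a while, so fut1 has to wait in the queue.
    let background = {
        let lock = Arc::clone(&lock);
        tokio::spawn(async move {
            let _guard = lock.lock().await;
            sleep(Duration::from_millis(100)).await;
        })
    };
    sleep(Duration::from_millis(10)).await; // let the background task grab the lock first

    // fut1 lives inside this task; it is not its own tokio::spawn'ed task.
    let mut fut1 = Box::pin(take_lock(Arc::clone(&lock)));

    tokio::select! {
        _ = &mut fut1 => {}
        _ = sleep(Duration::from_millis(10)) => {
            // This branch wins. fut1 is still alive (we only borrowed it) and still
            // queued on the mutex, but nothing will ever poll it again. When the
            // background task releases the lock, the mutex hands it to fut1, so
            // this second lock() waits forever: futurelock.
            let _guard = lock.lock().await;
            println!("never reached");
        }
    }
    background.await.unwrap();
}
```

The key detail is that fut1 is only borrowed by select!, so when the sleep branch wins, fut1 stays alive and stays at the front of the mutex's queue, but nothing ever polls it again.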
- It is well known that Rust's async model can lead to deadlocks. However, in the futurelock case I have come around to blaming the async locks. The issue is that they are not "fair", in the sense that they hand the lock to a future that may never be polled again. There may be other tradeoffs if the locks were made "fairer" in some way, but I think those tradeoffs should be explored.
- I would say, though, that for most programs any one of the most popular languages would do the trick. By this I mean Java, Go, C#, JavaScript, Python, C++. All of those are general-purpose, multi-paradigm languages that you can code almost anything in.
That being said, some programs can only be written in one of those. Browser code is JS-exclusive, low-level work needs C++, security-critical code should avoid C++. Machine learning needs Python, and high-performance code can't use Python. Some Windows things need C#. Those are the obvious cases where there is basically no choice. Beyond those, it is mostly about the team.
- My view of this is that it's closer to the basic two-lock deadlock.
Thread 1 acquires A. Thread 2 acquires B. Thread 1 tries to acquire B. Thread 2 tries to acquire A.
In this case, role "A" is being played by the front of the Mutex's lock queue, and role "B" by Tokio's actively executing task.
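For reference, the basic two-lock deadlock with plain OS threads looks something like this (a toy sketch that hangs by design):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let a = Arc::new(Mutex::new(()));
    let b = Arc::new(Mutex::new(()));

    let (a1, b1) = (Arc::clone(&a), Arc::clone(&b));
    let t1 = thread::spawn(move || {
        let _ga = a1.lock().unwrap();             // Thread 1 acquires A
        thread::sleep(Duration::from_millis(50)); // give thread 2 time to grab B
        let _gb = b1.lock().unwrap();             // ...then blocks waiting for B
    });

    let (a2, b2) = (Arc::clone(&a), Arc::clone(&b));
    let t2 = thread::spawn(move || {
        let _gb = b2.lock().unwrap();             // Thread 2 acquires B
        thread::sleep(Duration::from_millis(50)); // give thread 1 time to grab A
        let _ga = a2.lock().unwrap();             // ...then blocks waiting for A
    });

    // Both joins hang: each thread is waiting on the lock the other holds.
    t1.join().unwrap();
    t2.join().unwrap();
}
```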
Based on this understanding, I agree that the surprising behavior is due to Tokio's Mutex / lock-queue implementation. If this were an OS mutex and a thread waiting for it couldn't wake for some reason, the OS could wake a different thread waiting for that mutex. I think the difficulty with doing the same in async Rust has to do with how Rust's async is implemented. My guess is that the algorithm for releasing a lock goes something like this:
1. Pop the head of the wait queue.
2. Poll the top-level tokio::spawn'ed task of the Future that now holds the Mutex.
What you want is something like this:
For each Future in the wait queue (front to back):
    Poll the Future.
    If it succeeds: break.
???Something if everything fails???
The reason this doesn't work has to do with how futures compose. Futures compile to states within a state machine. What happens when a future polled within the wait queue completes? How is control flow handed back to the caller?
I guess you might be able to have some fallback that polls the waiting futures independently and then polls the top-level future to try to get things unstuck. But this could cause confusing behavior where futures are being polled even though no code path in your code is await'ing them. Maybe that is better, though?
- DynamoDB is working on going cellular, which should help. Some parts are already cellular, and others, like DNS, are in progress. https://docs.aws.amazon.com/wellarchitected/latest/reducing-...
- The start is good, but later paragraphs appear to be mostly LLM.
- Video processing is one of those things that calls for caution when done serverlessly. This solution makes sense, especially because S3's durability guarantees aren't needed.
- TLDR - this RNG is completely and totally broken.
First, I don't think the error term is contributing much to the solution. It almost never affects the high bit. In addition, it isn't fed back into updating the secret vectors, so I think an analysis can pretend it doesn't exist.
The nonlinear step where each value is squared looks questionable to me. Squaring only produces quadratic residues (https://en.wikipedia.org/wiki/Quadratic_residue), and there are only (p + 1)/2 of them mod p (counting 0), so this roughly halves the number of possibilities per coordinate.
So what this really boils down to is this:
You have a matrix G, a vector s, and a prime p. You repeatedly compute s' = Gs (Hadamard) Gs mod p. Each time you run this step you map a state space of size p^N into one of size roughly (p/2)^N. My guess is that most starting points get trapped in short cycles.
Using your example values, after 10 iterations it reaches [9, 16, 13, 8] and then repeats with a cycle length of 20. With 4 values and p = 17 you could get up to 83,520 states before repeating.
In some random tests, a state of 6 values with p = 97 enters a cycle at iteration 3,802 even though there are 832,972,004,929 possible states.
With 6 values and p = 271, it enters a cycle at iteration 166,684 even though there are 396,109,944,105,121 possible states.
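For anyone who wants to poke at this, here is a rough sketch of the cycle-length experiment. The matrix G and seed below are made up for illustration (I don't have the post's exact values), but the iteration s' = (Gs) ⊙ (Gs) mod p is as described above:

```rust
use std::collections::HashMap;

/// One step of the iteration described above: s' = (G s) ⊙ (G s) mod p,
/// i.e. multiply by G, then square each component.
fn step(g: &[Vec<u64>], s: &[u64], p: u64) -> Vec<u64> {
    g.iter()
        .map(|row| {
            let dot = row.iter().zip(s).map(|(a, b)| a * b % p).sum::<u64>() % p;
            dot * dot % p
        })
        .collect()
}

/// Iterate until a state repeats; returns (iteration where the cycle is entered, cycle length).
fn find_cycle(g: &[Vec<u64>], mut s: Vec<u64>, p: u64) -> (usize, usize) {
    let mut seen: HashMap<Vec<u64>, usize> = HashMap::new();
    let mut i = 0usize;
    loop {
        if let Some(&first) = seen.get(&s) {
            return (first, i - first);
        }
        seen.insert(s.clone(), i);
        s = step(g, &s, p);
        i += 1;
    }
}

fn main() {
    // Hypothetical 4x4 matrix G and seed with p = 17 (not the post's actual values).
    let p = 17u64;
    let g = vec![
        vec![3, 5, 7, 11],
        vec![2, 9, 4, 6],
        vec![13, 1, 8, 10],
        vec![12, 14, 15, 16],
    ];
    let s0 = vec![1, 2, 3, 4];

    let (start, len) = find_cycle(&g, s0, p);
    println!("enters a cycle at iteration {start}, cycle length {len}");
}
```

Swapping in larger p and longer state vectors should show the same pattern: the tail plus cycle length stays tiny compared to p^N.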
- AWS, for the ultimate backup, relies on a phone call bridge on the public phone network.
- For DynamoDB, I'm not sure, but I think it's covered: https://aws.amazon.com/dynamodb/sla/. "An "Error" is any Request that returns a 500 or 503 error code, as described in DynamoDB", and there were tons of 5XX errors. In addition, the calculation uses the percentage of successful requests, so even partial degradation counts against the SLA.
From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/
The reason is the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity.". Instances that were already created kept working, so this isn't covered. The SLA doesn't cover creation of new instances.
- My guess is that for IAM it has to do with consistency and security. You don't want regions disagreeing on what operations are authorized. I'm sure the data store could be distributed, but there might be some bad latency tradeoffs.
The other concerns could have to do with the impact of failover to the backup regions.
- Mostly AWS relies on each region being its own isolated copy of each service. It gets tricky when you have globalized services like IAM. AWS tries to keep those to a minimum.
- It kind of makes sense. It's good enough to stop a non-coder, and anything in the browser can either be broken by a serious coder or comes with unpleasant tradeoffs.
Amazon would need to drop this feature to seriously lock down their books.
- And Apple did provide all iCloud data they had available.
- One more concern I noticed: this generative approach needs not only each layer to select each of its outputs with uniform probability, but each layer to do so regardless of its input.
Here is the bad case I am concerned about:
Layer 1 -> (A, B); Layer 2 -> (C, D)
Let's say Layer 1 outputs A and B each with probability 1/2 (a perfect split), and Layer 2 outputs C when it gets A as input and D when it gets B as input. Layer 2 then outputs each of its outputs with probability 1/2 overall, but not with probability 1/2 conditioned on the output of Layer 1.
If this happens, the claim of an exponential increase in diversity per layer breaks down.
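Here is a tiny toy sketch of that failure mode (K = 2 per layer; the labels and code are my own illustration, not from the paper):

```rust
use std::collections::HashSet;

// Toy two-layer model with K = 2 candidate outputs per layer.
// A "sample" is the pair (layer-1 output, layer-2 output).
fn distinct_samples(layer2: impl Fn(char) -> Vec<char>) -> HashSet<(char, char)> {
    let mut samples = HashSet::new();
    for z1 in ['A', 'B'] {        // layer 1: perfect 1/2-1/2 split over A and B
        for z2 in layer2(z1) {    // layer 2: outputs reachable given z1
            samples.insert((z1, z2));
        }
    }
    samples
}

fn main() {
    // Correlated case: layer 2 just echoes layer 1 (C after A, D after B).
    // C and D each still appear with marginal probability 1/2, but only
    // 2 distinct samples are reachable instead of K^2 = 4.
    let correlated = distinct_samples(|z1| if z1 == 'A' { vec!['C'] } else { vec!['D'] });

    // Conditionally uniform case: layer 2 can output C or D regardless of input.
    let independent = distinct_samples(|_| vec!['C', 'D']);

    println!("correlated:  {} distinct samples", correlated.len());  // 2
    println!("independent: {} distinct samples", independent.len()); // 4
}
```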
It could be that the first-order approximation provided by Split-and-Prune is good enough. My guess, though, is that the gradient and the Split-and-Prune are helping each other keep the outputs reasonably balanced on the datasets you are working on: the Split-and-Prune lets the optimization process "tunnel" through regions of the loss landscape that would otherwise make it hard to balance the classes.
- A thought on why the intermediate L2 losses are important: in the early layers there is little information, so the L2 loss will be high and the images blurry. In much deeper layers the information from the argmins dominates and there is little left to learn. The L2 losses from the intermediate layers help by providing a good training signal when some information about the target is known but large unknowns remain.
The model can be thought of as N Discrete Distribution Networks, one at each depth from 1 to N, stacked on top of each other and trained simultaneously.
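In that view, one rough way to write the combined objective (my notation, with z_n the output selected by the argmin at depth n; not necessarily the paper's exact loss, and per-depth weights could be added) is

$$
\mathcal{L} \;\approx\; \sum_{n=1}^{N} \big\lVert D - z_n \big\rVert^2 .
$$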
- First, I think this is really cool. It's great to see novel generative architectures.
Here are my thoughts on the statistics behind this. First, let D be the data sample. Start with the expectation of -Log[P(D)] (standard generative model objective).
We then condition on the model output at step N.
- Expectation of Log[Sum over model outputs at step N{P(D | model output at step N) * P(model output at step N)}]
Now use Jensen's inequality to transform this to
<= - expectation of Sum over model outputs at step N{Log[P(D | model output at step N) * P(model output at step N)]}
Apply Log product to sum rule
= - expectation of Sum over model outputs at step N {Log(P(D | model output at step N)) + Log(P(model output at step N))}
If we assume there is some normally distributed noise we can transform the first term into the standard L2 objective.
= expectation of Sum over model outputs at step N {L2 distance(D, model output at step N) - Log(P(model output at step N))}
Apply linearity of expectation
= Sum over model outputs at step N [expectation of{L2 distance(D, model output at step N)}] - Sum over model outputs at step N [expectation of {Log(P(model output at step N))}]
and the summations can be replaced with sampling
= expectation of {L2 distance(D, model output at step N)} - expectation of {Log(P(model output at step N))}
Now, focusing on just the - expectation of Log(P(sampled model output at step N)) term.
= - expectation of Log[P(model output at step N)]
and condition on the prior step to get
= - expectation of Log[Sum over possible samples at N-1 of (P(sample output at step N| sample at step N - 1) * P(sample at step N - 1))]
Now, each P(sample at step T | sample at step T - 1) is approximately equal to 1/K. This is enforced by the Split-and-Prune operations, which try to keep each output sampled at roughly equal frequency.
So this is approximately equal to
≃ - expectation of Log[Sum over possible samples at N-1 of (1/K * P(possible sample at step N - 1))]
And you get an upper bound by keeping only the term for the sample that was actually taken at step N - 1:
<= - expectation of Log[1/K * P(actual sample at step N - 1)]
And applying some log rules you get
= Log(K) - expectation of Log[P(sample at step N - 1)]
Now, you have (approximately) expectation of -Log[P(sample at step N)] <= Log(K) - expectation of Log[P(sample at step N - 1)]. You can repeatedly apply this transformation until step 0 to get
(approximately) expectation of -Log[P(sample at step N)] <= N * Log(K) - expectation of Log[P(sample at step 0)]
and assume that P(sample at step 0) is 1 (a fixed starting point) to get
expectation of -Log[P(sample at step N)] <= N * Log(K)
Plugging this back into the main objective, we get (assuming the Split-and-Prune is perfect)
expectation of -Log[P(D)] <= expectation of {L2 distance(D, sampled model output at step N)} + N * Log(K)
And this makes sense. You are providing the model with an additional Log_2(K) bits of information every time you perform an argmin operation, so in total you have provided the model with N * Log_2(K) bits of information. However, this term is constant, so the gradient-based optimizer can ignore it.
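For readability, here is the same chain in compact notation, with $z_n$ the model output sampled at step $n$; the $\lesssim$ signs absorb the Jensen step and the two approximations above (Gaussian observation noise for the L2 term, Split-and-Prune making each conditional roughly $1/K$):

$$
-\,\mathbb{E}\,\log P(D) \;=\; -\,\mathbb{E}\,\log \sum_{z_N} P(D \mid z_N)\, P(z_N) \;\lesssim\; \mathbb{E}\big[\lVert D - z_N \rVert^2\big] \;-\; \mathbb{E}\,\log P(z_N)
$$

and, since $P(z_n \mid z_{n-1}) \approx 1/K$ and $P(z_0) = 1$,

$$
-\,\mathbb{E}\,\log P(z_N) \;\lesssim\; \log K \;-\; \mathbb{E}\,\log P(z_{N-1}) \;\lesssim\; \cdots \;\lesssim\; N \log K,
$$

giving

$$
\mathbb{E}\big[-\log P(D)\big] \;\lesssim\; \mathbb{E}\big[\lVert D - z_N \rVert^2\big] \;+\; N \log K .
$$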
So, given this analysis my conclusions are:
1) The Split-and-Prune is a load-bearing component of the architecture with regard to its statistical correctness. I'm not entirely sure how it fits with the gradient-based optimizer. Is it working with the gradient-based optimizer, fighting it, or somewhere in the middle? I think the answer to that question will strongly affect this approach's scalability. A more in-depth analysis is also needed to study how deviations from perfect splitting affect the upper bound on the loss.
2) With regard to statistical correctness, the L2 distance between the output at step N and D is the only one that matters; the L2 losses in the middle layers can be considered auxiliary losses. Maybe the final L2 loss, or the L2 losses deeper in the model, should be weighted more heavily? In final evaluation the intermediate L2 losses can be ignored.
3) Future possibilities could include some sort of RL to determine the number of samples K and the depth N dynamically. Even a split with K = 2 increases the NLL bound by Log_2(2) = 1 bit, and for many samples, past a certain depth, the increase in loss from the additional information outweighs the decrease in L2 loss. This also points to another difficulty: it is hard to give the model fractional bits of information in this Discrete Distribution Network architecture, whereas diffusion models and autoregressive models can handle fractional bits. This could be another point of future development.