I think that research of this kind is highly relevant for AI alignment. It seems quite plausible that neural networks, or something similar to them, will be used as a component of AGI. If that is the case, then we want to be able to reliably predict and reason about how neural networks behave in new situations, and how they interact with other systems. It is hard to imagine how that would be possible without a deep understanding of the dynamics at play when neural networks learn from data.

Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.

I should also note that I think the fact that Gaussian processes work at all already gives us a fairly good reason to expect them to capture most of what makes NNs work in practice. For any given function approximator, if that function approximator is highly expressive then the "null hypothesis" should be that it basically won't generalise at all.

For example, if a file can be compressed a lot by a ZIP file encoding, then that file has low Kolmogorov complexity, but not all low Kolmogorov complexity strings can be compressed by ZIP specifically.
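
A minimal sketch of this asymmetry, using Python's `zlib` (the DEFLATE algorithm that ZIP commonly uses) as a stand-in for "ZIP". The two test strings and the helper name `compressed_size` are my own illustrative choices, not anything from the discussion. Both strings are produced by a program only a few lines long, so both have low Kolmogorov complexity, but only one of them has the kind of repetition DEFLATE can exploit.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib (DEFLATE) encoding at the highest compression level."""
    return len(zlib.compress(data, 9))


n = 100_000

# Highly repetitive data: DEFLATE finds the repeats, so the output is tiny.
repetitive = b"ab" * (n // 2)

# Output of a seeded pseudo-random generator. The generating program is short,
# so the Kolmogorov complexity is low, yet DEFLATE finds no structure to exploit.
rng = random.Random(0)
pseudo_random = bytes(rng.getrandbits(8) for _ in range(n))

print("repetitive:   ", compressed_size(repetitive), "bytes from", n)
print("pseudo-random:", compressed_size(pseudo_random), "bytes from", n)
```

The compressed size is an upper bound on description length under one particular encoding. A small value shows low complexity, while a large value shows nothing either way.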

I'm not sure I agree with this; I think Kolmogorov complexity is a relevant notion of complexity in this context.

Thanks for this, this is really interesting! I'm especially interested to hear pushback against it, because I think it's evidence for short timelines.

However, in high dimensional settings, most of the volume is near the boundary, so it's not a huge deal. I'm not aware of any work that claims SGD uniformly samples from this boundary, but it's worth considering that possibility if the experimental results hold up.
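
To make the "most of the volume is near the boundary" point concrete, here is a rough sketch under a simplifying assumption of mine: the region is a ball. The regions under discussion need not be balls, so this only illustrates the scaling. Volume in `dim` dimensions scales as radius to the power `dim`, so the fraction of the volume within relative distance `eps` of the surface is `1 - (1 - eps) ** dim`. The function name `shell_fraction` is hypothetical.

```python
def shell_fraction(dim: int, eps: float) -> float:
    """Fraction of a dim-dimensional ball's volume within relative distance eps of its surface.

    Follows from volume scaling as radius ** dim: the inner ball of radius
    (1 - eps) holds a fraction (1 - eps) ** dim of the total volume.
    """
    return 1.0 - (1.0 - eps) ** dim


# Compare how much volume sits in the outermost 1% of the radius.
for dim in (1, 10, 100, 1000):
    print(dim, shell_fraction(dim, 0.01))
```

The fraction climbs towards 1 as the dimension grows, which is why sampling from the interior and sampling from near the boundary become hard to tell apart in high dimensions.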

If SGD works because it's Bayesian, then making it more Bayesian should make it work better. But according to that's not the case. Lowering the temperature, or taking the MAP (=temperature 0), generalizes better than taking the full Bayesian posterior, as calculated by an expensive MCMC procedure.
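
For readers unfamiliar with the "temperature" terminology, here is a toy sketch of a tempered posterior over three hypotheses. The probabilities and the function name `tempered_posterior` are made up for illustration, and this says nothing about which temperature generalizes better in real networks. Each posterior probability is raised to the power 1/T and renormalised: T = 1 gives the ordinary Bayesian posterior, and as T approaches 0 the mass concentrates on the MAP hypothesis.

```python
import math


def tempered_posterior(log_post, temperature):
    """Normalise exp(log_post / T) over a finite set of hypotheses (requires T > 0)."""
    scaled = [lp / temperature for lp in log_post]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical posterior over three hypotheses; the first is the MAP.
log_post = [math.log(p) for p in (0.5, 0.3, 0.2)]

full = tempered_posterior(log_post, 1.0)   # the ordinary Bayesian posterior
cold = tempered_posterior(log_post, 0.05)  # almost all mass on the MAP hypothesis

print("T = 1.0 :", full)
print("T = 0.05:", cold)
```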

I agree with your summary. I'm mainly just clarifying what my view is on the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing.

