This is a useful checklist; it reminded me of a recent topic on Babbage from Economist Radio, "Whether AI is the end of the Scientific Method" [1].
The argument was that in ML/DL, experiments are run at large scale without a hypothesis, with radical empiricism in a trial-and-error fashion,
which goes against the scientific method, i.e. hypothesis, experiment, observation, theory.
I'd say that the scientific method is just a formal process taught to school kids and that most scientists don't follow it either. At least they don't in my field (physics).
It's more like "hypothesis" -> "experiment" -> "uh, this is kind of weird" -> "changes hypothesis to fit the data" -> "take some more data" -> "huh... here's this cool thing unrelated to any of my other hypotheses" -> "switches topic to something more viable".
You are absolutely right, but I don't think the philosophy of science demands that you actually follow the scientific method during discovery mode. What it says, though, is that if you discover any scientific knowledge, its discovery process must be recastable in the form of the scientific method. Meaning, when someone else checks or replicates your result, they can actually follow the four-step process and recreate that same knowledge.
Science is more than just a collection of facts; it is a description of a set of experimental processes that recreate those facts. That is where its authority-less power comes from.
For me the key lies in the experiment. Only by generating new data through new experiments do we arrive at robust theories; otherwise we overfit to whatever data is at hand. That is why we need cross-validation when working with canned data.
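To make the cross-validation point concrete, here is a minimal, purely illustrative k-fold sketch. Everything in it is hypothetical: the "model" is just a mean predictor standing in for whatever you would actually fit, and the fold split is the simplest contiguous one.

```python
# Toy k-fold cross-validation: evaluate on held-out folds rather than
# on the data the model was fit to. The "model" here is a mean
# predictor, purely as a stand-in for a real learner.

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds (last fold takes the remainder)."""
    fold_size, folds = n // k, []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n
        folds.append(list(range(start, end)))
    return folds

def cross_validate(xs, ys, k=5):
    """Mean squared error of a mean predictor, averaged over k held-out folds."""
    errors = []
    for test_idx in kfold_indices(len(xs), k):
        held_out = set(test_idx)
        train_ys = [y for j, y in enumerate(ys) if j not in held_out]
        prediction = sum(train_ys) / len(train_ys)  # "fit" on training folds only
        fold_mse = sum((ys[j] - prediction) ** 2 for j in test_idx) / len(test_idx)
        errors.append(fold_mse)
    return sum(errors) / len(errors)
```

The point is only that the error estimate comes from data the "model" never saw; with canned data and no held-out folds, you are scoring yourself on your own training set.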
I don't think reproducibility is about the scientific method. It is about trust -- i.e., is the claim correct and stable? This is also why reproducibility is important in science.
I don't think the scientific method (in the sense of hypothesis->experiment->observation->theory) has anything to do with reproducibility. Frankly, I am not even sure it is necessary for good work, whether it is in ML/DL or in an established basic science. Scientific method is definitely not how most practitioners experience doing research.
The gold standard is h->e->o->t->p->o: you use the theory to create a new prediction that is confirmed by observation, thus supporting the theory. This is what is so compelling about 20th-century physics -- so many insights and consequences of the theory that were then supported by subsequent observation and experiment.
Hypothesis generation is definitely also a common goal of scientific work. There are many situations in which you don't have a good prior to base an experiment on.
"The scientific method" is a somewhat nebulous concept in daily discussions, as evidenced by the guy above defining it such that reproducibility is not part of it.
Personally, I would count reproducibility as a fairly important part of the scientific method, as it demonstrates you've achieved the fundamental requirement of understanding your experimental setup.
Without a very detailed understanding of your experiment, the experiment isn't scientific; so by extension I would claim that the GP's "scientific method" isn't scientific either.
Eliezer Yudkowsky wrote a lot about (many things, including) refining the scientific method in Rationality [0]. IIRC, his point is that the scientific method puts no constraints on what hypotheses you test. He thinks a more orderly search of hypothesis space is preferable.
> with radical empiricism in a trial and error fashion which is against Scientific Method
There is a longer discussion of this issue, with references, in the Stanford Encyclopedia of Philosophy: https://plato.stanford.edu/entries/scientific-method/#SciMet... It is especially worth reading if you believe that trial and error is against the scientific method.
This checklist has some flaws. Most interesting results in ML have no proof.
For example, can you give a proof of superconvergence? What’s the exact learning rate that causes it, and why? Did you know that you can often get away with a high learning rate for a time, and then divergence happens? What’s the proof of that?
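The "high learning rate works for a while, then diverges" behavior is at least easy to demonstrate empirically, even if a general proof is out of reach. Below is a toy sketch on the quadratic f(x) = x² (gradient 2x), where the divergence threshold is known exactly; this is an illustration of the learning-rate cliff, not a claim about superconvergence itself.

```python
# Gradient descent on f(x) = x^2, gradient 2x, so the update is
# x <- x * (1 - 2*lr). The iterate contracts when |1 - 2*lr| < 1
# (i.e. lr < 1.0 for this toy objective) and blows up past it.

def run_gd(lr, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # x <- x * (1 - 2*lr)
    return abs(x)

# lr = 0.4 -> per-step factor |1 - 0.8| = 0.2, converges to ~0
# lr = 1.1 -> per-step factor |1 - 2.2| = 1.2, diverges geometrically
```

For real networks the loss surface has no such closed-form threshold, which is exactly the parent's point: the phenomenon is observable and repeatable without anyone being able to prove where the cliff is.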
Give a proof that under all circumstances and wind conditions, lowering your airplane’s flaps by 5 degrees will help you land safely.
Also, what about datasets that you’re not allowed to release? I personally despise such datasets, but I found myself in the ironic position of having a 10GB dataset dropped in my lap that was a perfect fit for my current project. Unfortunately it wasn’t until after training was mostly complete that we realized we hadn’t asked whether the author was comfortable releasing it, and indeed the answer was no. So what to do? Just don’t talk about it?
I guess the list is good as a set of ideals to aim for. I just wish some consideration was given that you often can’t meet all of those goals.
Most of OpenAI's work would be excluded by this checklist, yet I don't think anyone would argue that OpenAI doesn't do important work, or that their results aren't in some sense reproducible.
> Give a proof that under all circumstances and wind conditions, lowering your airplane’s flaps by 5 degrees will help you land safely.
My passing familiarity with aerodynamics and control theory suggests that you could derive a multidimensional shell in parameter space (of wind, temperature, airspeed, and other conditions) to form an envelope within which the plane will behave predictably, so that you can prove whether lowering flaps by 5 degrees at a particular point will help you land safely. Accounting for model uncertainty, that envelope would likely be tighter than it could be if we knew our physics better, but that's still far better than a black box ML model that doesn't give you guarantees that similar inputs will lead to similar outputs (there's a mathematical formalism to this whose name escapes me now).
>but that's still far better than a black box ML model that doesn't give you guarantees that similar inputs will lead to similar outputs (there's a mathematical formalism to this whose name escapes me now).
You might be thinking of the K-Lipschitz smoothness or K-Lipschitz continuity of a model for a given norm, i.e. $\Vert f(x)-f(y)\Vert \leq K \Vert x-y\Vert$. A guarantee that K is smaller than or equal to a certain value is called a Lipschitz certificate. We can by now give this type of guarantee in balls around the training data (coming out of adversarial-example research), but with some limitations, and the generalization of Lipschitz certificates to test data and/or other norms is pretty bad in general.
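As a small illustration of the gap between observing and certifying: you can cheaply estimate a lower bound on K by sampling pairs and taking the worst ratio, but that never yields a certificate (an upper bound on K), which is the hard direction the parent describes. Everything below is a hypothetical sketch; `f` stands in for a trained model.

```python
# Sampling-based LOWER bound on a function's Lipschitz constant on an
# interval: max over sampled pairs of |f(x) - f(y)| / |x - y|.
# This can only ever under-estimate K; a certificate needs an upper
# bound, which requires analysing the model itself.
import random

def lipschitz_lower_bound(f, lo=-1.0, hi=1.0, n_pairs=10_000, seed=0):
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_pairs):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if x != y:
            best = max(best, abs(f(x) - f(y)) / abs(x - y))
    return best
```

For f(x) = 3x the estimate sits at the true constant K = 3; for a neural network the same procedure only tells you how bad things are on the points you happened to sample.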
I personally think that the term "black box ML model" needs to die with respect to neural networks; the theory work being done has pried open that box sufficiently by now that we can start reasoning about them somewhat. People just generally don't like the answers, because they unveil limitations or challenges.
The checklist item that asks for a proof is for theoretical claims. Empirical claims are supported by experiment. Theoretical claims must also be supported theoretically.
To put it plainly, if your paper has a "Framework" section that has some text titled "lemma" and "theorem" then you also need some text titled "proof"
Some venues let you put a longer, more detailed proof in the appendix, but the reviewer is free to ignore it for the purpose of evaluating your paper. Anyway, if you have a detailed theorem in the main paper it makes sense to have at least a rudimentary proof in the main paper also.
Most of OpenAI's work would be excluded for a good reason.
OpenAI's focus is producing bleeding-edge technology demos and being an open-source AI platform company. They cut corners just to be able to show what brute-forcing the state of the art can do right now.
This is not a critique. It's just a fact that their focus is different from fundamental research.
> For example, can you give a proof of superconvergence? What’s the exact learning rate that causes it, and why? Did you know that you can often get away with a high learning rate for a time, and then divergence happens? What’s the proof of that?
Would Neural Tangent Kernels help with this, or are they just as impenetrable as latent spaces?
Would you say OpenAI's Dota 2 results were interesting? They're completely unreproducible according to the checklist: the dataset wasn't released, the code wasn't released, nothing but the results were released.
Ditto for AlphaGo, AlphaZero, and most other AI systems.
If one wants to argue that we shouldn't be doing AI science this way, then that's fine. It's just a different conversation.
It's the other way around. The AI was clearly real and important, but absolutely unreproducible according to the metrics in this submission. GPT-2 is much the same: the dataset was never released, nor was the training code released. Yet would anyone say that it wasn't an important contribution to science?
I admit that it's possible the entire world is currently "doing AI wrong," though. T5 and other new projects have been better with respect to replication. But many datasets are still locked behind paywalls, or only accessible if you have an .edu email address.
Yes, I think many people would say that GPT-2 is not an important contribution to science, and that it rather is the latest focus of industry and lay press hype for machine learning and neural networks. The latest trend, if you will.
Personally, I struggle to see what new thing we learned from GPT-2. Did we learn something about the physical world? About how human minds work? About how language works? It's a language model, after all. All we learned is that throwing a large dataset at a hard problem can produce results that are difficult to evaluate.
"Science" means "knowledge". If we haven't learned anything new from GPT-2 then it hasn't contributed to science. It's impressive, like a jetliner is impressive, or an aircraft carrier is impressive, but it's not increasing our body of knowledge about the world and ourselves.
You are arguing (plausibly imo) that reproducibility is not always necessary for successful science. But that is hardly a flaw with a reproducibility (as opposed to 'successful science') checklist.
This is aimed at production or critical applications, though, not frontier or blue-sky research. In the former case, we need a shared and agreed-upon framework to make sure everyone, everywhere, gets statistically comparable results, and this checklist helps in that sense. In the latter case, it is open field: we look for roughly concordant results first, and a method to fit them is devised later.
I agree with you in principle. I still think reproducibility should be a goal even in pure, blue-sky machine learning research, for the following reasons:
Even basic research can be sensitive to omitted parameters, setups, and starting conditions. Honest mistakes, accidental omissions, and failing to spot sensitivity to parameters happen all the time. Writing "a clear explanation of any assumptions" is rarely comprehensive: it's easy to miss things and become blind to some fine details.
Discovering and documenting new, interesting phenomena and dynamics is also part of basic research. Experimental discoveries published without explanation should always be reproducible.
[1] https://soundcloud.com/theeconomist/babbage-ai-the-end-of-th...