[isabelle] Minimizer and preplay - 2 of the many amazing features of Sledgehammer



Hi,

I've spent a lot of time studying how Sledgehammer works with the 25 or so ATPs that can be enabled.

About the time I decide I'm spending too much time experimenting with it, I get some new understanding of its feedback in the output panel, and learn how to use another of its many amazing features.

So I used the "dont_preplay" option: Sledgehammer was spending time replaying the proofs the ATPs had found in order to test them for its own purposes, and since the proofs it found were all working anyway, I decided it didn't need to test them.
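For example, a run with preplay turned off looks something like this (the timeout value here is just an arbitrary choice of mine, not a recommendation):

   sledgehammer [dont_preplay, timeout = 60, verbose]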

But there was this one metis proof that stayed purple for an annoyingly long time as Isabelle clicked through the proofs of my theory, 2 or 3 seconds. That's a long time when every other proof takes only half a second or less.

The short story is that, not understanding the timing information Sledgehammer was giving me due to preplay, I had picked the 3-second proof instead of one of the half-second proofs it had offered.

Eventually, I discovered the use of the timing information that preplay produces. Does Sledgehammer return 10 different metis proofs? Now, I don't just pick the one with the shortest list of facts; I also look at how long Sledgehammer took to preplay each proof.

Okay, I've learned to look at more than that. Are there ten proofs offered? I look at the theorems each metis call uses and ask, "If I were going to do a detailed proof using these theorems, which set of facts would be the most logically appealing and require the fewest steps?"

But due to parallel processing and the timeout setting, there are lots of differences in the proofs Sledgehammer will find.

Sledgehammer was returning 10 or so metis proofs that used numerous facts, and it was offering to minimize them, which is when Sledgehammer feeds the proof back to the ATP with fewer and fewer facts.

I said, "Hey, manually going through 7 minimize attempts is not what I call automation, so I'll set 'minimize=true'". But there's always a trade off.

After going back and running Sledgehammer on theorems with the preplay option on, for one theorem it was finding only one proof, rather than the usual many, and it was remote_z3_tptp that found it.

This brings up the issue of nuance, one being that, given a certain timeout setting, sometimes a remote duplicate of a local prover will return a proof that the local prover doesn't.

I had also started asking the question, "Do I really need z3_tptp in there?" Well, obviously, yes.

Since I wasn't getting lots of proofs, I disabled "minimize" and used "dont_preplay". I then got proofs for z3_tptp, remote_z3_tptp, z3, and remote_z3. The time spent for the 4 provers was like this:

   remote_z3_tptp:   ATP real CPU time: 2.15 s.
   z3_tptp:          ATP real CPU time: 29.79 s.
   z3:               SMT solver real CPU time: 2.22 s.
   remote_z3:        SMT solver real CPU time: 5.75 s.

More nuances. Sometimes the remote prover is faster than the local prover, and sometimes not. Last week, I didn't use remote provers that are duplicates of local provers, but now I generally do.

There's more for me to tell about how to use Sledgehammer's information than most people have time to read.

For example, z3_tptp gave me a proof like this:

by (metis extension_equality unordered_pair_axiom ordered_pair_def) (406 ms)

My extension_equality theorem is merely a biconditional of my extension_axiom. I was asking these questions, "Do I really need extension_equality? Will the proof work using extension_axiom instead, and will it be slower if it does?"

Well, the Sledgehammer output shows you how to use the "min" option, so I tried this, not to minimize the number of facts, but to get the timing information with the axiom substituted for the theorem:

sledgehammer min [z3_tptp, timeout = 90, verbose] (extension_axiom unordered_pair_axiom ordered_pair_def),

and I got this:

Try this: by (metis extension_axiom unordered_pair_axiom ordered_pair_def) (408 ms).

So metis only uses 2 more milliseconds when it uses the axiom directly.

I finish out most of this email with my maxed-out sledgehammer command, and the reasoning behind it.

With Sledgehammer, you need a good process viewer to see what it's doing, and a good CPU meter in your task bar to see whether your CPUs are running heavily when they shouldn't be. If you don't do things right, processes that are taking up 25% of your CPU won't get terminated. Doing things right means letting Sledgehammer finish normally, or deleting the command before you close jEdit; sometimes processes just don't get terminated when you don't let Sledgehammer finish.

   Win CPU meter:        http://www.hexagora.com/en_dw_davperf.asp
   Win Process Explorer: http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx

My long version is this, and I explain what's not arbitrary about it:

sledgehammer[DEL,dont_preplay,minimize,timeout=90,verbose,max_relevant=smart,provers="
    metis remote_vampire e remote_z3 spass_new remote_e_sine z3_tptp
    remote_e_tofof remote_e remote_z3_tptp remote_cvc3 remote_iprover
    remote_iprover_eq remote_waldmeister satallax z3 cvc3 yices"]

1) I use a macro to insert it, and I put the DEL in to cause an error so it won't run. The "DEL,dont_preplay," is highlighted, and I press the backspace key to delete it, after which Sledgehammer starts running.

2) I want "preplay" to get the timing information, but I could decide to not to do "preplay" so that Sledgehammer is devoting more time to only finding proofs, in case it has been timing out and not getting many proofs.

3) I normally want to automate "minimize", but if it's timing out and not getting many proofs, I'll delete it before I run Sledgehammer.

4) I could set "max_relevant=1000" so that Sledgehammer feeds the ATPs a massive number of facts, but I've learned that if the ATPs have to work too hard, then either I'm trying to prove a false statement or I need to add some preliminary lemmas. Quite possibly, a proof that takes an ATP a long time to find will also cause the Isabelle prover to work far too hard and long once it's given the proof. (There's a sketch of that kind of tweaked run after this list.)

5) Only a maximum of 4 provers will run in parallel, and they run in the order given, so I put the 7 better provers first, and I set a reasonable timeout so that the first 4 provers don't hold everything else up too long. Many times the first four will finish fast anyway, and again, if a particular prover isn't proving something reasonably fast, then it probably won't prove it at all, or it won't find a decently fast proof.

6) metis is first because, though it can get overloaded and give up fast, it can also return proofs very fast; metis is run directly by poly.exe.

7) The Vampire binary is hard to get, the Windows version doesn't run right, and remote_vampire is twice as fast as my Linux Vampire, but Vampire is one of the best of the best, so I put it second.

8) "e" is an execllent prover, so I put the local e in the first group. I also put remote_e in with the remote group of provers. Processing nuances could result in remote_e returning a proof that the local e doesn't. Local provers can fail to return proofs that remote provers return because parallel processing is never the same, especially when the options and provers listed are different. But I try these 20 or so provers on the first try to try and save time; different provers find proofs for different theorems, and always because they weren't given enough time. If I don't get such good results on this massive try, then I can start tweaking options.

9) remote_z3 is before z3 because z3, when used directly, won't run in parallel. It will only run after all the other provers listed before it have finished. However, it's an excellent prover, which is why I want remote_z3 to start running in the first 4. Additionally, I leave out the "smt" prover; smt calls z3, and z3 used directly finds better proofs, and finds them just as fast.

10) "z3_tptp" uses the z3-prover.exe just like z3, but it's not used the same, so it finds proofs when the others don't, as shown by my example above.

11) I then let Sledgehammer go through all the other remote provers, other than remote_leo2 and remote_satallax. These are HOL provers, and I'm almost exclusively doing FOL. I wonder about leaving remote_iprover and remote_iprover_eq in. They don't prove a lot, but I leave them in.

12) I then do the local satallax. Maybe it's faster or slower at finishing than remote_satallax, but remote provers can sometimes hold things up; a 60-second timeout might get processed remotely over a period of much more than 60 seconds.

13) z3, cvc3, and yices won't run in parallel with other provers, so I put them last, not that it would matter; if I put them first, they still wouldn't start until the other provers are finished. I put z3 first of the three, because they will run in the order listed.
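Here's the kind of tweaked run items 2 through 4 are talking about, for when the big run above times out without finding much: drop "minimize", leave "dont_preplay" in, raise the timeout, and, if I really want to, crank up "max_relevant" (though, as I said, needing that is usually a bad sign). The numbers and the shortened prover list below are purely illustrative:

   sledgehammer [dont_preplay, timeout = 300, max_relevant = 1000,
       provers = "e remote_vampire z3_tptp remote_z3"]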

Thanks to Larry Paulson and Jasmin Blanchette for writing an amazing piece of software. Jasmin has helped me out a lot, and here are some links to articles Jasmin gave me. He says,

   For many of my papers and in my Ph.D. thesis, I've evaluated E, iProver, SPASS, Vampire, CVC3, Yices, Z3, LEO-II, and Satallax. See e.g.

        http://www21.in.tum.de/~blanchet/jar-smt.pdf  (Section 6)
        http://www21.in.tum.de/~blanchet/testbench.pdf  (Section 5)
        http://www21.in.tum.de/~blanchet/enc_poly_paper.pdf  (Section 6)
        http://www21.in.tum.de/~blanchet/itp2012-spass.pdf  (Section 7)
        http://www21.in.tum.de/~blanchet/iwil2010-sledgehammer.pdf  (Section 3)

   and of course the original "Judgement Day" paper by Böhme & Nipkow:

        http://www21.in.tum.de/~boehmes/judgement/judgement.pdf

   Hence I have a very good idea, based on thousands of data points, of how well the different provers work.

There's also the amazing Nitpick. I tried to use my own primitive equality operator, and Nitpick found a counterexample for both a theorem and its negation. I don't know how long it would have taken for the bad news to show up without Nitpick, because my primitive equality was working fine for the theorems I was proving.
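To give a flavor of how Nitpick is invoked (a toy, deliberately unprovable lemma, nothing to do with my actual theory):

   lemma "A Un B = B"
     nitpick
     oops

Nitpick reports a countermodel (some element that is in A but not in B), which tells you to stop trying to prove the statement as written.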

I could go on about the possibilities that have started to occur to me, but I'll save that for another day. After reading some of the papers above, it's a case of, "Dude, Larry and Jasmin started thinking these things through a long time before you got these premature ideas on how to use Sledgehammer".

But prior to using Sledgehammer, I had this idea that detailed proofs and automatic proofs were two different paths. It's not that way at all. Sledgehammer is indicative of the future. Do you want an automatic proof without all the details? The automatic tools will try to give it to you. Do you want to provide a detailed proof? The automatic tools will also be a big part of that. You'll provide the detailed steps, and the automatic tools will prove your steps and provide information that clearly shows the justification, without requiring you to have a database of 10,000 theorem names in your memory; those names are probably not in your memory, because you're not the person who developed the 10,000-theorem library.

And the automatic tools will give you all sorts of useful information along the way, like "You can prove it with these three theorems, if you know how to provide the details", or "You can prove it with these two other theorems, if you can figure out how", which is pretty much what Sledgehammer does now when it uses metis and smt.

TIP: After Sledgehammer finishes, highlight all the text in the output panel, paste it into another jEdit file, and do a HyperSearch on "Try this:". That lets you look at the information for all the proofs offered, and you can experiment with the commands. If you change the options for Sledgehammer or remove one of the provers, you can get significantly different results on the next run for the same theorem, but you still have the data from the earlier run until you no longer want it.

My macro is below. It inserts the command and highlights "DEL,dont_preplay,", so I only have to hit the backspace key to delete the highlighted text and start Sledgehammer.

Regards,
GB



// jEdit BeanShell macro: insert the sledgehammer command, then select
// "DEL,dont_preplay," so that a single backspace removes it and starts Sledgehammer.
SearchAndReplace.setAutoWrapAround(false);
SearchAndReplace.setIgnoreCase(false);
SearchAndReplace.setRegexp(false);
SearchAndReplace.setSearchFileSet(new CurrentBufferSet());

textArea.setSelectedText("sledgehammer[DEL,dont_preplay,minimize,timeout=90,verbose,max_relevant=smart,provers=\"");
textArea.insertEnterAndIndent();
textArea.setSelectedText("metis remote_vampire e remote_z3 spass_new remote_e_sine z3_tptp");
textArea.insertEnterAndIndent();
textArea.setSelectedText("remote_e_tofof remote_e remote_z3_tptp remote_cvc3 remote_iprover");
textArea.insertEnterAndIndent();
textArea.setSelectedText("remote_iprover_eq remote_waldmeister satallax z3 cvc3 yices\"]");

// Locate "DEL,dont_preplay," (backward, then forward) so it ends up selected.
SearchAndReplace.setSearchString("DEL,dont_preplay,");
SearchAndReplace.setReverseSearch(true);
SearchAndReplace.find(view);

SearchAndReplace.setSearchString("DEL,dont_preplay,");
SearchAndReplace.setReverseSearch(false);
SearchAndReplace.find(view);














