The performance of Marmaragan with GPT-4o on the benchmark is promising, with correct annotations having been generated for 50.7% of the benchmark cases. The results establish a foundation for future work on combining the power of LLMs with the reliability of formal software verification.
That’s basically a coin flip. Even at a lower success rate, a user would have better odds just guessing whether an annotation is true or false.