@mbkriegh Thanks. 🙂
Interesting to hear about the research papers. The benchmarks for language model summarisation are usually collections of research papers, so it stands to reason that the results there would be more accurate than on most other kinds of documents. And it would make sense that the citations were 100% wrong, as that's exactly where these models are weak.
The worry I would have is: what are the consequences if it's not 100% right, but 100% right 98% of the time and 100% wrong the other 2% of the time?
Because that's the dynamic with these models. You hit the long tail, or an edge case that's just a little too novel for it, and it goes bonkers; but because of all the other times it worked, you've come to trust it. I'm glad that it gets the citations 100% wrong. That should make people trust it less, and they need to be distrustful, because it's the 98%-right use cases where adopting these tools can do the most damage.