Our detector uses burstiness and perplexity along with several other features of the document. They help explain why our model made its decision, although they are only one part of the whole story.
You can interpret the per-sentence perplexity as a measure of how surprised an AI model would be by the exact words found in the document: the lower the perplexity, the more likely it is that the model would have chosen those same words. One part of GPTZero’s algorithm uses an AI model similar to language models like ChatGPT to measure the perplexity of the given document. We’ve trained that model to identify when the input text looks very similar to something written by a language model.
For example, an AI model would most likely continue the sentence “Hi there, I am an AI _” with the word “assistant”, which would give the sentence low perplexity. On the other hand, if the next word were “detector”, the sentence would have much higher perplexity, and also a greater likelihood of having been written by a human. Over the course of hundreds of words, these probabilities compound to give us a clear picture of the origin of the document.
There isn’t an absolute scale for perplexity, but generally, a perplexity above 85 is more likely than not from a human source. For a more technical definition of this measure, you can take a look at this guide: “Perplexity in Language Models” by Chiara Campagnola (Towards Data Science).
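To make the idea concrete, here is a minimal sketch of the textbook perplexity calculation: the exponential of the average negative log-probability a model assigns to each word. This is not GPTZero’s implementation. The function name `perplexity` and the probabilities below are made up for illustration, loosely following the “Hi there, I am an AI _” example, and the resulting numbers are not on the same scale as the threshold of 85 mentioned above.

```python
import math


def perplexity(token_probs):
    """Return exp(average negative log-probability) over the given tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical probabilities a language model might assign to each word.
# Only the last word differs: "assistant" (expected) vs. "detector" (surprising).
ends_with_assistant = [0.9, 0.8, 0.9, 0.85, 0.9, 0.7]
ends_with_detector = [0.9, 0.8, 0.9, 0.85, 0.9, 0.001]

if __name__ == "__main__":
    print("assistant:", round(perplexity(ends_with_assistant), 2))
    print("detector: ", round(perplexity(ends_with_detector), 2))
```

A single unexpected word is enough to raise the perplexity of the whole sentence, which is why the surprising ending scores higher than the expected one.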
Burstiness is a measure of how much the perplexity varies over the entire document. One significant fingerprint of language models is that they write with a very consistent level of AI-likeness. While a person could easily write an AI-like sentence by accident, people tend to vary their sentence construction and diction a lot throughout a document. On the other hand, models formulaically use the same rule to choose the next word in the sentence, leading to low burstiness. Thus, greater burstiness tends to indicate that the document is more human-like. While there is no set threshold for burstiness, it is taken as one element of several in GPTZero’s model.
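As a rough illustration of the idea, the sketch below treats burstiness as the standard deviation of per-sentence perplexities. That choice is an assumption made for this example: the exact formula GPTZero uses is not described here, and the function name `burstiness` and the sample values are hypothetical.

```python
import statistics


def burstiness(sentence_perplexities):
    """Spread of per-sentence perplexity, measured here as the population
    standard deviation (an assumed stand-in, not GPTZero's actual formula)."""
    if len(sentence_perplexities) < 2:
        return 0.0  # a single sentence has no variation to measure
    return statistics.pstdev(sentence_perplexities)


# Hypothetical per-sentence perplexities for two documents.
steady_document = [20.0, 22.0, 21.0, 19.0, 20.0]    # consistently predictable
varied_document = [15.0, 120.0, 40.0, 210.0, 25.0]  # mixes plain and unusual sentences

if __name__ == "__main__":
    print("steady:", round(burstiness(steady_document), 2))
    print("varied:", round(burstiness(varied_document), 2))
```

Under this assumption, the varied document scores far higher than the steady one, matching the intuition that people change their sentence construction and diction more than a model does.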