Special tokens and whitespaces

The ▁ (\u2581) character in the tokens is used by our tokenizer to substitute a single whitespace or tab symbol. Sequences of 2 and 4 consecutive whitespace symbols (either regular spaces or tabs) have their own tokens, ▁▁ and ▁▁▁▁, respectively.
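For illustration only, here is a minimal sketch (not the production tokenizer) of how a run of whitespace could be split into these tokens; the greedy longest-match order is an assumption:

>>> def whitespace_run_tokens(n):
...     """Split a run of n whitespace symbols into ▁ tokens, longest first (assumed greedy)."""
...     tokens = []
...     for size, tok in ((4, "▁▁▁▁"), (2, "▁▁"), (1, "▁")):
...         count, n = divmod(n, size)
...         tokens += [tok] * count
...     return tokens
...
>>> whitespace_run_tokens(7)
['▁▁▁▁', '▁▁', '▁']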

Note that tokenization adds a dummy space at the start of each line for consistency, so the resulting text is not simply the concatenation of all tokens with ▁ replaced by a space. For example:

>>> import requests
>>> res = requests.post("...", json={"prompt": "This is the 1st line\nThis is the 2nd line",
...                                  "temperature": 0, "maxTokens": 16})
>>> res.status_code
200
>>> data = res.json()
>>> data['completions'][0]['data']['text']
'\nThis is the 3rd line\nThis is the 4th line\nThis is the 5th line\n'
>>> tokens = [t['generatedToken']['token'] for t in data['completions'][0]['data']['tokens']]
>>> "".join(tokens)
'<|newline|>▁This▁is▁the▁3rd▁line<|newline|>▁This▁is▁the▁4th▁line<|newline|>▁This▁is▁the▁5th▁line<|newline|>'
>>> "".join(tokens).replace("▁"," ").replace("<|newline|>", "\n")
'\n This is the 3rd line\n This is the 4th line\n This is the 5th line\n'
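For this example, dropping the ▁ that immediately follows each <|newline|> token recovers the completion text exactly:

>>> "".join(tokens).replace("<|newline|>▁", "\n").replace("<|newline|>", "\n").replace("▁", " ")
'\nThis is the 3rd line\nThis is the 4th line\nThis is the 5th line\n'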

Each token's textRange field maps it to its corresponding span in the result text. Note that the text field of the prompt in the response may differ from the text sent in the request if the request contains special symbols that behave differently after tokenization; in that case, the textRange fields always refer to the text in the response.
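For example, assuming each textRange holds start and end character offsets into the completion text (and that the token spans tile it), concatenating the spans reconstructs that text; a sketch:

>>> text = data['completions'][0]['data']['text']
>>> spans = [t['textRange'] for t in data['completions'][0]['data']['tokens']]
>>> "".join(text[s['start']:s['end']] for s in spans) == text
True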