Text Segmentation

Split text into segments by function or topic

Overview

Segment text into coherent, readable units based on distinct topics and paragraphs, so that long text can be broken down into manageable chunks. The API accepts both raw text and webpage URLs as input sources. See the section description for more details.

You can pass in text directly, including HTML tags, or a URL to a file in text, HTML, or PDF format. For URLs, the API first renders the linked page and then extracts its visible text. For HTML pages, only the main page content is processed; site navigation elements and footers are ignored.
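
As a minimal sketch of how the two input kinds could map onto a request body, the helper below picks a source type from the input string. The names `looks_like_url` and `build_segmentation_payload` are hypothetical, and the `"TEXT"` value for `sourceType` is an assumption (only `"URL"` appears in the examples on this page), so check the API reference before relying on it.

```python
def looks_like_url(source: str) -> bool:
    """Rough check: a single http(s) token with no whitespace inside it."""
    stripped = source.strip()
    has_scheme = stripped.lower().startswith(("http://", "https://"))
    return has_scheme and not any(ch.isspace() for ch in stripped)


def build_segmentation_payload(source: str) -> dict:
    """Build a JSON body for the segmentation endpoint (hypothetical helper).

    Assumption: raw text is sent with sourceType "TEXT"; this page only
    shows "URL" in its examples.
    """
    return {
        "source": source,
        "sourceType": "URL" if looks_like_url(source) else "TEXT",
    }


# A webpage URL and a raw HTML snippet produce different source types.
url_payload = build_segmentation_payload("https://en.wikipedia.org/wiki/Blue_whale")
text_payload = build_segmentation_payload("<p>Blue whales are marine mammals.</p>")
```

The resulting dictionary could be passed as the `json` argument of a `requests.post` call. Raw text that happens to consist of nothing but a URL would be treated as a URL by this check, which may or may not be what you want.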

Successful response members:

  • id: ID of this segmentation request.
  • segments: List of segments found in the submitted text.
    • segmentText: The text of this segment, extracted from the original text.
    • segmentType: The function of this segment in the document. The following values are supported:
      • normal_text: Normal content text.
      • title: Main title of the document.
      • h1, h2, h3, h4, h5, h6: A heading in an HTML or PDF document.
      • foot_note: A footnote.
      • other: Any other category, such as a references section, bibliography, or citations, or content that could not be categorized for some other reason.
      • normal_text_short: Short block of text. Might be too short to summarize.
      • normal_text_long: Long block of text.
      • non_english: Non-English content that couldn't be categorized.
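
To illustrate how these members might be consumed, the sketch below filters a response down to body text and counts segments by type. The `sample_response` dictionary is a hand-written illustration, not real API output, and the helper names are hypothetical. Field names follow the camelCase JSON members listed above.

```python
from collections import Counter

# Segment types treated as body text in this sketch (a choice made here, not an API rule).
BODY_TYPES = {"normal_text", "normal_text_short", "normal_text_long"}


def body_segments(response: dict) -> list:
    """Return the text of body segments, skipping titles, headings, footnotes, etc."""
    return [
        seg["segmentText"]
        for seg in response.get("segments", [])
        if seg.get("segmentType") in BODY_TYPES
    ]


def count_by_type(response: dict) -> Counter:
    """Count how many segments of each segmentType the response contains."""
    return Counter(seg.get("segmentType") for seg in response.get("segments", []))


# Hand-written sample shaped like the members above; not real API output.
sample_response = {
    "id": "example-id",
    "segments": [
        {"segmentText": "Blue whale", "segmentType": "title"},
        {"segmentText": "The blue whale is a marine mammal.", "segmentType": "normal_text"},
        {"segmentText": "Taxonomy", "segmentType": "h2"},
        {"segmentText": "It is a baleen whale.", "segmentType": "normal_text_short"},
    ],
}

body = body_segments(sample_response)
counts = count_by_type(sample_response)
```

Keeping only the body types is one way to prepare input for a later step such as summarization, since headings and footnotes rarely carry enough text on their own.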

Examples

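# Python SDK
from ai21 import AI21Client
from ai21.models.document_type import DocumentType

# Replace the placeholder with your AI21 API key
client = AI21Client(api_key="<AI21_API_KEY>")

def segmentation():
    # Segment the webpage at the given URL
    response = client.segmentation.create(
        source="https://en.wikipedia.org/wiki/Blue_whale",
        source_type=DocumentType.URL
    )
    # Print the type and character length of each segment
    for segment in response.segments:
        print(f"Segment type: {segment.segment_type}, Length: {len(segment.segment_text)} chars")

segmentation()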

    
# Response
Segment type: normal_text, Length: 732 chars
Segment type: normal_text, Length: 652 chars
Segment type: normal_text, Length: 455 chars
Segment type: h2, Length: 8 chars
Segment type: h3, Length: 12 chars
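...

# REST API, called with the requests library
import requests

ROOT_URL = "https://api.ai21.com/studio/v1/"
# Replace the placeholder with your AI21 API key
AI21_API_KEY = "<AI21_API_KEY>"

def segment():
    url = ROOT_URL + "segmentation"
    # Segment the webpage at the given URL
    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {AI21_API_KEY}"},
        json={
            "source": "https://en.wikipedia.org/wiki/Blue_whale",
            "sourceType": "URL"
        }
    )
    # Stop early on an HTTP error instead of failing on a missing "segments" key
    response.raise_for_status()

    # Print the type and character length of each segment
    for item in response.json()["segments"]:
        print(f"Segment type: {item['segmentType']}, Length: {len(item['segmentText'])} chars")

segment()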
   
 
# Response
Segment type: normal_text, Length: 732 chars
Segment type: normal_text, Length: 652 chars
Segment type: normal_text, Length: 455 chars
Segment type: h2, Length: 8 chars
Segment type: h3, Length: 12 chars
...