Subtitle
The SubRip file format is described on the
Matroska
multimedia container format website as "perhaps the most basic of all subtitle formats."SubRip (SubRip Text)
files are named with the extension.srt
, and contain formatted lines of plain text in groups separated by a blank line. Subtitles are numbered sequentially, starting at 1. The timecode format used is hours:minutes:seconds,milliseconds with time units fixed to two zero-padded digits and fractions fixed to three zero-padded digits (00:00:00,000). The fractional separator used is the comma, since the program was written in France.
How to load data from subtitle (.srt
) files
Please, download the example .srt file from here.
!pip install pysrt
from langchain.document_loaders import SRTLoader
loader = SRTLoader(
"example_data/Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt"
)
docs = loader.load()
docs[0].page_content[:100]
'<i>Corruption discovered\nat the core of the Banking Clan!</i> <i>Reunited, Rush Clovis\nand Senator A'