TaskKit
PDF Tools

Extract Text

Pull readable text out of a PDF. The file is parsed locally; we never receive a byte.

Files stay on your device. PDFs are read, merged, and saved entirely inside your browser. We never receive a byte of your file.

What this tool does

This extractor pulls the textual content out of a PDF. Drop a file, optionally toggle "preserve line breaks", and you get the raw text on the right — with each page introduced by a small marker (--- Page N ---). The text can be copied to the clipboard or downloaded as a UTF-8 .txt file. Parsing happens entirely in your browser via pdf.js; the PDF is never uploaded.

When you'd use it

  • Searching for a phrase across a long report, contract, or thesis without paying for a desktop suite.
  • Feeding a PDF's body into a summariser, translator, or LLM prompt that wants plain text.
  • Pulling the abstract or references out of an academic paper to paste into a citation manager.
  • Exporting chat or email transcripts that arrived as PDFs into a format your scripts can grep.
  • Verifying the OCR layer on a scanned document — the absence of recognisable text here means the PDF is image-only and needs OCR first.

How it works

pdf.js exposes a getTextContent() method on each page. It returns the page's text runs in the order they appear in the document's content stream, along with each run's transformation matrix: a six-element array [a, b, c, d, e, f] whose last element, at zero-based index 5, is the y-coordinate. With the toggle off, items are joined with single spaces. This is fast and consistent, but line breaks that the original PDF expressed visually are lost. With Preserve line breaks turned on, the tool walks the items and inserts a \n whenever the y-coordinate changes by more than about 5 units from one item to the next (the matrix is expressed in PDF user-space units, nominally points, rather than screen pixels). The heuristic recovers paragraph and line boundaries surprisingly well on text-based PDFs.
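
The joining step can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the TextRun interface mirrors only the two fields of a pdf.js text item used here, joinRuns and its threshold parameter are hypothetical names, and the sketch assumes runs on the same line are separated by a single space.

```typescript
// Minimal shape of a pdf.js text item; only the two fields used here.
interface TextRun {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; index 5 (f) is the y translation
}

// Hypothetical helper, written for illustration only.
function joinRuns(
  runs: TextRun[],
  preserveLineBreaks: boolean,
  threshold = 5 // assumed default, in the same units as the transform
): string {
  if (!preserveLineBreaks) {
    // Toggle off: every run is joined with a single space.
    return runs.map((r) => r.str).join(" ");
  }
  let result = "";
  let lastY: number | null = null;
  for (const run of runs) {
    const y = run.transform[5] ?? 0;
    if (lastY !== null) {
      // A large vertical change is treated as a new line, anything else as a space.
      result += Math.abs(y - lastY) > threshold ? "\n" : " ";
    }
    result += run.str;
    lastY = y;
  }
  return result;
}

// Three runs: two on one baseline, one on a baseline 14 units lower.
const sampleRuns: TextRun[] = [
  { str: "Hello", transform: [1, 0, 0, 1, 72, 700] },
  { str: "world", transform: [1, 0, 0, 1, 110, 700] },
  { str: "Next line", transform: [1, 0, 0, 1, 72, 686] },
];
// joinRuns(sampleRuns, false) gives "Hello world Next line"
// joinRuns(sampleRuns, true)  gives "Hello world\nNext line"
```

Comparing each run only with the one before it keeps the pass linear and dependent on stream order alone, which is also why it cannot untangle layouts where stream order and visual order disagree.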

Pages are separated with a literal --- Page N --- marker, so that when you paste the result into a search box or grep it, you can still find the page you want. The output is encoded as UTF-8, which means Latin diacritics, Cyrillic, CJK, and Arabic all round-trip without mojibake.
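
As a sketch of how the markers make the output searchable, the snippet below assembles per-page text with the marker format described above and then reports which pages mention a phrase. Both function names are hypothetical, and the blank line between pages is an assumption about layout rather than a documented detail.

```typescript
// Hypothetical helper: prefix each page's text with its marker.
function withPageMarkers(pageTexts: string[]): string {
  return pageTexts
    .map((text, i) => `--- Page ${i + 1} ---\n${text}`)
    .join("\n\n"); // assumed separator between pages
}

// Hypothetical helper: 1-based page numbers whose text contains the phrase.
function pagesContaining(output: string, phrase: string): number[] {
  const hits: number[] = [];
  const needle = phrase.toLowerCase();
  // Splitting on a regex with a capture group keeps the page numbers:
  // [preamble, "1", body1, "2", body2, ...]
  const parts = output.split(/^--- Page (\d+) ---$/m);
  for (let i = 1; i + 1 < parts.length; i += 2) {
    const body = (parts[i + 1] ?? "").toLowerCase();
    if (body.indexOf(needle) !== -1) {
      hits.push(Number(parts[i]));
    }
  }
  return hits;
}

const markedOutput = withPageMarkers(["Alpha beta", "Gamma delta", "alpha again"]);
// pagesContaining(markedOutput, "alpha") gives [1, 3]
```

A page whose own text happens to contain a line that looks exactly like a marker would confuse this split, so treat it as a convenience for typical documents rather than a robust parser.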

The pdf.js worker is bundled as a same-origin asset emitted by Vite at build time. There is no third-party CDN call, no telemetry, and no copy of your text retained after the tab closes.

Notes

Why is the result empty for some PDFs? PDFs come in two flavours: those with a real text layer (the kind exporters like Word, LaTeX, and Chrome produce) and those that are images of pages (scans, faxes, some old reports). pdf.js can only extract from the first kind. If the panel returns nothing, your PDF is almost certainly image-only and needs OCR first. Try a tool like Tesseract.js or a desktop OCR pipeline before extracting again.
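
If you are scripting around the extracted text, a check along these lines can flag image-only documents or individual scanned pages. It is a sketch under the assumption that you already hold one string per page; emptyPages and looksImageOnly are hypothetical names, not part of the tool or of pdf.js.

```typescript
// Hypothetical helper: 1-based numbers of pages with no non-whitespace text.
function emptyPages(pageTexts: string[]): number[] {
  const empty: number[] = [];
  pageTexts.forEach((text, i) => {
    if (text.trim().length === 0) {
      empty.push(i + 1);
    }
  });
  return empty;
}

// Hypothetical helper: true when every page came back empty.
function looksImageOnly(pageTexts: string[]): boolean {
  return pageTexts.length > 0 && emptyPages(pageTexts).length === pageTexts.length;
}

// emptyPages(["Intro text", "  \n", ""]) gives [2, 3]
// looksImageOnly(["", "   "]) gives true
```

A mixed result, where only some pages are empty, usually points to a document that combines exported pages with scanned inserts.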

Why do columns sometimes interleave? PDFs don't store paragraphs; they store runs of glyphs at specific coordinates. Two-column layouts can look fine on screen but give us interleaved runs because the items appear in stream order, not visual order. The y-coordinate heuristic helps in single-column documents; for tricky multi-column PDFs, you may need to manually clean up the output.
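
To make the interleaving concrete, here is a small illustration of one possible clean-up for a two-column page. It is not something this tool does: reorderTwoColumns and its splitX parameter are hypothetical, and the sketch assumes exactly two columns, a known horizontal split position, and the default PDF coordinate system in which y grows upward.

```typescript
interface PlacedRun {
  str: string;
  transform: number[]; // index 4 is the x translation, index 5 the y translation
}

// Illustration only: read the left column top to bottom, then the right column.
function reorderTwoColumns(runs: PlacedRun[], splitX: number): PlacedRun[] {
  const left = runs.filter((r) => (r.transform[4] ?? 0) < splitX);
  const right = runs.filter((r) => (r.transform[4] ?? 0) >= splitX);
  // Larger y is higher on the page, so sort by descending y.
  const topDown = (a: PlacedRun, b: PlacedRun): number =>
    (b.transform[5] ?? 0) - (a.transform[5] ?? 0);
  return left.sort(topDown).concat(right.sort(topDown));
}

// Stream order alternates between the columns, line by line.
const interleavedRuns: PlacedRun[] = [
  { str: "L1", transform: [1, 0, 0, 1, 72, 700] },
  { str: "R1", transform: [1, 0, 0, 1, 320, 700] },
  { str: "L2", transform: [1, 0, 0, 1, 72, 686] },
  { str: "R2", transform: [1, 0, 0, 1, 320, 686] },
];
// Joined in stream order: "L1 R1 L2 R2"
// After reorderTwoColumns(interleavedRuns, 300): "L1 L2 R1 R2"
```

Real documents rarely have such a clean split (headings span both columns, figures interrupt the flow), which is why manual clean-up is often still the practical answer.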

Are tables preserved? Cells come through as text but the column structure does not — there's no concept of "this cell aligns with that cell" in the underlying stream. For tabular data, look for a CSV export from the source if possible.

Is the result the same as "View → Select all → Copy" in a PDF reader? Close, but not identical. Reader-level copy uses the OS's text-handling stack, which sometimes adds spaces and breaks differently. This tool stays close to pdf.js's raw item stream, with one optional newline heuristic on top.
