Can this tool extract text from scanned PDFs?

No, this tool extracts only embedded digital text from PDF documents. Scanned PDFs that contain only images without a text layer cannot be processed because the tool does not include OCR (Optical Character Recognition) capability. To extract text from scanned documents, you would first need to process them through OCR software such as Adobe Acrobat Pro, ABBYY FineReader, or the open-source Tesseract engine. Once the OCR software has added a text layer to the PDF, you can then use this tool to extract the resulting text.

What is the difference between Text and Markdown output?

Text output produces plain text in which table columns are separated by tab characters. This format is best for simple content extraction or when you plan to process the output programmatically. Markdown output preserves table structure using pipe-delimited (|) Markdown table syntax with header separators, so tables render as tables in Markdown-compatible viewers and editors such as Notion, Obsidian, GitHub, or VS Code. Choose Markdown when you want to keep document structure, especially for files containing tables.
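To make the difference concrete, here is a minimal sketch of the two renderings for the same rows. The function names `to_text` and `to_markdown` are hypothetical and are not taken from the tool; treating the first row as the header and escaping pipe characters are assumptions.

```python
def to_text(rows):
    """Render rows as plain text with tab-separated columns."""
    return "\n".join("\t".join(str(cell) for cell in row) for row in rows)


def to_markdown(rows):
    """Render rows as a pipe-delimited Markdown table (first row = header)."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so cell text cannot break the table, then pad short rows.
        cells = [str(c).replace("|", "\\|") for c in row]
        cells += [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


rows = [["Name", "Qty"], ["Apple", 3], ["Pear", 12]]
print(to_text(rows))
print(to_markdown(rows))
```

The tab-separated form is easier to split in a script, while the Markdown form carries the header separator row that viewers need to draw a table.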

Is my uploaded file stored on the server?

No, uploaded files are deleted immediately after text extraction is complete. Each file is saved temporarily with a unique identifier (UUID) in the server's temp directory and is automatically removed in the cleanup process regardless of whether the extraction succeeds or fails. The extracted text result is transmitted to your browser and is not retained on the server. No file content, metadata, or extraction results are logged or stored after processing.
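The save-then-always-delete pattern described above can be sketched as follows. This is an illustration under assumptions, not the tool's server code: `extract_with_cleanup` and the `extractor` callback are hypothetical names.

```python
import os
import tempfile
import uuid


def extract_with_cleanup(data: bytes, suffix: str, extractor):
    """Save an upload under a UUID name, run the extractor, always delete the file."""
    path = os.path.join(tempfile.gettempdir(), f"{uuid.uuid4().hex}{suffix}")
    try:
        with open(path, "wb") as fh:
            fh.write(data)
        return extractor(path)
    finally:
        # Runs whether extraction succeeded or raised, so the upload
        # does not outlive the request.
        if os.path.exists(path):
            os.remove(path)
```

Putting the removal in a `finally` block is what makes the cleanup independent of whether extraction succeeds or fails.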

Why do some Excel cells appear empty in the extracted output?

The Excel extractor reads stored calculation values (data_only mode) rather than evaluating formulas at extraction time. If cells contain formulas that were never calculated and saved by Excel — for example, if the file was created programmatically or last saved by a library that does not evaluate formulas — the stored values may be empty. To resolve this, open the file in Microsoft Excel or LibreOffice Calc, allow it to recalculate all formulas, save it, and then upload the saved version. This ensures all formula results are stored in the file.

Can I process HWP and HWPX files without Hancom Office installed?

Yes, both HWP (legacy binary format) and HWPX (modern XML-based format) files can be fully processed without Hancom Office installation. HWPX files are parsed using the python-hwpx library which reads the XML structure directly, while HWP files are processed using the pyhwp library which interprets the proprietary binary format. That said, HWPX provides higher extraction accuracy due to its structured XML foundation. If you have the option, saving your Hangul documents in HWPX format before uploading will yield the best results.

Text Extractor

Extract text from PDF, Word, Excel, CSV, HWP, and HWPX documents.

Supported: PDF, DOCX, XLSX, CSV, HWP, HWPX

Maximum file size: 10MB

Text Extractor Guide

What is the Text Extractor?

The Text Extractor extracts text content from several document file formats and converts it into plain text (TXT) or Markdown (MD). You use it from your web browser with no software installation required; files are uploaded to the server for extraction.

■ Supported File Formats
• PDF: Extracts text and tables from digital PDF documents, including coordinate-based table detection for informal table layouts
• DOCX: Extracts paragraphs and tables from Microsoft Word documents while preserving document structure
• XLSX: Reads all sheets from Microsoft Excel workbooks, with an individual header for each sheet in multi-sheet files
• CSV: Reads comma-separated value files with automatic encoding detection (UTF-8, CP949, EUC-KR) for Korean text support
• HWPX: Extracts text from modern Hangul Word Processor documents (XML-based format) without requiring Hancom Office installation
• HWP: Extracts text from legacy Hangul Word Processor documents (binary format) using specialized binary parsing

This tool is particularly useful when you need to quickly pull text content out of documents for reuse. Common use cases include extracting report content for presentations, converting spreadsheet data into text for data processing, pulling text from Korean government documents in HWP format, and creating Markdown versions of PDF tables for knowledge management tools.

In Markdown output mode, the extraction preserves table structure by converting document tables into Markdown tables with pipe separators and header dividers. This makes it easy to paste extracted content directly into Markdown-compatible editors like Notion, Obsidian, or GitHub.

How to Use

■ Step 1: Upload Your File
Upload a document by clicking the upload area to open a file selection dialog, or drag and drop your file onto the upload zone. The upload area highlights when you drag a file over it to confirm it is ready to accept the drop. Only files with supported extensions (.pdf, .docx, .xlsx, .csv, .hwpx, .hwp) are accepted. Once a file is selected, its name and size are displayed below the upload area. You can remove the selected file by clicking the X button and choose a different one.

■ Step 2: Choose Output Format
Two output formats are available:
① Text: Extracts content as plain text. Tables and spreadsheet data are output with tab-separated columns. Best for simple content extraction where structure does not matter.
② Markdown: Extracts content with structural preservation. Tables are converted to pipe-delimited Markdown table syntax with header separators. Multi-sheet Excel files show sheet names as section headers. Best for pasting into Markdown editors or documentation tools.

■ Step 3: Extract Text
Click the 'Extract Text' button to begin processing. The file is uploaded to the server, processed, and the extracted text is returned to your browser. Processing time depends on the file size and complexity; large PDFs with many pages may take several seconds.

■ Step 4: Review and Save
After extraction, the results appear in a scrollable preview area. You have two options for saving:
• Copy: Copies the entire extracted text to your clipboard for pasting anywhere
• Download: Saves the result as a .txt file (Text mode) or .md file (Markdown mode)

■ Important Notes
• Uploaded files are immediately deleted from the server after processing
• Scanned PDFs (image-only, without embedded text) cannot be processed because this tool does not include OCR capability
• Very large or complex files may take up to 30 seconds to process
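The extension filter from Step 1, combined with the 10MB size limit shown on the upload form, can be sketched as a small pre-upload check. The names `ALLOWED_EXTENSIONS`, `MAX_BYTES`, and `validate_upload` are hypothetical, and both the case-insensitive comparison and the reading of "10MB" as 10 × 1024 × 1024 bytes are assumptions.

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".hwpx", ".hwp"}
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" interpreted as 10 MiB


def validate_upload(filename, size):
    """Return an error message, or None if the file may be uploaded."""
    # Assumption: extensions are compared case-insensitively.
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f"Unsupported file type: {ext or '(none)'}"
    if size > MAX_BYTES:
        return "File exceeds the 10MB limit"
    return None


print(validate_upload("report.PDF", 2_000_000))
print(validate_upload("scan.png", 2_000_000))
```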

Extraction Methods by Format

■ PDF Extraction
Uses the PyMuPDF library to extract text page by page. In Markdown mode, a two-level table detection system operates:
• Level 1, Structured Tables: Detects explicit table structures defined within the PDF document using PyMuPDF's native find_tables() method
• Level 2, Grid Table Detection: A coordinate-based algorithm analyzes text span positions to identify visually aligned data that forms informal tables. It clusters text spans by Y-coordinate (within a 12-pixel threshold), detects columns by identifying X-coordinate gaps (minimum 20 pixels), and requires at least 3 columns and 2 consecutive rows to classify a region as a table
All content elements (text blocks, detected tables, structured tables) are sorted by Y-coordinate to maintain proper reading order. Pages are separated by horizontal rule markers (---) in the output.

■ DOCX Extraction
Uses the python-docx library to sequentially extract paragraphs and tables from Word documents. Each paragraph's text is captured, and table cells are read row by row. In Markdown mode, tables are converted to pipe-separated format with header separators.

■ XLSX Extraction
Uses the openpyxl library in read-only mode for memory-efficient processing of Excel workbooks. All sheets are read sequentially. Empty rows are filtered out automatically. In Markdown mode, when a workbook contains multiple sheets, each sheet is prefixed with its name as a Markdown heading (## Sheet Name).

■ CSV Extraction
Uses Python's built-in csv module with automatic encoding detection. The system attempts to read the file using three encodings in order: UTF-8, CP949, and EUC-KR. This handles both international and Korean-encoded CSV files. Empty rows are filtered, and data is output as tab-separated text or Markdown tables.

■ HWPX Extraction
Uses the python-hwpx library to parse modern Hangul Word Processor documents. HWPX files use an XML-based structure, allowing reliable text extraction without requiring Hancom Office installation. Paragraphs are extracted sequentially.

■ HWP Extraction
Uses the pyhwp library to parse legacy binary Hangul Word Processor files. The HWP5 binary format is transformed to UTF-8 text through a specialized TextTransform process. This handles the proprietary binary encoding used in older versions of Hancom Office.
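The Level 2 grid detection described under PDF Extraction can be illustrated with a simplified sketch. It is not the tool's implementation: spans are modeled as hypothetical `(x0, x1, y, text)` tuples rather than PyMuPDF objects, and only the stated thresholds (12 for rows, 20 for column gaps, at least 3 columns and 2 consecutive rows) come from the description above.

```python
Y_TOLERANCE = 12   # spans within this vertical distance share a row
MIN_X_GAP = 20     # horizontal gap that starts a new column
MIN_COLUMNS = 3
MIN_ROWS = 2


def group_rows(spans):
    """Cluster (x0, x1, y, text) spans into rows by Y-coordinate."""
    rows = []
    for span in sorted(spans, key=lambda s: (s[2], s[0])):
        # Compare against the first span of the current row to avoid drift.
        if rows and abs(span[2] - rows[-1][0][2]) <= Y_TOLERANCE:
            rows[-1].append(span)
        else:
            rows.append([span])
    return rows


def split_cells(row):
    """Merge spans closer than MIN_X_GAP; a larger gap starts a new cell."""
    cells, last_x1 = [], None
    for x0, x1, _y, text in sorted(row, key=lambda s: s[0]):
        if last_x1 is not None and x0 - last_x1 < MIN_X_GAP:
            cells[-1] += " " + text
        else:
            cells.append(text)
        last_x1 = x1
    return cells


def find_grid_tables(spans):
    """Return tables built from runs of consecutive rows with enough columns."""
    tables, run = [], []
    for row in group_rows(spans):
        cells = split_cells(row)
        if len(cells) >= MIN_COLUMNS:
            run.append(cells)
            continue
        if len(run) >= MIN_ROWS:
            tables.append(run)
        run = []
    if len(run) >= MIN_ROWS:
        tables.append(run)
    return tables


spans = [
    (50, 300, 100, "Quarterly summary"),
    (50, 90, 130, "Region"), (150, 190, 131, "Units"), (250, 300, 130, "Revenue"),
    (50, 80, 150, "East"), (150, 170, 150, "120"), (250, 290, 151, "4,800"),
    (50, 85, 170, "North"), (88, 120, 170, "West"),
    (150, 170, 170, "95"), (250, 290, 170, "3,900"),
]
for table in find_grid_tables(spans):
    for cells in table:
        print(cells)
```

This sketch does not check that every row in a run has the same number of columns, and it ignores merged cells, which is one reason complex layouts may not be reproduced exactly.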

Helpful Tips

■ Use Markdown Format for Spreadsheets
When extracting data from Excel (.xlsx) or CSV files, choose Markdown output for the best results. The table structure is preserved with column alignment and header separators, making the output immediately usable in documentation tools. Text format outputs tab-separated values, which can be harder to read and may not display correctly in all applications.

■ Maximize PDF Table Extraction
For PDFs containing tables, always select Markdown format. The coordinate-based grid table detection can identify tables even when they are created using text boxes and drawing objects rather than formal table structures. This is especially useful for government reports and financial documents that often use non-standard table formatting. However, extremely complex layouts with extensive cell merging may not be perfectly reproduced.

■ Working with Korean HWP/HWPX Files
This tool lets you access Hangul Word Processor content without having Hancom Office installed. This is particularly useful on macOS or Linux systems where Hancom Office is not available. For the best extraction accuracy, use HWPX format whenever possible, because it is XML-based and provides more reliable parsing than the legacy binary HWP format.

■ Identifying Scanned PDFs
If a PDF extraction returns empty or nearly empty results, the PDF likely contains only scanned images without an embedded text layer. You can verify this by trying to select text in the PDF using a standard PDF viewer. If text cannot be selected, it is a scanned document. In this case, you will need to use OCR software (such as Adobe Acrobat or Tesseract) to add a text layer before using this tool.

■ Handling Large Files
For very large PDFs with many pages, consider splitting the document into smaller sections before uploading. Most PDF viewers allow saving specific page ranges. This not only speeds up processing but also lets you focus on extracting just the content you need.

■ CSV Encoding Troubleshooting
The automatic encoding detection handles UTF-8, CP949, and EUC-KR, which covers the vast majority of CSV files encountered in Korean business contexts. If you encounter a file with a different encoding (such as Shift-JIS for Japanese), convert it to UTF-8 using a text editor before uploading. Most modern text editors (VS Code, Notepad++, Sublime Text) allow you to save files with a specific encoding.
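The try-each-encoding-in-order approach can be sketched with Python's standard library. This is a minimal illustration and not the tool's code: `read_csv_rows` is a hypothetical name, and raising an error when all three encodings fail is an assumption about how such a case might be handled.

```python
import csv
import io

ENCODINGS = ("utf-8", "cp949", "euc-kr")  # tried in this order


def read_csv_rows(data: bytes):
    """Decode with the first encoding that works, then parse and drop empty rows."""
    for encoding in ENCODINGS:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not this encoding, try the next one
        reader = csv.reader(io.StringIO(text, newline=""))
        rows = [row for row in reader if any(cell.strip() for cell in row)]
        return encoding, rows
    raise ValueError("Not valid UTF-8, CP949, or EUC-KR; convert the file to UTF-8 first")


sample = "이름,나이\n\n홍길동,30\n".encode("cp949")
encoding, rows = read_csv_rows(sample)
print(encoding, rows)
```

A fallback chain like this only detects encodings that fail to decode. Bytes in an unrelated encoding can sometimes decode without error and produce garbled text, which is why converting such files to UTF-8 beforehand is the safer route.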