Text Extractor
Extract text from PDF, Word, Excel, CSV, HWP, and HWPX documents.
Supported: PDF, DOCX, XLSX, CSV, HWP, HWPX
Maximum file size: 10MB

Text Extractor Guide
What is the Text Extractor?
The Text Extractor pulls text content out of several document formats and converts it to plain text (TXT) or Markdown (MD). You use it from your web browser, with no software to install; the file itself is uploaded to the server for processing.
■ Supported File Formats
• PDF — Extracts text and tables from digital PDF documents, including advanced coordinate-based table detection for informal table layouts
• DOCX — Extracts paragraphs and tables from Microsoft Word documents while preserving document structure
• XLSX — Reads all sheets from Microsoft Excel workbooks, including multi-sheet support with individual sheet headers
• CSV — Reads comma-separated value files with automatic encoding detection (UTF-8, CP949, EUC-KR) for seamless Korean text support
• HWPX — Extracts text from modern Hangul Word Processor documents (XML-based format) without requiring Hancom Office installation
• HWP — Extracts text from legacy Hangul Word Processor documents (binary format) using specialized binary parsing
This tool is particularly useful when you need to quickly pull text content from documents for repurposing. Common use cases include extracting report content for use in presentations, converting spreadsheet data into text for data processing, pulling text from Korean government documents in HWP format, and creating Markdown versions of PDF tables for use in knowledge management tools.
In Markdown output mode, the extraction preserves table structure by converting document tables into Markdown tables with pipe separators and header dividers. This makes it easy to paste extracted content directly into Markdown-compatible tools such as Notion, Obsidian, or GitHub.
How to Use
■ Step 1: Upload Your File
Upload a document by clicking the upload area to open a file selection dialog, or simply drag and drop your file onto the upload zone. The upload area will highlight when you drag a file over it to confirm it is ready to accept the drop. Only files with supported extensions (.pdf, .docx, .xlsx, .csv, .hwpx, .hwp) will be accepted.
Once a file is selected, its name and size will be displayed below the upload area. You can remove the selected file by clicking the X button and choose a different one.
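The acceptance rule described in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names check_upload, SUPPORTED_EXTENSIONS, and MAX_FILE_SIZE are hypothetical, and "10MB" is read here as 10 × 1024 × 1024 bytes, which the page does not specify.

```python
from pathlib import Path

# Extensions and size limit as listed on this page; the names are hypothetical.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".hwpx", ".hwp"}
MAX_FILE_SIZE = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


def check_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (accepted, reason) for a candidate upload."""
    ext = Path(filename).suffix.lower()  # case-insensitive extension match
    if ext not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported extension: {ext or '(none)'}"
    if size_bytes > MAX_FILE_SIZE:
        return False, "file exceeds the 10 MB limit"
    return True, "ok"
```

For example, check_upload("report.PDF", 1024) is accepted, while check_upload("notes.txt", 10) is rejected because of its extension.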
■ Step 2: Choose Output Format
Two output formats are available:
① Text — Extracts content as pure plain text. Tables and spreadsheet data are output with tab-separated columns. Best for simple content extraction where structure does not matter.
② Markdown — Extracts content with structural preservation. Tables are converted to pipe-delimited Markdown table syntax with header separators. Multi-sheet Excel files show sheet names as section headers. Best for pasting into Markdown editors or documentation tools.
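To make the difference between the two formats concrete, here is a minimal Python sketch that renders the same rows both ways. The function names are hypothetical, and the pipe escaping is an assumption about sensible behavior rather than a description of what this tool does.

```python
def rows_to_text(rows):
    """Text mode: one line per row, cells separated by tabs."""
    return "\n".join("\t".join(str(c) for c in row) for row in rows)


def rows_to_markdown(rows):
    """Markdown mode: first row is the header, followed by a divider row."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and flatten newlines so cell text cannot break the table,
        # then pad short rows to the full column count.
        cells = [str(c).replace("|", "\\|").replace("\n", " ") for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    divider = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), divider, *map(fmt, body)])


rows = [["Name", "Qty"], ["Apple", 3]]
print(rows_to_text(rows))
print(rows_to_markdown(rows))
```

The Markdown call prints a three-line table: the header row, a `| --- | --- |` divider, and the data row.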
■ Step 3: Extract Text
Click the 'Extract Text' button to begin processing. The file is uploaded to the server, processed, and the extracted text is returned to your browser. Processing time depends on the file size and complexity — large PDFs with many pages may take several seconds.
■ Step 4: Review and Save
After extraction, the results appear in a preview area with scrollable content. You have two options for saving:
• Copy — Copies the entire extracted text to your clipboard for pasting anywhere
• Download — Saves the result as a .txt file (Text mode) or .md file (Markdown mode)
■ Important Notes
• Uploaded files are immediately deleted from the server after processing
• Scanned PDFs (image-only, without embedded text) cannot be processed — this tool does not include OCR capability
• Very large or complex files may take up to 30 seconds to process
Extraction Methods by Format
■ PDF Extraction
Uses the PyMuPDF library to extract text page by page. In Markdown mode, table detection runs at two levels:
• Level 1 — Structured Tables: Detects explicit table structures defined within the PDF document using PyMuPDF's native find_tables() method
• Level 2 — Grid Table Detection: An advanced coordinate-based algorithm analyzes text span positions to identify visually aligned data that forms informal tables. It clusters text spans by Y-coordinate (within a 12-pixel threshold), detects columns by identifying X-coordinate gaps (minimum 20 pixels), and requires at least 3 columns and 2 consecutive rows to classify a region as a table
All content elements (text blocks, detected tables, structured tables) are sorted by Y-coordinate to maintain proper reading order. Pages are separated by horizontal rule markers (---) in the output.
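The Level 2 grid detection described above can be approximated as follows. This is a sketch built only from the thresholds quoted in the text, not the tool's actual code: the Span tuple and function names are hypothetical, "within 12" is read as a distance of at most 12 from the first span of the row, and "minimum 20" as a gap of at least 20. In practice the span positions would come from the PDF library; here they are plain numbers so the logic can be run on its own.

```python
from collections import namedtuple

# Hypothetical span record: left edge, right edge, vertical position, text.
Span = namedtuple("Span", "x0 x1 y text")

Y_TOLERANCE = 12  # spans this close vertically share a row (value from the text)
MIN_X_GAP = 20    # horizontal gap that separates two columns (value from the text)
MIN_COLUMNS = 3
MIN_ROWS = 2


def cluster_rows(spans):
    """Group spans into rows, comparing each span to the first span of the current row."""
    rows = []
    for span in sorted(spans, key=lambda s: (s.y, s.x0)):
        if rows and abs(span.y - rows[-1][0].y) <= Y_TOLERANCE:
            rows[-1].append(span)
        else:
            rows.append([span])
    return [sorted(row, key=lambda s: s.x0) for row in rows]


def split_columns(row):
    """Merge neighbouring spans into cells; a gap of MIN_X_GAP or more starts a new cell."""
    cells = [[row[0]]]
    for prev, cur in zip(row, row[1:]):
        if cur.x0 - prev.x1 >= MIN_X_GAP:
            cells.append([cur])
        else:
            cells[-1].append(cur)
    return [" ".join(s.text for s in cell) for cell in cells]


def detect_tables(spans):
    """Return runs of at least MIN_ROWS consecutive rows that each have at least MIN_COLUMNS cells."""
    tables, current = [], []
    for row in cluster_rows(spans):
        cells = split_columns(row)
        if len(cells) >= MIN_COLUMNS:
            current.append(cells)
            continue
        if len(current) >= MIN_ROWS:
            tables.append(current)
        current = []
    if len(current) >= MIN_ROWS:
        tables.append(current)
    return tables
```

Two aligned rows of three cells followed by an ordinary paragraph line yield one table; a single three-column row on its own is not enough and is left as text.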
■ DOCX Extraction
Uses the python-docx library to sequentially extract paragraphs and tables from Word documents. Each paragraph's text is captured, and table cells are read row by row. In Markdown mode, tables are converted to pipe-separated format with proper header separators.
■ XLSX Extraction
Uses the openpyxl library in read-only mode for memory-efficient processing of Excel workbooks. All sheets are read sequentially. Empty rows are filtered out automatically. In Markdown mode, when a workbook contains multiple sheets, each sheet is prefixed with its name as a Markdown heading (## Sheet Name).
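A rough Python sketch of this flow is shown below, assuming openpyxl is installed. The function names are hypothetical; load_workbook with read_only=True and iter_rows(values_only=True) are real openpyxl calls, but how this tool combines them is not documented here, so treat the details (such as data_only=True and the blank-row test) as assumptions.

```python
def read_workbook(path):
    """Read every sheet into {sheet name: list of rows}. Requires openpyxl."""
    from openpyxl import load_workbook  # lazy import: the helpers below work without it

    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        return {ws.title: [list(r) for r in ws.iter_rows(values_only=True)]
                for ws in wb.worksheets}
    finally:
        wb.close()  # read-only workbooks keep the file open until closed


def drop_empty_rows(rows):
    """Keep rows that contain at least one non-blank cell."""
    return [row for row in rows
            if any(c is not None and str(c).strip() for c in row)]


def markdown_table(rows):
    """Render rows as a pipe table, treating the first row as the header."""
    width = max(len(r) for r in rows)

    def line(cells):
        cells = ["" if c is None else str(c) for c in cells]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    return "\n".join([line(rows[0]), line(["---"] * width), *map(line, rows[1:])])


def sheets_to_markdown(sheets):
    """Render {sheet name: rows}; add '## name' headings only for multi-sheet workbooks."""
    parts = []
    for name, rows in sheets.items():
        rows = drop_empty_rows(rows)
        if not rows:
            continue
        table = markdown_table(rows)
        parts.append(f"## {name}\n\n{table}" if len(sheets) > 1 else table)
    return "\n\n".join(parts)
```

Typical use would be sheets_to_markdown(read_workbook("report.xlsx")), where the file name is a placeholder.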
■ CSV Extraction
Uses Python's built-in csv module with automatic encoding detection. The system attempts to read the file using three encodings in order: UTF-8, CP949, and EUC-KR. This covers both international and Korean-encoded CSV files. Empty rows are filtered out, and data is output as tab-separated text or Markdown tables.
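The encoding fallback can be sketched with the standard library alone. This is an illustration that assumes "detection" means trying each encoding in the stated order and keeping the first one that decodes the whole file without error; decode_csv is a hypothetical name.

```python
import csv
import io

ENCODINGS = ("utf-8", "cp949", "euc-kr")  # order given in the text


def decode_csv(data: bytes):
    """Return (rows, encoding) using the first encoding that decodes the whole file."""
    for encoding in ENCODINGS:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
        reader = csv.reader(io.StringIO(text, newline=""))
        # Drop rows that are empty or contain only blank cells.
        rows = [row for row in reader if any(cell.strip() for cell in row)]
        return rows, encoding
    raise ValueError("could not decode CSV as UTF-8, CP949, or EUC-KR")
```

Korean text saved as CP949 is not valid UTF-8 in general, so the first attempt fails and the second succeeds. Note that a file in some other legacy encoding may still decode under CP949 without error and produce garbled text, which is why unsupported encodings should be converted beforehand.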
■ HWPX Extraction
Uses the python-hwpx library to parse modern Hangul Word Processor documents. HWPX files use an XML-based structure, allowing reliable text extraction without requiring Hancom Office installation. Paragraphs are extracted sequentially.
■ HWP Extraction
Uses the pyhwp library to parse legacy binary Hangul Word Processor files. The HWP5 binary format is transformed to UTF-8 text through a specialized TextTransform process. This handles the proprietary binary encoding used in older versions of Hancom Office.
Helpful Tips
■ Use Markdown Format for Spreadsheets
When extracting data from Excel (.xlsx) or CSV files, choose Markdown output for the best results. The table structure is preserved with proper column alignment and header separators, making the output immediately usable in documentation tools. Text format outputs tab-separated values, which can be harder to read and may not display correctly in all applications.
■ Maximize PDF Table Extraction
For PDFs containing tables, always select Markdown format. The coordinate-based grid table detection can identify tables even when they are created using text boxes and drawing objects rather than formal table structures. This is especially useful for government reports and financial documents that often use non-standard table formatting. However, extremely complex layouts with extensive cell merging may not be perfectly reproduced.
■ Working with Korean HWP/HWPX Files
This tool lets you access Hangul Word Processor content without having Hancom Office installed, which is particularly useful on macOS or Linux systems where Hancom Office is often not installed. For the best extraction accuracy, use HWPX format whenever possible: it is XML-based and provides more reliable parsing than the legacy binary HWP format.
■ Identifying Scanned PDFs
If a PDF extraction returns empty or nearly empty results, the PDF likely contains only scanned images without an embedded text layer. You can verify this by trying to select text in the PDF using a standard PDF viewer — if text cannot be selected, it is a scanned document. In this case, you will need to use OCR software (such as Adobe Acrobat or Tesseract) to add a text layer before using this tool.
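If you would rather check a PDF yourself before uploading, a heuristic along these lines may help. It is a sketch, not part of this tool: the function names and the cutoff of 20 characters per page are arbitrary assumptions, and page_texts requires PyMuPDF to be installed locally.

```python
def page_texts(path):
    """Return the text of each page. Requires PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF; imported lazily so looks_scanned works without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def looks_scanned(texts, min_chars_per_page=20):
    """True when pages average fewer non-whitespace characters than the cutoff."""
    if not texts:
        return True
    total = sum(len("".join(t.split())) for t in texts)
    return total / len(texts) < min_chars_per_page
```

Calling looks_scanned(page_texts("document.pdf")), with a placeholder file name, returns True for a file that is likely image-only. A short document with very little text can trigger a false positive, so adjust the cutoff to your files.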
■ Handling Large Files
For very large PDFs with many pages, consider splitting the document into smaller sections before uploading. Most PDF viewers allow saving specific page ranges. This not only speeds up processing but also allows you to focus on extracting just the content you need.
■ CSV Encoding Troubleshooting
The automatic encoding detection handles UTF-8, CP949, and EUC-KR seamlessly, which covers the vast majority of CSV files encountered in Korean business contexts. If you encounter a file with a different encoding (such as Shift-JIS for Japanese), convert it to UTF-8 using a text editor before uploading. Most modern text editors (VS Code, Notepad++, Sublime Text) allow you to save files with a specific encoding.
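The same conversion can be scripted instead of done in an editor. The sketch below uses only the Python standard library; convert_to_utf8 is a hypothetical helper, and the default source encoding of Shift-JIS simply follows the example above.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="shift_jis"):
    """Re-encode a text file as UTF-8.

    Raises UnicodeDecodeError if the file is not valid in source_encoding.
    Working on bytes keeps the original line endings untouched.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, convert_to_utf8("orders_sjis.csv", "orders_utf8.csv") writes a UTF-8 copy that can then be uploaded; both file names are placeholders.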