Extract text from PDF
Based on two Python libraries, pymupdf and camelot.
pymupdf can extract text and create a screenshot of a defined area of a PDF page. It can also extract images. camelot is good at identifying tables without visible borders. In this workflow, camelot identifies the coordinates of the table, and pymupdf is then used to create a screenshot of the table area.
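As a rough sketch of that workflow (not a definitive implementation): camelot reports table coordinates with the origin at the bottom-left of the page, while pymupdf measures from the top-left, so the box has to be flipped vertically before clipping. The function names, the output file naming, and the use of camelot's private `_bbox` attribute are my own assumptions for illustration. The sketch also assumes an unrotated page without a crop-box offset, and a PyMuPDF version recent enough to support `import pymupdf` (older versions use `import fitz`).

```python
def camelot_bbox_to_pymupdf(bbox, page_height):
    """Flip a (x1, y1, x2, y2) box from a bottom-left origin to a top-left origin."""
    x1, y1, x2, y2 = bbox
    return (x1, page_height - y2, x2, page_height - y1)


def extract_text_and_table_shots(pdf_path, page_number=1, out_prefix="table", dpi=150):
    """Return the page text and the file names of PNG screenshots of detected tables.

    Hypothetical helper: page_number is 1-based, as in camelot's `pages` argument.
    """
    # Imported here so the coordinate helper above works without the libraries.
    import camelot
    import pymupdf

    # flavor="stream" targets tables that have no visible borders.
    tables = camelot.read_pdf(pdf_path, pages=str(page_number), flavor="stream")

    doc = pymupdf.open(pdf_path)
    page = doc[page_number - 1]  # pymupdf pages are 0-based
    text = page.get_text()

    saved = []
    for i, table in enumerate(tables):
        # `_bbox` is a private camelot attribute (assumption: it may change between versions).
        rect = pymupdf.Rect(camelot_bbox_to_pymupdf(table._bbox, page.rect.height))
        pix = page.get_pixmap(clip=rect, dpi=dpi)
        name = f"{out_prefix}_{page_number}_{i}.png"
        pix.save(name)
        saved.append(name)

    doc.close()
    return text, saved
```

Calling `extract_text_and_table_shots("report.pdf", page_number=2)` (with a hypothetical `report.pdf`) would return the text of page 2 and write one PNG per table that camelot detects on that page.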
To convert a PDF to text, I first create a virtual environment:
conda create --name pdf2text python=3.11 -y
Then I install the two libraries in the environment:
conda activate pdf2text
pip install pymupdf camelot-py
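To confirm that both libraries are visible inside the activated environment, a small check like the one below can help. It is only a sketch: `missing_modules` is a hypothetical helper, and `pymupdf` and `camelot` are the import names I am assuming for the two packages (older PyMuPDF releases are imported as `fitz` instead).

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed with pip above.
    missing = missing_modules(["pymupdf", "camelot"])
    print("missing:", missing if missing else "none")
```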