TIL

Extracting specific pages from a PDF file with Python

# Install PyMuPDF (imported as fitz)
pip install PyMuPDF

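import fitz  # PyMuPDF

def find_pages_with_keyword(input_pdf_path, keyword):
    """Return the 1-based page numbers whose text contains keyword (case-insensitive)."""
    pdf = fitz.open(input_pdf_path)
    pages_with_keyword = []
    keyword = keyword.lower()

    for page_num in range(pdf.page_count):
        page = pdf.load_page(page_num)
        if keyword in page.get_text().lower():
            pages_with_keyword.append(page_num + 1)  # store 1-based page numbers

    pdf.close()
    return pages_with_keyword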

def print_pages_with_keyword(input_pdf_path, keyword):
    pages = find_pages_with_keyword(input_pdf_path, keyword)

    if not pages:
        print(f"No page containing the keyword '{keyword}' was found.")
    # Always return a list (empty when nothing matched) so callers never receive None.
    return pages

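# Usage example
input_pdf_path = "/content/[NAVER]사업보고서(2023.03.14).pdf"
start_keyword = "4. 재무제표"  # section heading, "4. Financial Statements"
end_keyword = "5. 재무제표 주석"  # next heading, "5. Notes to the Financial Statements"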

start = print_pages_with_keyword(input_pdf_path, start_keyword)
end = print_pages_with_keyword(input_pdf_path, end_keyword)
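
The search above is a case-insensitive substring test applied to each page's text. The following is a minimal sketch of that logic on plain strings, so it runs without a PDF. `pages_matching` and the sample `page_texts` are hypothetical stand-ins for illustration, not part of the original code or of PyMuPDF. It also shows how a section heading can match more than once, for example in a table of contents and again where the section actually starts.

```python
def pages_matching(page_texts, keyword):
    """Return 1-based page numbers whose text contains keyword, ignoring case."""
    needle = keyword.lower()
    return [
        page_num
        for page_num, text in enumerate(page_texts, start=1)
        if needle in text.lower()
    ]


# Hypothetical page texts standing in for what page.get_text() would return.
page_texts = [
    "Table of contents\n4. Financial Statements\n5. Notes to the Financial Statements",
    "3. Consolidated notes",
    "4. Financial Statements\nBalance sheet",
    "Income statement",
    "5. Notes to the Financial Statements\nAccounting policies",
]

start = pages_matching(page_texts, "4. financial statements")
end = pages_matching(page_texts, "5. notes to the financial statements")
print(start)  # [1, 3]
print(end)    # [1, 5]
```

In this made-up sample the first match of each keyword is the table of contents page, so the second match is the one that marks the real section boundary.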

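def extract_pages(input_pdf_path, output_pdf_path, start_page, end_page):
    """Save pages start_page..end_page (1-based, inclusive) as a new PDF."""
    if start_page > end_page:
        raise ValueError("start_page must not be greater than end_page")

    pdf = fitz.open(input_pdf_path)
    new_pdf = fitz.open()  # new, empty PDF

    # insert_pdf takes 0-based, inclusive page indices
    new_pdf.insert_pdf(pdf, from_page=start_page - 1, to_page=end_page - 1)

    new_pdf.save(output_pdf_path)
    new_pdf.close()
    pdf.close()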

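# Usage example
output_pdf_path = "/content/네이버_재무제표_2023.pdf"
# Index 1 takes the second page on which each keyword appears; the first match is
# presumably the table of contents, so adjust the index for your PDF.
start_page = start[1]
end_page = end[1] - 1  # last page before the next section begins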

extract_pages(input_pdf_path, output_pdf_path, start_page, end_page)

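input_pdf_path: path of the source PDF
start_keyword: keyword that appears on the first page of the section to extract
end_keyword: keyword that appears on the first page of the following section; extraction stops on the page before it

output_pdf_path: path and file name of the extracted PDF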
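
Indexing the search results directly (`start[1]`, `end[1]`) fails if a keyword matches fewer than two pages. Below is a minimal sketch of a guard for that step. `pick_page_range` and its `occurrence` parameter are hypothetical names introduced here, not part of the original code, and the sketch assumes the end keyword marks the first page of the following section.

```python
def pick_page_range(start_pages, end_pages, occurrence=1):
    """Return (start_page, end_page), both 1-based and inclusive.

    start_pages / end_pages are the page-number lists from the keyword search.
    occurrence is the 0-based index of the match to use; the default of 1
    mirrors the start[1] / end[1] choice in the usage example.
    """
    if len(start_pages) <= occurrence or len(end_pages) <= occurrence:
        raise ValueError("a keyword matched fewer pages than expected")

    start_page = start_pages[occurrence]
    end_page = end_pages[occurrence] - 1  # stop one page before the next section
    if end_page < start_page:
        raise ValueError("end keyword does not come after the start page")
    return start_page, end_page


# Made-up search results: each keyword appears on page 1 and once more later.
print(pick_page_range([1, 3], [1, 5]))  # (3, 4)
```

The returned pair can be passed straight to `extract_pages` as `start_page` and `end_page`.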
