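Extracting PDF Tables into Plain-Text Structured Data

This article documents how to use DuGuang's Cycle-CenterNet wired-table recognition model for table structure recognition, together with PaddleOCR for text recognition, to convert table images from PDFs into a table structure that large language models can understand.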

Project repository: https://github.com/EvannZhongg/Table_Extraction.git

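Project Overview

The main goal of this project is to extract parameter information from tables in PDF files by processing them as images, with a focus on structured tables. The pipeline has two stages:

Table structure recognition: the Cycle-CenterNet model locates every cell in the table (as polygon coordinates).
OCR text recognition: PaddleOCR recognizes all text in the image along with its position coordinates.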
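The functions later in this article index cells as axis-aligned rectangles (x0, y0, x1, y1), while the table model reports each cell as a polygon. A minimal sketch of that conversion, assuming each polygon arrives as a flat sequence of eight numbers (x1, y1, ..., x4, y4); the helper name `polygon_to_bbox` is hypothetical and not part of the project:

```python
def polygon_to_bbox(polygon):
    """Return the axis-aligned box (x0, y0, x1, y1) enclosing a polygon.

    Assumes `polygon` is a flat sequence [x1, y1, x2, y2, ...].
    """
    xs = polygon[0::2]
    ys = polygon[1::2]
    # A tuple is hashable, so the box can later be used as a dict key.
    return (min(xs), min(ys), max(xs), max(ys))


# A slightly skewed cell: the box is the smallest rectangle that covers it.
cell_polygon = [10.0, 20.0, 110.0, 22.0, 109.0, 60.0, 9.0, 58.0]
cell_box = polygon_to_bbox(cell_polygon)
print(cell_box)
```

Taking the enclosing rectangle slightly enlarges skewed cells, which is usually harmless for scanned tables that are close to upright.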

Environment and Setup

1. Table recognition model configuration

git lfs install
git clone https://www.modelscope.cn/iic/cv_dla34_table-structure-recognition_cycle-centernet.git

model_path = "your_absolute_path_to_cv_dla34_table-structure-recognition_cycle-centernet"

Use an absolute path when setting a custom path.

For example: model_path = "D:/Table_Extraction/cv_dla34_table-structure-recognition_cycle-centernet"

2. PaddleOCR configuration

from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_gpu=True,
    lang='ch',
    det_model_dir='your_absolute_path_to_ch_PP-OCRv4_det_infer',
    rec_model_dir='your_absolute_path_to_ch_PP-OCRv4_rec_infer',
    cls_model_dir='your_absolute_path_to_ch_ppocr_mobile_v2.0_cls_infer'
)

Use absolute paths here as well.

For example: det_model_dir='D:/Table_Extraction/PaddleOCR/models/ch_PP-OCRv4_det_infer/'
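The core functions read each OCR result as a dictionary with a 'coords' rectangle and a 'text' string. A hedged sketch of how raw OCR output might be reshaped into that form, assuming each detection is a pair of a four-point box and a (text, confidence) tuple, which is how PaddleOCR 2.x commonly lays out one page of results. The helper name `to_text_boxes` and the sample data are hypothetical:

```python
def to_text_boxes(page_result):
    """Convert [[points, (text, score)], ...] into dicts with 'coords' and 'text'."""
    boxes = []
    for points, (text, _score) in page_result:
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        # A tuple keeps the coordinates hashable, so they can serve as dict keys.
        boxes.append({'coords': (min(xs), min(ys), max(xs), max(ys)), 'text': text})
    return boxes


# Hand-written sample in the assumed layout (not real OCR output).
sample_page = [
    [[[12, 8], [90, 8], [90, 30], [12, 30]], ('PartNumber', 0.99)],
    [[[120, 9], [180, 10], [180, 31], [120, 30]], ('0.35', 0.97)],
]
ocr_results = to_text_boxes(sample_page)
print(ocr_results)
```

If your PaddleOCR version returns a different structure, only the unpacking in the loop needs to change.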

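Core Functions in Detail

1. Function `calculate_iot(cell, text)`

Purpose: compute the IoT (Intersection over Text) between an OCR text box and a table cell. Unlike IoU (intersection over union), the overlap is divided by the text box's area only.

How it works:

Given the coordinates of the two rectangles, compute the area of their overlapping region.
Divide the overlap area by the text box area to get the IoT value.
The larger the value, the more of the text box lies inside the cell.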

def calculate_iot(cell, text):
    # Corners of the overlap rectangle; all boxes are (x0, y0, x1, y1).
    intersection_x1 = max(cell[0], text['coords'][0])
    intersection_y1 = max(cell[1], text['coords'][1])
    intersection_x2 = min(cell[2], text['coords'][2])
    intersection_y2 = min(cell[3], text['coords'][3])

    # No overlap: the rectangles are disjoint or only touch along an edge.
    if intersection_x1 >= intersection_x2 or intersection_y1 >= intersection_y2:
        return 0.0

    intersection_area = (intersection_x2 - intersection_x1) * (intersection_y2 - intersection_y1)
    text_area = (text['coords'][2] - text['coords'][0]) * (text['coords'][3] - text['coords'][1])
    # Share of the text box that falls inside the cell.
    return intersection_area / text_area
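To see why the overlap is divided by the text area rather than by the union, here is a small self-contained comparison. It is an illustrative sketch: `iot_and_iou` is a hypothetical helper that works on plain (x0, y0, x1, y1) tuples and is not part of the project.

```python
def iot_and_iou(cell, box):
    """Return (IoT, IoU) for two (x0, y0, x1, y1) rectangles."""
    w = min(cell[2], box[2]) - max(cell[0], box[0])
    h = min(cell[3], box[3]) - max(cell[1], box[1])
    if w <= 0 or h <= 0:
        return 0.0, 0.0
    inter = w * h
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    cell_area = (cell[2] - cell[0]) * (cell[3] - cell[1])
    return inter / box_area, inter / (cell_area + box_area - inter)


# A short piece of text sitting entirely inside a much larger cell.
iot, iou = iot_and_iou((0, 0, 100, 50), (10, 10, 30, 20))
print(iot, iou)  # 1.0 0.04
```

Because the text box is fully contained, IoT is 1.0 even though IoU is only 0.04, so a fixed IoT threshold behaves the same for small and large cells.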

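2. Function `merge_text_into_cells(cell_coords, ocr_results)`

Purpose: assign the text recognized by OCR to the corresponding table cells.

How it works:

For each cell, iterate over all OCR text boxes and compute the IoT.
If IoT > 0.5, the text is considered to belong to that cell.
Text whose IoT with every cell is below 0.1 is recorded separately as non-table content.
Text that falls between the two thresholds for every cell is assigned to neither group and is dropped.
The text pieces belonging to each cell are concatenated into a single string.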

def merge_text_into_cells(cell_coords, ocr_results):
    # Cells and OCR coords are used as dict keys, so they must be hashable (tuples).
    cell_text_dict = {cell: [] for cell in cell_coords}
    noncell_text_dict = {}

    # Text that lies more than half inside a cell belongs to that cell.
    for cell in cell_coords:
        for result in ocr_results:
            if calculate_iot(cell, result) > 0.5:
                cell_text_dict[cell].append(result['text'])

    # Text that barely touches any cell is kept as non-table content.
    for result in ocr_results:
        if all(calculate_iot(cell, result) < 0.1 for cell in cell_coords):
            noncell_text_dict[result['coords']] = result['text']

    merged_text = {}
    for cell, texts in cell_text_dict.items():
        merged_text[cell] = ''.join(texts).strip()
    for coords, text in noncell_text_dict.items():
        merged_text[coords] = text.strip()  # text is already a single string

    return merged_text
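The two thresholds leave three possible outcomes for a text box. A minimal sketch that makes them explicit, using a hypothetical `classify` helper and a compact restatement of IoT on plain tuples (neither is part of the project):

```python
def iot(cell, box):
    """Share of the text box's area that lies inside the cell."""
    w = min(cell[2], box[2]) - max(cell[0], box[0])
    h = min(cell[3], box[3]) - max(cell[1], box[1])
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / ((box[2] - box[0]) * (box[3] - box[1]))


def classify(cells, box, inside=0.5, outside=0.1):
    """Return 'cell', 'non-cell', or 'dropped' for one text box."""
    scores = [iot(cell, box) for cell in cells]
    if any(s > inside for s in scores):
        return 'cell'
    if all(s < outside for s in scores):
        return 'non-cell'
    return 'dropped'


cells = [(0, 0, 100, 50), (100, 0, 200, 50)]
labels = [
    classify(cells, (10, 10, 60, 30)),    # fully inside the first cell
    classify(cells, (300, 10, 360, 30)),  # far away from both cells
    classify(cells, (60, 10, 140, 30)),   # straddles the shared border, IoT 0.5 each
]
print(labels)  # ['cell', 'non-cell', 'dropped']
```

The third case shows the gap: a text box split evenly across two cells never exceeds 0.5 for either, so it is worth checking the output for missing text when cell borders are detected imprecisely.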

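3. Function `adjust_coordinates(merged_text, image_path)`

Purpose: cluster cells whose y coordinates are close together and give them a common y value, so that they can later be sorted row by row.

How it works:

The taller the image, the larger the allowed y deviation; height / 100 is used as the tolerance.
Cells whose y value is within the threshold of a group's first y value are placed in that group.
Within each group, y is set to the group's average y, so all of its cells sit on the same horizontal row.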

from PIL import Image

def adjust_coordinates(merged_text, image_path):
    image = Image.open(image_path)
    width, height = image.size
    # The tolerance scales with the image: 1% of its height.
    threshold = height / 100
    groups = {}  # first y seen in a row -> [(coordinates, text), ...]

    for coordinates, text in merged_text.items():
        found_group = False
        for group_y in groups.keys():
            if abs(coordinates[1] - group_y) <= threshold:
                groups[group_y].append((coordinates, text))
                found_group = True
                break
        if not found_group:
            groups[coordinates[1]] = [(coordinates, text)]

    # Rewrite each cell's top y to its row average so a row shares one y value.
    adjusted_coordinates = {}
    for group_y, group_coords in groups.items():
        avg_y = sum(coord[0][1] for coord in group_coords) / len(group_coords)
        for i in group_coords:
            adjusted_coordinates[(i[0][0], avg_y, i[0][2], i[0][3])] = i[1]

    return adjusted_coordinates
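The grouping logic can be tried without an image. The sketch below applies the same greedy rule to a bare list of y values; `cluster_rows` is a hypothetical helper written for illustration, and the image height is passed in directly instead of being read from a file:

```python
def cluster_rows(ys, image_height):
    """Group y values greedily; each group is keyed by its first member."""
    threshold = image_height / 100
    groups = {}
    for y in ys:
        for anchor in groups:
            if abs(y - anchor) <= threshold:
                groups[anchor].append(y)
                break
        else:
            # No existing group is close enough, so this y starts a new row.
            groups[y] = [y]
    # Map each group's first y to the mean of its members.
    return {anchor: sum(members) / len(members) for anchor, members in groups.items()}


# Image 1000 px tall, so the tolerance is 10 px.
row_y = cluster_rows([100, 104, 98, 150, 155], image_height=1000)
print(row_y)
```

Membership is tested against the first y value of each group rather than a running mean, so the result can depend on the order in which cells are visited.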

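4. Function `draw_text_boxes(image_path, boxes, texts)`

Purpose: draw the cell boxes and their text onto an image as a visual annotation.

How it works:

A blank canvas the same size as the input image is created with PIL, and the boxes and text are drawn on it.
If the text is wider than its cell, textwrap is used to wrap it automatically.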

import textwrap

import numpy as np
from PIL import Image, ImageDraw, ImageFont

def draw_text_boxes(image_path, boxes, texts):
    # The source image is opened only for its size; drawing happens on a white canvas.
    img = Image.open(image_path)
    img = Image.new('RGB', img.size, (255, 255, 255))
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("./chinese_cht.ttf", size=15)

    for box, text in zip(boxes, texts):
        x0, y0, x1, y1 = box
        x0, x1 = sorted([x0, x1])
        y0, y1 = sorted([y0, y1])
        normalized_box = (x0, y0, x1, y1)
        draw.rectangle(normalized_box, outline='red', width=2)

        # Wrap when the rendered text is wider than the cell (skip zero-width boxes).
        text_len = draw.textbbox((x0, y0), text, font=font)
        if x1 > x0 and (text_len[2] - text_len[0]) > (x1 - x0):
            text = '\n'.join(textwrap.wrap(text, width=int(
                np.ceil(len(text) / np.ceil((text_len[2] - text_len[0]) / (x1 - x0))))))

        draw.text((x0, y0), text, font=font, fill='black')

    img.save('your_image_storage_path/output.png')
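The wrap width in the function above comes from two rounded-up divisions: how many lines are needed, then how many characters go on each line. A minimal sketch of that arithmetic in isolation, with the pixel widths passed in as plain numbers; `wrap_to_box` is a hypothetical helper, and the estimate assumes characters of roughly equal width:

```python
import math
import textwrap


def wrap_to_box(text, text_px, box_px):
    """Wrap `text` so that each line should roughly fit a box `box_px` wide."""
    if box_px <= 0 or text_px <= box_px:
        return text
    lines_needed = math.ceil(text_px / box_px)
    chars_per_line = math.ceil(len(text) / lines_needed)
    return '\n'.join(textwrap.wrap(text, width=chars_per_line))


# 30 characters measured at 250 px must fit a 100 px cell:
# ceil(250 / 100) = 3 lines, ceil(30 / 3) = 10 characters per line.
wrapped = wrap_to_box('A' * 30, text_px=250, box_px=100)
print(wrapped)
```

With mixed Chinese and Latin text the character widths differ, so some wrapped lines may still overflow the cell slightly.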

5. Final text output (structured rows)

Purpose: sort all coordinates by y, then by x, and group them by row to output structured text.

adjusted_merged_text_sorted = sorted(adjusted_merged_text.items(), key=lambda x: (x[0][1], x[0][0]))
adjusted_merged_text_sorted_group = {}
for coordinates, text in adjusted_merged_text_sorted:
    if coordinates[1] not in adjusted_merged_text_sorted_group:
        adjusted_merged_text_sorted_group[coordinates[1]] = [text]
    else:
        adjusted_merged_text_sorted_group[coordinates[1]].append(text)

for text_list in adjusted_merged_text_sorted_group.values():
    print(' | '.join(text_list))
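A small hand-made input shows what this step produces. The dictionary below is illustrative sample data, not output from the project; its keys follow the (x0, y, x1, y1) layout after row alignment:

```python
# Keys are (x0, y, x1, y1) after row alignment, values are cell text.
adjusted_merged_text = {
    (120, 10.0, 200, 30): 'Typical',
    (10, 10.0, 110, 30): 'PartNumber',
    (120, 40.0, 200, 60): '0.35',
    (10, 40.0, 110, 60): 'APD0810-203',
}

# Sort top to bottom, then left to right, and collect the text of each row.
rows = {}
for coords, text in sorted(adjusted_merged_text.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    rows.setdefault(coords[1], []).append(text)

lines = [' | '.join(cells) for cells in rows.values()]
print('\n'.join(lines))
```

Grouping by exact y value only works because the previous step gave every cell in a row the same averaged y.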

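Usage

Run the script and select a table image:

python Table_Extraction.py

or

python main.py

The annotated image is written to the project directory, and the text result is printed to the terminal.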

Results

The output is shown below. You can use it to test whether a large language model can make sense of the table:

PartNumber | TotalCapacitance(Ct) @ 50 V,(pF) | TotalCapacitance(Ct) @ 0 V,(pF) | SeriesResistance (Rs),@10 mA,(②) | MinorityCarrierLifetime (TL)@ 10 mA(ns) | VoltageRating2(M) | I-RegionThickness(μm) | ThermalResistance(0JC)(°C/W)
Maximum | Typical | Maximum | Typical | Minimum | Nominal | Maximum

Switching Applications(continued)
| APD0810-203 | 0.35 | 0.40 | 1.5 | 160 | 100 | 8 | 174
| APD0810-210 | 0.40 | 0.45 | 1.5 | 160 | 100 | 8 | 75
| APD0810-219 | 0.35 | 0.40 | 1.5 | 160 | 100 | 8 | 143
| APD0810-240 | 0.35 | 0.40 | 1.5 | 160 | 100 | 8 | 155
| APD1505-203 | 0.40 | 0.45@10V | 2.5 | 350 | 200 | 15 | 172
| APD1505-210 | 0.40 | 0.45@10 V | 2.5 | 350 | 200 | 15 | 74
| APD1505-219 | 0.40 | 0.45@10V | 2.5 | 350 | 200 | 15 | 142
| APD1505-240 | 0.40 | 0.45 @ 10 V | 2.5 | 350 | 200 | 15 | 150
| APD1510-203 | 0.35 | 0.40 | 2.0 | 300 | 200 | 15 | 168
| APD1510-210 | 0.35 | 0.40 | 2.0 | 300 | 200 | 15 | 70
| APD1510-219 | 0.35 | 0.40 | 2.0 | 300 | 200 | 15 | 137
| APD1510-240 | 0.35 | 0.40 | 2.0 | 300 | 200 | 15 | 149
| APD1520-203 | 0.40 | 0.45 | 1.2 | 900 | 200 | 15 | 155
| APD1520-210 | 0.40 | 0.45 | 1.2 | 900 | 200 | 15 | 57
| APD1520-219 | 0.45 | 0.50 | 1.2 | 900 | 200 | 15 | 124
| APD1520-240 | 0.40 | 0.45 | 1.2 | 900 | 200 | 15 | 136

AttenuatorApplications
| APD2220-203 | 0.45 | 0.50 | 4.0 | 100 | 100 | 50 | 132
| APD2220-210 | 0.45 | 0.50 | 4.0 | 100 | 100 | 50 | 32
| APD2220-219 | 0.40 | 0.45 | 4.0 | 100 | 100 | 50 | 104
| APD2220-240 | 0.40 | 0.45 | 4.0 | 100 | 100 | 50 | 115

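Additional Notes

The font file chinese_cht.ttf must exist, or be replaced with a Chinese font available on your system.
Tables with complex structures are still not handled well by the current model.

The project code is based on wyf3 and PaddleOCR.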
