Extracting Key-Value Pairs from a Form Document - Amazon Textract

The following Python example shows how to extract key-value pairs in form documents from Block objects that are stored in a map. Block objects are returned from a call to AnalyzeDocument. For more information, see Form Data (Key-Value Pairs).

You use the following functions:

  • get_kv_map – Calls AnalyzeDocument and stores the KEY and VALUE Block objects in a map.

  • get_kv_relationship and find_value_block – Construct the key-value relationships from the map.
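
To make the linking step concrete, the following is a minimal offline sketch of how a KEY block points to its VALUE block and to its WORD children through Relationships. It does not call AnalyzeDocument. SAMPLE_BLOCKS is hand-written illustration data, not real Amazon Textract output, and the helper names ids_of, text_of, and pair_up are hypothetical names chosen for this sketch.

```python
from collections import defaultdict

# Hand-written sample data, NOT real Textract output: one KEY block, one
# VALUE block, and the WORD blocks they reference through 'Relationships'.
SAMPLE_BLOCKS = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Full"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Doe"},
]


def ids_of(block, rel_type):
    """Return the IDs listed under one relationship type of a block."""
    return [block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type
            for block_id in rel["Ids"]]


def text_of(block, block_map):
    """Join the text of the WORD children of a KEY or VALUE block."""
    children = (block_map[i] for i in ids_of(block, "CHILD"))
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def pair_up(blocks):
    """Map each key's text to the list of value texts linked to it."""
    block_map = {b["Id"]: b for b in blocks}
    pairs = defaultdict(list)
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block["EntityTypes"])
        if is_key:
            key_text = text_of(block, block_map)
            for value_id in ids_of(block, "VALUE"):
                pairs[key_text].append(text_of(block_map[value_id], block_map))
    return dict(pairs)


result = pair_up(SAMPLE_BLOCKS)
print(result)
```

Unlike the full example in the procedure, this sketch joins words with single spaces and leaves no trailing space. Otherwise it follows the same path: from the KEY block through the VALUE relationship to the VALUE block, then through CHILD relationships to the words.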

To extract key-value pairs from a form document
  1. Configure your environment. For more information, see Prerequisites.

  2. Save the following example code to a file named textract_python_kv_parser.py. In the function get_kv_map, replace profile-name with the name of a profile that can assume the role, and replace region with the AWS Region in which you want to run the code.

    import re
    import sys
    from collections import defaultdict

    import boto3


    def get_kv_map(file_name):
        # Load the document image as bytes
        with open(file_name, 'rb') as file:
            img_test = file.read()
            bytes_test = bytearray(img_test)
            print('Image loaded', file_name)

        # Process using image bytes
        session = boto3.Session(profile_name='profile-name')
        client = session.client('textract', region_name='region')
        response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])

        # Get the text blocks
        blocks = response['Blocks']

        # Build the key and value maps, plus a map of every block by ID
        key_map = {}
        value_map = {}
        block_map = {}
        for block in blocks:
            block_id = block['Id']
            block_map[block_id] = block
            if block['BlockType'] == "KEY_VALUE_SET":
                if 'KEY' in block['EntityTypes']:
                    key_map[block_id] = block
                else:
                    value_map[block_id] = block

        return key_map, value_map, block_map


    def get_kv_relationship(key_map, value_map, block_map):
        # Map each key's text to the list of value texts found for it
        kvs = defaultdict(list)
        for block_id, key_block in key_map.items():
            value_block = find_value_block(key_block, value_map)
            key = get_text(key_block, block_map)
            val = get_text(value_block, block_map)
            kvs[key].append(val)
        return kvs


    def find_value_block(key_block, value_map):
        # Return the VALUE block linked to a KEY block, or None if there is none
        value_block = None
        for relationship in key_block.get('Relationships', []):
            if relationship['Type'] == 'VALUE':
                for value_id in relationship['Ids']:
                    value_block = value_map[value_id]
        return value_block


    def get_text(result, blocks_map):
        # Concatenate the text of the block's WORD children;
        # a selected SELECTION_ELEMENT (such as a check box) becomes 'X'
        text = ''
        if result and 'Relationships' in result:
            for relationship in result['Relationships']:
                if relationship['Type'] == 'CHILD':
                    for child_id in relationship['Ids']:
                        word = blocks_map[child_id]
                        if word['BlockType'] == 'WORD':
                            text += word['Text'] + ' '
                        if word['BlockType'] == 'SELECTION_ELEMENT':
                            if word['SelectionStatus'] == 'SELECTED':
                                text += 'X '
        return text


    def print_kvs(kvs):
        for key, value in kvs.items():
            print(key, ":", value)


    def search_value(kvs, search_key):
        # Return the value of the first key that matches search_key
        # (a case-insensitive regular expression), or None if no key matches
        for key, value in kvs.items():
            if re.search(search_key, key, re.IGNORECASE):
                return value


    def main(file_name):
        key_map, value_map, block_map = get_kv_map(file_name)

        # Get Key Value relationship
        kvs = get_kv_relationship(key_map, value_map, block_map)
        print("\n\n== FOUND KEY : VALUE pairs ===\n")
        print_kvs(kvs)

        # Start searching a key value
        while input('\n Do you want to search a value for a key? (enter "n" for exit) ') != 'n':
            search_key = input('\n Enter a search key:')
            print('The value is:', search_value(kvs, search_key))


    if __name__ == "__main__":
        file_name = sys.argv[1]
        main(file_name)

  3. At the command prompt, enter the following command. Replace file with the document image file that you want to analyze.

    python textract_python_kv_parser.py file
  4. When you're prompted, enter a key that's in the input document. If the code detects the key, it displays the key's value.
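
The search step can be sketched on its own to show how matching behaves. In the example code, search_value passes the entered key to re.search with re.IGNORECASE, so the input is treated as a case-insensitive regular expression and the value of the first matching key is returned. The sample keys and values below are invented for illustration. Wrapping the input in re.escape is an optional variation for keys that contain regular expression metacharacters; it is not part of the original example.

```python
import re
from collections import defaultdict


def search_value(kvs, search_key):
    """Return the value list of the first key matching the regex, else None."""
    for key, value in kvs.items():
        if re.search(search_key, key, re.IGNORECASE):
            return value
    return None


# Invented sample pairs in the same shape the example code builds:
# key text mapped to a list of value texts, each with a trailing space.
kvs = defaultdict(list)
kvs["Full Name: "].append("Jane Doe ")
kvs["Phone (home): "].append("555-0100 ")

# A partial, case-insensitive match is enough to find a key.
by_name = search_value(kvs, "name")

# Unescaped parentheses form a regex group, so this pattern looks for
# "phone home" and does not match the literal key text.
unescaped = search_value(kvs, "phone (home)")

# Escaping the input makes the parentheses match literally.
escaped = search_value(kvs, re.escape("phone (home)"))

print(by_name, unescaped, escaped)
```

Because each value is a list, a key that appears more than once in the document returns every value collected for it.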