
Extract Document details using Document Information Extraction service with ABAP

Dear community,

In this blog I want to show how to extract information from a document using AI and OCR, as implemented by the BTP service Document Information Extraction, by calling its API from an ABAP program.

The objective of this blog is not to show how the API works, as there are already good blogs covering that (Getting Started with Document Information Extraction Trial Service, or the Developer Mission), but to show how you can automate the API calls with only an ABAP program. I say only ABAP because there are already other integration scenarios in CPI (Document Information Extraction Integration with Email Server) or with iRPA, but here we will see a simple solution.

Here is the architecture:

The API calls you need to perform to send a file and receive the results are:

  1. Authenticate
  2. Send File
  3. Get the job status; if the job is still processing the document, wait until it's done
  4. Get JSON with fields extracted
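
Step 3 is essentially a polling loop. As a minimal sketch, assuming we want to cap the number of status checks so the report cannot wait forever, it could look like this. The report name, the class lcl_status_stub, the intermediate status PENDING and the limit of 20 attempts are made up for illustration; the stub only simulates the status call and does not talk to the service.

```abap
REPORT zdie_poll_sketch.

" Assumption: upper bound on status checks, chosen arbitrarily
CONSTANTS gc_max_attempts TYPE i VALUE 20.

" Hypothetical stand-in for the real status call. A real version would
" send GET <api path>/document/jobs/<job id> and return the job status.
CLASS lcl_status_stub DEFINITION FINAL.
  PUBLIC SECTION.
    METHODS get_status RETURNING VALUE(rv_status) TYPE string.
  PRIVATE SECTION.
    DATA mv_calls TYPE i.
ENDCLASS.

CLASS lcl_status_stub IMPLEMENTATION.
  METHOD get_status.
    " Simulate a job that finishes on the third check
    mv_calls = mv_calls + 1.
    rv_status = COND #( WHEN mv_calls < 3 THEN `PENDING` ELSE `DONE` ).
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  DATA(lo_stub)     = NEW lcl_status_stub( ).
  DATA(lv_status)   = lo_stub->get_status( ).
  DATA(lv_attempts) = 1.

  " Poll until the job reaches a final status or the attempt limit is hit
  WHILE lv_status <> `DONE` AND lv_status <> `FAILED`
    AND lv_attempts < gc_max_attempts.
    WAIT UP TO 3 SECONDS.
    lv_status   = lo_stub->get_status( ).
    lv_attempts = lv_attempts + 1.
  ENDWHILE.

  WRITE: / |Final status after { lv_attempts } attempt(s): { lv_status }|.
```

The attempt counter is the only addition compared with a plain wait loop; if the limit is reached, the last status is still available so the caller can decide what to do next.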

If you want to test this solution, you have to create the Document Information Extraction service instance; please follow this blog from Joni Liu.

You need to create a destination in SM59 for the authentication:

Host: d7d51f5atrial.authentication.us10.hana.ondemand.com

Port: 443

User: <clientid from instance service key>

Pass: <clientsecret from instance service key>

And here is the program. It requires a PDF file, sends it to the service requesting the fields documentNumber, purchaseOrderNumber and grossAmount, and waits for the response. After getting the JSON, it writes the values read by the service.

*&---------------------------------------------------------------------*
*& Report ZTEST_DOCUMENT_INFORMATION_EXT
*&---------------------------------------------------------------------*
*& PoC - Sends a file to Document Information Extraction BTP Service
*& Reads the file from Desktop and sends through API
*&---------------------------------------------------------------------*
REPORT ztest_document_information_ext.

CLASS zcl_die DEFINITION DEFERRED.

TYPES: BEGIN OF ty_filetab,
         value TYPE x,
       END OF ty_filetab.

DATA lr_die TYPE REF TO zcl_die.
DATA: lv_file_name    TYPE string,
      lv_rc           TYPE i,
      lt_file         TYPE STANDARD TABLE OF ty_filetab,
      lv_file_content TYPE xstring,
      lt_filetable    TYPE filetable.

PARAMETERS: p_fname TYPE rlgrap-filename.

AT SELECTION-SCREEN ON VALUE-REQUEST FOR p_fname.
  CALL METHOD cl_gui_frontend_services=>file_open_dialog
    EXPORTING
      window_title = 'Choose a file'
      file_filter  = 'PDF files (*.pdf)|*.pdf|'
    CHANGING
      file_table   = lt_filetable
      rc           = lv_rc.
  " Only take over a file name if the dialog returned one
  IF lt_filetable IS NOT INITIAL.
    p_fname = lt_filetable[ 1 ]-filename.
  ENDIF.

**********************************************************************
* Document Information Extraction class definition
CLASS zcl_die DEFINITION FINAL.
  PUBLIC SECTION.
    CONSTANTS: c_api_url  TYPE string VALUE 'https://aiservices-trial-dox.cfapps.us10.hana.ondemand.com',
               c_api_path TYPE string VALUE '/document-information-extraction/v1'.
    DATA: m_oauth           TYPE string,
          m_content_clients TYPE string.
    METHODS authenticate
      RETURNING VALUE(rv_authenticated) TYPE abap_bool.
    METHODS post_document
      IMPORTING iv_file_content TYPE xstring
      RETURNING VALUE(rv_job)   TYPE string.
    METHODS send_file
      IMPORTING iv_file_content TYPE xstring.
    METHODS get_status_job
      IMPORTING iv_job               TYPE string
      RETURNING VALUE(rv_status_job) TYPE string.
ENDCLASS.

**********************************************************************
* Document Information Extraction class implementation
CLASS zcl_die IMPLEMENTATION.

  METHOD authenticate.
    DATA lr_client TYPE REF TO if_http_client.

    " The SM59 destination carries host, port, client id and client secret
    CALL METHOD cl_http_client=>create_by_destination
      EXPORTING
        destination              = 'ZBTP_DOC_INF_EXT_OAUTH2'
      IMPORTING
        client                   = lr_client
      EXCEPTIONS
        argument_not_found       = 1
        destination_not_found    = 2
        destination_no_authority = 3
        plugin_not_active        = 4
        internal_error           = 5
        OTHERS                   = 6.
    IF sy-subrc = 0.
      " If you have the class cl_oauth2_client in your system check
      " note 3041322 or use following method
      lr_client->request->set_header_field( name  = if_http_header_fields_sap=>request_method
                                            value = 'POST' ).
      lr_client->request->set_header_field( name  = 'grant_type'
                                            value = 'client_credentials' ).
      lr_client->request->set_header_field( name  = if_http_header_fields_sap=>request_uri
                                            value = '/oauth/token?grant_type=client_credentials' ).
      lr_client->send( ).
      lr_client->receive( ).
      lr_client->response->get_status( IMPORTING code = DATA(lv_code) ).
      IF lv_code = '200'.
        " Cut the access token out of the JSON response
        DATA: rest TYPE string.
        DATA(l_content) = lr_client->response->get_cdata( ).
        SPLIT l_content AT '"access_token":"' INTO rest l_content.
        SPLIT l_content AT '"' INTO m_oauth rest.
        rv_authenticated = abap_true.
      ELSE.
        rv_authenticated = abap_false.
      ENDIF.
      lr_client->close( ).
    ENDIF.
  ENDMETHOD.

  METHOD post_document.
    DATA lr_client TYPE REF TO if_http_client.
    DATA lo_request_part TYPE REF TO if_http_entity.
    DATA lo_request_part2 TYPE REF TO if_http_entity.
    DATA lv_content_disposition TYPE string.
    DATA len TYPE i.
    DATA lv_options TYPE string.
    DATA: BEGIN OF ls_create_job_response,
            id            TYPE string,
            status        TYPE string,
            processedtime TYPE string,
          END OF ls_create_job_response.

    CLEAR rv_job.
    CALL METHOD cl_http_client=>create_by_url
      EXPORTING
        url                = c_api_url
      IMPORTING
        client             = lr_client
      EXCEPTIONS
        argument_not_found = 1
        plugin_not_active  = 2
        internal_error     = 3
        OTHERS             = 4.
    IF sy-subrc = 0.
      lr_client->request->set_header_field( name  = if_http_header_fields_sap=>request_method
                                            value = if_http_request=>co_request_method_post ).
      lr_client->request->set_header_field( name  = if_http_header_fields_sap=>request_uri
                                            value = |{ c_api_path }/document/jobs| ).
      lr_client->request->set_header_field( name  = 'Authorization'
                                            value = |Bearer { m_oauth }| ).
      lr_client->request->set_content_type( if_rest_media_type=>gc_multipart_form_data ).
      lr_client->request->if_http_entity~set_formfield_encoding(
        formfield_encoding = cl_http_request=>if_http_entity~co_encoding_raw ).
      lr_client->request->set_header_field( name  = 'Accept'
                                            value = if_rest_media_type=>gc_appl_json ).

      " First multipart: the extraction options as JSON
      lo_request_part2 = lr_client->request->add_multipart( ).
      lv_options = '{ "extraction": { "headerFields": [ "documentNumber", "purchaseOrderNumber", "grossAmount" ], "lineItemFields": [ "netAmount" ] },' &&
                   '"clientId": "default", "documentType": "invoice", "receivedDate": "2020-02-17", "enrichment": { "sender": { "top": 5, "type": ' &&
                   '"businessEntity", "subtype": "supplier" }, "employee": { "type": "employee" } }}'.
      lo_request_part2->set_header_field(
        name  = `Content-Disposition`                       "#EC NOTEXT
        value = |form-data; name="options"; type=application/json| ).
      lo_request_part2->set_cdata( EXPORTING data = lv_options ).

      " Second multipart: the PDF file itself
      lo_request_part = lr_client->request->add_multipart( ).
      lv_content_disposition = |form-data; name="file"; filename=sample-invoice.pdf |.
      lo_request_part->set_header_field(
        name  = `Content-Disposition`                       "#EC NOTEXT
        value = lv_content_disposition ).
      lo_request_part->set_content_type( if_rest_media_type=>gc_appl_pdf ).
      len = xstrlen( iv_file_content ).
      " Use the method parameter, not the global variable of the report
      lo_request_part->set_data( data   = iv_file_content
                                 offset = 0
                                 length = len ).

      lr_client->send( ).
      lr_client->receive( ).
      DATA(l_content_clients) = lr_client->response->get_cdata( ).
      /ui2/cl_json=>deserialize( EXPORTING json        = l_content_clients
                                           pretty_name = /ui2/cl_json=>pretty_mode-camel_case
                                 CHANGING  data        = ls_create_job_response ).
      lr_client->response->get_status( IMPORTING code = DATA(lv_code) ).
      IF lv_code = '201'.
        rv_job = ls_create_job_response-id.
      ENDIF.
      lr_client->close( ).
    ENDIF.
  ENDMETHOD.

  METHOD get_status_job.
    DATA lr_client TYPE REF TO if_http_client.
    DATA lv_status_job TYPE string.
    DATA l_json_response TYPE string.
    DATA: lr_data TYPE REF TO data.

    CLEAR rv_status_job.
    CALL METHOD cl_http_client=>create_by_url
      EXPORTING
        url                = c_api_url
      IMPORTING
        client             = lr_client
      EXCEPTIONS
        argument_not_found = 1
        plugin_not_active  = 2
        internal_error     = 3
        OTHERS             = 4.
    IF sy-subrc = 0.
      lr_client->request->set_header_field( name  = if_http_header_fields_sap=>request_method
                                            value = if_http_request=>co_request_method_get ).
      lr_client->request->set_header_field( name  = if_http_header_fields_sap=>request_uri
                                            value = |{ c_api_path }/document/jobs/{ iv_job }| ).
      lr_client->request->set_header_field( name  = 'Authorization'
                                            value = |Bearer { m_oauth }| ).
      lr_client->send( ).
      lr_client->receive( ).
      l_json_response = lr_client->response->get_cdata( ).
      /ui2/cl_json=>deserialize( EXPORTING json        = l_json_response
                                           pretty_name = /ui2/cl_json=>pretty_mode-camel_case
                                 CHANGING  data        = lr_data ).
      lr_client->response->get_status( IMPORTING code = DATA(lv_code) ).
      IF lv_code = '200'.
        /ui2/cl_data_access=>create( ir_data      = lr_data
                                     iv_component = `STATUS` )->value( IMPORTING ev_data = lv_status_job ).
        IF lv_status_job = 'DONE'.
          " Write the three header fields that were requested
          DATA: l_field_name TYPE string,
                l_value      TYPE string,
                i            TYPE i.
          i = 1.
          WHILE i < 4.
            /ui2/cl_data_access=>create( ir_data      = lr_data
                                         iv_component = |EXTRACTION-HEADER_FIELDS[{ i }]-NAME| )->value( IMPORTING ev_data = l_field_name ).
            /ui2/cl_data_access=>create( ir_data      = lr_data
                                         iv_component = |EXTRACTION-HEADER_FIELDS[{ i }]-VALUE| )->value( IMPORTING ev_data = l_value ).
            WRITE:/ l_field_name, l_value.
            i = i + 1.
          ENDWHILE.
        ENDIF.
        " Return the status reported by the service, so that the
        " caller's loop also stops when the job itself has failed
        rv_status_job = lv_status_job.
      ELSE.
        rv_status_job = 'FAILED'.
      ENDIF.
      lr_client->close( ).
    ENDIF.
  ENDMETHOD.

  METHOD send_file.
    DATA: l_job        TYPE string,
          l_status_job TYPE string.

    l_job = post_document( iv_file_content ).
*   l_job = '1ad442aa-46dc-4e84-8344-d024ec516a18'.
    IF l_job IS NOT INITIAL.
      " Poll every 3 seconds until the job is done or has failed
      l_status_job = get_status_job( iv_job = l_job ).
      WHILE l_status_job <> 'DONE' AND l_status_job <> 'FAILED'.
        WAIT UP TO 3 SECONDS.
        l_status_job = get_status_job( iv_job = l_job ).
      ENDWHILE.
    ENDIF.
  ENDMETHOD.

ENDCLASS.

START-OF-SELECTION.
  IF p_fname IS NOT INITIAL.
    " Convert file to binary format
    CALL METHOD cl_gui_frontend_services=>gui_upload
      EXPORTING
        filename   = CONV #( p_fname )
        filetype   = 'BIN'
      IMPORTING
        filelength = DATA(lv_input_len)
      CHANGING
        data_tab   = lt_file.

    " Convert file to XSTRING
    CALL FUNCTION 'SCMS_BINARY_TO_XSTRING'
      EXPORTING
        input_length = lv_input_len
      IMPORTING
        buffer       = lv_file_content
      TABLES
        binary_tab   = lt_file.

    lr_die = NEW zcl_die( ).
    IF lr_die->authenticate( ) = abap_true.
      lr_die->send_file( lv_file_content ).
    ENDIF.
  ENDIF.
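
The authenticate method in the program picks the token out of the response with two SPLIT statements. As an alternative, here is a small sketch that reads the same field with /ui2/cl_json, the class the program already uses for the job responses. The report name, the structure ty_token_response and the sample payload are made up for illustration; in the real program the JSON would come from lr_client->response->get_cdata( ).

```abap
REPORT zdie_token_parse_sketch.

" Assumption: only the fields we care about from the token response
TYPES: BEGIN OF ty_token_response,
         access_token TYPE string,
         token_type   TYPE string,
         expires_in   TYPE i,
       END OF ty_token_response.

DATA ls_token TYPE ty_token_response.

START-OF-SELECTION.
  " Made-up sample payload standing in for the real /oauth/token response
  DATA(lv_json) = `{"access_token":"abc123","token_type":"bearer","expires_in":43199}`.

  " JSON names are matched to the structure components by name
  /ui2/cl_json=>deserialize( EXPORTING json = lv_json
                             CHANGING  data = ls_token ).

  IF ls_token-access_token IS NOT INITIAL.
    WRITE: / 'Token:', ls_token-access_token.
  ELSE.
    WRITE: / 'No access token found in the response'.
  ENDIF.
```

Reading the token through a typed structure avoids depending on the exact position of the field in the response, and an empty access_token gives a simple way to detect an unexpected answer.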

For testing we can use the following invoice from the missions. If we run the program with that PDF, after a few seconds we get the following output:

We can verify in the Document Information Extraction UI that the extracted data is correct.

With that you can automate the process of scanning documents like invoices, check whether an invoice has a purchase order number so it can be matched with the purchase order, and many other options, all in just an ABAP program.

Best Regards

Jose Muñoz