Order Table Extraction

In this example, we are a company that sells many different products. A customer sends us a small table listing the items they want to buy. Because the customer creates the table, its layout can differ slightly every time: they may write Product ID instead of product_id or Qty instead of quantity, and the shipping times follow no consistent format.

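| Product ID  | Qty     | Unit price | Manufacturar | Shipping Time |
|-------------|---------|------------|--------------|---------------|
| TZX22-EHZ2  | 100     | 2.76$      | Tusp         | 1-2 days      |
| ZUI23-772L6 | 250     | 5.00$      |              | 1 week        |
| UIUU-13BMW  | 340'000 | 0.001$     | puma         | About a month |
| QUE2-AIME2  |         | 45.56$     | lebra        | Tomorrow      |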

OrderTable.docx (7KB)

Waveline Extract makes it easy to unify these fields into one format we define.

Let's construct a Shape to extract product_id, unit_price, quantity, and shipping_time for each product:

[
  {
    "name": "products",
    "type": "object",
    "description": "All products from the table",
    "isArray": true,
    "elements": [
      {
        "name": "product_id",
        "type": "string",
        "description": "The id of that product. aka product number",
        "isArray": false
      },
      {
        "name": "quantity",
        "type": "number",
        "description": "Quantity of how many units. Aka Qty",
        "isArray": false
      },
      {
        "name": "unit_price",
        "type": "number",
        "description": "Unit price of that product in dollars",
        "isArray": false
      },
      {
        "name": "shipping_time",
        "type": "string",
        "description": "Time it takes to ship this product. (In days)",
        "isArray": false
      }
    ]
  }
]
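Writing the Shape by hand gets repetitive as the number of fields grows. As a minimal sketch, the same structure could be built in Python with a small helper. Note that `field` is a hypothetical helper written for this example, not part of any Waveline client library; the keys it emits simply mirror the JSON above.

```python
import json

def field(name, type_, description, is_array=False, elements=None):
    """Build one Shape entry using the keys shown in the JSON above."""
    entry = {
        "name": name,
        "type": type_,
        "description": description,
        "isArray": is_array,
    }
    if elements is not None:
        # Only object fields carry nested elements.
        entry["elements"] = elements
    return entry

shape = [
    field(
        "products", "object", "All products from the table", is_array=True,
        elements=[
            field("product_id", "string", "The id of that product. aka product number"),
            field("quantity", "number", "Quantity of how many units. Aka Qty"),
            field("unit_price", "number", "Unit price of that product in dollars"),
            field("shipping_time", "string", "Time it takes to ship this product. (In days)"),
        ],
    )
]

# Serialize to the JSON text that goes into the request's "shape" field.
print(json.dumps(shape, indent=2))
```

Keeping the field definitions in code like this makes it easier to reuse the same Shape across requests.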

We can now call the /extract-document endpoint with this shape and the table as the payload to create the job:

curl -X POST "https://waveline.ai/api/v1/extract-document" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
          "fileName": "OrderTable.pdf",
          "contentType": "application/pdf",
          "base64Content": "JVBERi0xLjMKMSAwIG9iago8PC9UeXBlL0NhdGF...",
          "shape": YOUR_SHAPE
        }'
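The same request body can also be assembled in Python. The sketch below only builds the JSON payload, base64-encoding the file bytes, and does not send anything. `build_payload` is a hypothetical helper, the sample bytes and shape are placeholders, and the key names are taken from the curl example above.

```python
import base64
import json

def build_payload(file_name, content_type, file_bytes, shape):
    """Assemble the request body with the keys used in the curl example."""
    return {
        "fileName": file_name,
        "contentType": content_type,
        # The curl example sends the document as base64 text.
        "base64Content": base64.b64encode(file_bytes).decode("ascii"),
        "shape": shape,
    }

# Placeholder bytes standing in for a real PDF read with open(path, "rb").
sample_bytes = b"%PDF-1.3\n"
# Placeholder shape; in practice this is the full Shape defined earlier.
sample_shape = [{"name": "products", "type": "object",
                 "description": "All products from the table",
                 "isArray": True, "elements": []}]

payload = build_payload("OrderTable.pdf", "application/pdf", sample_bytes, sample_shape)
body = json.dumps(payload)  # JSON string to use as the POST body
print(payload["base64Content"])
```

The resulting `body` string can be sent with any HTTP client, using the same Content-Type and Authorization headers as the curl call.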

After some time, we query for the result of this job with the job endpoint and get the following in the result field:

{
  "products": [
    {
      "product_id": "TZX22-EHZ2",
      "quantity": 100,
      "unit_price": 2.76,
      "shipping_time": "1-2 days"
    },
    {
      "product_id": "ZUI23-772L6",
      "quantity": 250,
      "unit_price": 5,
      "shipping_time": "1 week"
    },
    {
      "product_id": "UIUU-13BMW",
      "quantity": 340000,
      "unit_price": 0.001,
      "shipping_time": "About a month"
    },
    {
      "product_id": "QUE2-AIME2",
      "quantity": 56,
      "unit_price": 45.56,
      "shipping_time": "Tomorrow"
    }
  ]
}
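Before passing a result downstream, it can be worth checking that each product actually matches the Shape. The following is a minimal sketch of such a check. `check_products` and the `EXPECTED` mapping are hypothetical names written for this example, and the second sample record is deliberately malformed to show what a mismatch looks like.

```python
# Expected Python type for each field, mirroring the Shape's "string"/"number" types.
EXPECTED = {
    "product_id": str,
    "quantity": (int, float),
    "unit_price": (int, float),
    "shipping_time": str,
}

def check_products(result):
    """Return (row index, field, problem) tuples for fields that do not match."""
    problems = []
    for i, product in enumerate(result.get("products", [])):
        for name, expected_type in EXPECTED.items():
            if name not in product:
                problems.append((i, name, "missing"))
                continue
            value = product[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected_type):
                problems.append((i, name, "wrong type"))
    return problems

sample = {
    "products": [
        {"product_id": "TZX22-EHZ2", "quantity": 100,
         "unit_price": 2.76, "shipping_time": "1-2 days"},
        # Illustrative bad row: quantity is a string and shipping_time is absent.
        {"product_id": "ZUI23-772L6", "quantity": "250", "unit_price": 5},
    ]
}

problems = check_products(sample)
print(problems)  # [(1, 'quantity', 'wrong type'), (1, 'shipping_time', 'missing')]
```

An empty list means every product carries all four fields with the expected types.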

As the result shows, every row of the customer's table has been mapped to the field names we defined in the Shape.

