LLMs: Data Extraction is the Real Cool Use Case

Less discussed than other applications, it is a game-changer

Oct 26, 2023

Why

Being able to extract organized information from a mass of text is like discovering an untouched treasure.

Large Language Models have captured the interest of many because they facilitate text generation and semantic retrieval at unprecedented levels. This article highlights a less glamorous but equally thrilling use case—one that I'm personally super excited about.

Over the last 40 years, we've become adept at using, managing, and analyzing structured text through SQL, and in the past 10-15 years we've gotten fairly good at dealing with semi-structured text, like JSON files, using NoSQL databases. Unstructured text represents the next frontier, an untapped data source that's likely more expansive than the other two combined.

How

There are three main categories for harnessing this untapped information.

  1. Summarization: This involves going through large amounts of text that are impossible to retain in memory, sometimes even to read in a single lifetime. To be more precise, summarization keeps the text unstructured but transforms it from a volume that is difficult for humans to process into a more manageable amount.

  2. Categorization: Also known as clustering or classification, this process assigns tags to data. A tag could mark a customer review as positive or negative, assign a topic to a document or news article, label characteristics like the age, gender, or profession of an interviewee, or record the location of a job candidate.

  3. Structurization: This is the transformation of text into structured (like SQL tables) or semi-structured (such as JSON or YAML) formats. Once the text is transformed, you can apply familiar analyses, like running an SQL query over it.

Are you excited? Or do you feel that was too abstract? Let's dive into some examples!

Interview Analysis for a Non-Profit

Problem

We helped a non-profit dive into a massive amount of text from interviews, tens to hundreds of thousands of them across different projects. Once transcribed, these interviews became a huge pile of words waiting to be made sense of. Going through all of it by hand to find the pain points people talked about required a lot of time. We wanted to automatically pull these issues out of the conversations, so that none was missed by analysts and so that we could classify them into categories.

It is important to notice that "pain point" is a very general concept. Nevertheless, LLMs can extract this information.

Approach

LLMs are remarkably good at picking through text and spotting the tough points raised in the interviews. We set out to build a list of these issues, pinpointing exactly where each was mentioned.

Let's look at an example to make this clearer:

Interview Excerpt: "...It's been quite the journey working here. I really appreciate how well we all get along. But sometimes we hit a snag when we don't have what we need, which can stop us from coming up with cool new stuff. And, you know, the way we're all split up into different teams can make it tricky to get things done on time..."

Output:

  • "issue": "Not enough resources"

    "citation": "...sometimes we hit a snag when we don't have what we need, which can stop us from coming up with cool new stuff..."


  • "issue": "Team separation"

    "citation": "...the way we're all split up into different teams can make it tricky to get things done on time..."

It is not trivial to get this format out of LLMs. It requires either advanced parsing of the generated text, fine-tuning of the models, or a combination of OpenAI's function calling and Pydantic.
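
To make the validation step concrete, here is a minimal standard-library sketch: it parses a hypothetical JSON response into typed records and fails loudly if the model strayed from the requested schema. The names `PainPoint` and `parse_llm_output`, and the example response, are ours for illustration; in practice a Pydantic model plus function calling does this more robustly.

```python
import json
from dataclasses import dataclass

@dataclass
class PainPoint:
    issue: str
    citation: str

def parse_llm_output(raw: str) -> list[PainPoint]:
    """Turn the JSON an LLM was asked to produce into typed records.

    Raises (KeyError / json.JSONDecodeError) if the model deviated
    from the requested schema, so bad outputs never reach the analysts.
    """
    items = json.loads(raw)
    return [PainPoint(issue=i["issue"], citation=i["citation"]) for i in items]

# Hypothetical model response, matching the schema requested in the prompt.
raw = '''[
  {"issue": "Not enough resources",
   "citation": "...sometimes we hit a snag when we don't have what we need..."},
  {"issue": "Team separation",
   "citation": "...the way we're all split up into different teams..."}
]'''

points = parse_llm_output(raw)
```

The same schema can be expressed as a Pydantic model and passed to the API as a function definition, which pushes most of this validation into the request itself.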

The above would probably be considered an example of an unstructured to semi-structured transformation. Unstructured to structured (SQL-table-like) is also possible. Using similar techniques, you could, for example, ask an LLM to automatically generate a table with NAME, ROLE, IS_POSITIVE, and SCANDAL_LEVEL columns (or whatever is important to you) out of news articles.
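
A short sketch of the end of that pipeline, assuming the extraction step has already produced rows (the two example rows below are invented): once the LLM output is tabular, it drops straight into an ordinary SQL workflow.

```python
import sqlite3

# Rows as a hypothetical LLM might return them after structured
# extraction from a batch of news articles.
rows = [
    ("Jane Smith", "Mayor", 1, 0),
    ("John Brown", "CFO", 0, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE people (
    NAME TEXT, ROLE TEXT, IS_POSITIVE INTEGER, SCANDAL_LEVEL INTEGER)""")
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", rows)

# Familiar analysis over what used to be unstructured text:
flagged = conn.execute(
    "SELECT NAME FROM people WHERE SCANDAL_LEVEL > 2").fetchall()
```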

More Concrete Examples

Warning: some of the below is technical. For example, if you don't know what a knowledge graph is, feel free to skip that section.

Text Classification

  • Problem: Sorting public feedback into predefined categories.

  • Input: "The newly constructed public library has become a beacon of knowledge in our community. However, the lack of parking space is a concern."

  • Output: one of Positive, Negative, or Neutral
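
A zero-shot prompt is often all this takes. Below is a minimal sketch of the prompt-building side (the label set and wording are our assumptions, and the actual model call is left out since any chat-completion API will do):

```python
LABELS = ["Positive", "Negative", "Neutral"]

def classification_prompt(text: str) -> str:
    """Build a zero-shot classification prompt for a chat-completion API.

    Constraining the answer to an exact label set makes the response
    trivial to parse downstream.
    """
    return (
        "Classify the following feedback as exactly one of "
        f"{', '.join(LABELS)}.\n\n"
        f"Feedback: {text}\n"
        "Label:"
    )

prompt = classification_prompt(
    "The newly constructed public library has become a beacon of knowledge."
)
```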


Citations Retrieval

  • Problem: Sifting through city council transcripts for exact legal citations. Similar to the non-profit example above.

  • Input: "...pursuant to the provisions stipulated in Article 45(3) of the Urban Planning Act, the council hereby approves the construction..."

  • Output: Citation: Article 45(3) of the Urban Planning Act
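
For comparison, here is what a non-LLM baseline looks like: a regex covering just this one citation style (the pattern is our assumption). It is useful for spot-checking LLM output, but unlike the model it cannot generalize to formats it wasn't written for.

```python
import re

# Matches citations of the form "Article 45(3) of the Urban Planning Act".
# A deliberately narrow baseline; an LLM handles variants a regex misses.
CITATION_RE = re.compile(r"Article \d+\(\d+\) of the [A-Z][\w ]*Act")

transcript = ("...pursuant to the provisions stipulated in Article 45(3) "
              "of the Urban Planning Act, the council hereby approves...")
citations = CITATION_RE.findall(transcript)
```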


PII Data Sanitization

  • Problem: Shielding sensitive information within public documents.

  • Input: "Case Report: Subject: John Doe, DOB: 01/01/1980, SSN: 123-45-6789, reported a theft incident on 12/12/2023. The subject reported …"

  • Output: Sanitized Text

    "Case Report: Subject: [Redacted], DOB: [Redacted], SSN: [Redacted], reported a theft incident on 12/12/2023. The subject reported…"


Entity Extraction and Resolution

  • Problem: Discerning and resolving entities from public documents.

  • Input: "Complainant Jane Doe, residing at 456 Elm St, reported a noise complaint on 11/11/2023 against the nightclub at 789 Oak St."

  • Output: Entities

    {Name: Jane Doe, Address: 456 Elm St, Date: 11/11/2023}

    {Name: Nightclub, Address: 789 Oak St}


Knowledge Graphs Generation

  • Problem: Creating a knowledge graph to showcase the history and cultural importance of a renowned museum.

  • Input: "The Louvre, originally a royal palace, is the world's largest art museum and a historic monument in Paris, France. It is most famous for being the home of the Mona Lisa, a seminal work of Renaissance art. The museum also houses a vast collection of Egyptian antiquities and ancient Greek and Roman sculptures."

  • Output: Knowledge Graph

    {Node: The Louvre, Edge: Function, Node: Art Museum},

    {Node: The Louvre, Edge: Location, Node: Paris, France},

    {Node: The Louvre, Edge: Previous Function, Node: Royal Palace},

    {Node: The Louvre, Edge: Contains, Node: Mona Lisa},

    {Node: Mona Lisa, Edge: Era, Node: Renaissance},

    {Node: The Louvre, Edge: Collection, Node: Egyptian Antiquities},

    {Node: The Louvre, Edge: Collection, Node: Greek Sculptures},

    {Node: The Louvre, Edge: Collection, Node: Roman Sculptures}.
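
Once the LLM has emitted triples like the above, they are ordinary data. A minimal sketch (representing the graph as a plain list of tuples, before loading it into a real graph store) shows how queryable the result already is:

```python
# Triples as extracted from the Louvre passage: (subject, edge, object).
triples = [
    ("The Louvre", "Function", "Art Museum"),
    ("The Louvre", "Location", "Paris, France"),
    ("The Louvre", "Previous Function", "Royal Palace"),
    ("The Louvre", "Contains", "Mona Lisa"),
    ("Mona Lisa", "Era", "Renaissance"),
    ("The Louvre", "Collection", "Egyptian Antiquities"),
    ("The Louvre", "Collection", "Greek Sculptures"),
    ("The Louvre", "Collection", "Roman Sculptures"),
]

def neighbors(node: str, edge: str) -> list[str]:
    """All nodes reachable from `node` via edges labeled `edge`."""
    return [obj for subj, e, obj in triples if subj == node and e == edge]

collections = neighbors("The Louvre", "Collection")
```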


Hopefully you are now excited about this LLM superpower that few people talk about. If you have an idea for a process in your company, or simply would like to discuss the topic, feel free to reach out to us at info@duenders.com!

