Searching for specific textual patterns

Depending on your research question, it can be very helpful to run basic textual pattern searches on your data before you deploy complex machine learning or AI-based NLP tools. In some studies you may not need complex NLP tools at all, because searching for textual patterns in your dataset already yields the statistics you need. This can be preferable to more complex research methods: AI-based tools always behave somewhat unpredictably, which adds uncertainty to your results.

Even when textual pattern searches cannot answer all your questions, they are a good way to get to know your data better. Do some keywords appear more often than others? Do some keywords often appear in the same document or passage? What do you need to search for to find what you need? Such questions are helpful while you are still designing the analysis that should answer your research question.
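As a rough illustration of this kind of exploration, the sketch below counts in how many documents each keyword occurs and how often two keywords appear in the same document. The sample documents, the keyword list and the helper name keywords_in are made up for this example, and the word splitting is deliberately naive.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus; in practice these would be your own documents.
documents = [
    "De grondwet beschermt de vrijheid van meningsuiting.",
    "De rechter toetst de wet niet aan de grondwet.",
    "Het verdrag gaat boven de wet en de grondwet.",
]
keywords = ["grondwet", "wet", "verdrag", "rechter"]


def keywords_in(text, keywords):
    """Return the set of keywords that occur as whole words in the text."""
    # Naive tokenisation: lowercase, drop full stops, split on whitespace.
    words = set(text.lower().replace(".", " ").split())
    return {keyword for keyword in keywords if keyword in words}


doc_frequency = Counter()   # in how many documents does a keyword occur?
co_occurrence = Counter()   # how often do two keywords share a document?
for text in documents:
    found = keywords_in(text, keywords)
    doc_frequency.update(found)
    co_occurrence.update(combinations(sorted(found), 2))

print(doc_frequency.most_common())
print(co_occurrence.most_common(3))
```

Counting per document rather than per occurrence is a design choice here: it prevents one long document that repeats a keyword many times from dominating the overview.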

Below we discuss a few approaches to search for specific textual patterns.

Searching using existing search engines

If the dataset is available online, you might be able to search it using an existing search engine such as Google. With Google Advanced Search, you can restrict your query to a specific website, so that you only get results from the website that hosts the dataset.
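The same restriction can be typed directly into the search box with the site: operator. The sketch below only builds such a search URL so you can open it in a browser; the function name google_site_query is made up for this example, and the site and phrase are example values.

```python
from urllib.parse import urlencode


def google_site_query(site, phrase):
    """Build a Google search URL for an exact phrase, limited to one website."""
    # site: limits results to one website; double quotes ask for the exact phrase.
    query = f'site:{site} "{phrase}"'
    return "https://www.google.com/search?" + urlencode({"q": query})


url = google_site_query("uitspraken.rechtspraak.nl", "grondwet")
print(url)
```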

Besides this, a dataset might already have its own search tool. For example, all published Dutch court decisions can be searched through uitspraken.rechtspraak.nl. If your dataset is available via an API, this API might also provide search functionality. Be sure to consult the API documentation.

These kinds of search engines generally allow you to match specific words or phrases, but they usually lack features for searching for more general textual patterns. For this, something more advanced, such as regular expressions, can help.

Regular expressions

Regular expressions (Wikipedia), also abbreviated as regex or regexp, describe a textual pattern. Many programming languages and tools have built-in support for finding matches of such expressions. Regular expressions allow you to search for far more complicated patterns of text within a document than plain keywords do. For example, when studying the role of the Dutch constitution in parliamentary debates, we used the following pattern:

r"\w*grondwet\w*|\w*constituti\w*"

With case-insensitive matching enabled, this pattern matches any variation on ‘grondwet’, ‘constitutie’ or ‘constitutioneel’, such as:

ongrondwettelijk
grondwettelijk
Grondwet
constitutie
constitutionele
grondwetstoetsing
constitutioneel
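As a minimal sketch of how this pattern could be applied in Python, the example below runs it over a made-up sample string using the standard re module. The re.IGNORECASE flag is an assumption on our part: the pattern itself is written in lowercase, so without this flag the capitalised ‘Grondwet’ is not matched.

```python
import re

PATTERN = r"\w*grondwet\w*|\w*constituti\w*"

# Made-up sample text built from the example words above.
text = "Grondwet, ongrondwettelijk, constitutionele toetsing en grondwetstoetsing"

# Case-insensitive search: also finds the capitalised 'Grondwet'.
insensitive = re.findall(PATTERN, text, flags=re.IGNORECASE)
print(insensitive)
# ['Grondwet', 'ongrondwettelijk', 'constitutionele', 'grondwetstoetsing']

# Default, case-sensitive search: 'Grondwet' is missed.
sensitive = re.findall(PATTERN, text)
print(sensitive)
# ['ongrondwettelijk', 'constitutionele', 'grondwetstoetsing']
```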

To get some experience with regular expressions, you can try the free and interactive online course by RegexOne. Registration is not required and you can get started immediately. Regular expressions are also introduced in the WetSuite NLP Crash Course. Part 3 introduces regular expressions in Python.

One important thing to know about regular expressions is that their implementations differ between programming languages and tools. Always check the documentation of the language or tool you are using for the precise syntax and behavior of the regular expressions you develop.
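One concrete example of such a difference is the meaning of \w. In Python 3, \w matches Unicode letters by default when searching text strings, whereas some other engines (JavaScript's, for instance) only match ASCII letters, digits and the underscore. The sketch below imitates the ASCII-only behavior in Python with the re.ASCII flag; the sample word is chosen for illustration.

```python
import re

word = "co\u00f6rdinatie"  # the Dutch word "coördinatie"

# Python 3 default: \w also matches letters with diacritics.
unicode_matches = re.findall(r"\w+", word)
print(unicode_matches)  # ['coördinatie']

# ASCII-only \w, as in some other regex engines: the word is split at 'ö'.
ascii_matches = re.findall(r"\w+", word, flags=re.ASCII)
print(ascii_matches)  # ['co', 'rdinatie']
```

For Dutch legal texts this matters in practice, since a pattern such as \w* may silently stop at an accented character in one tool and continue past it in another.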

Parsing exact patterns

While regular expressions can be really useful, sometimes the patterns you are looking for are too complex, or you want to extract specific information from the pattern. For example, if you would like to find citations to Dutch parliamentary documents (Kamerstukken), you might want to not only find the citation, but also extract which file the citation refers to, and which document within that file. While such a search might technically be possible using regular expressions, there is a fair chance that you get completely lost in an immense expression. In such cases, it can be a far better approach to define a grammar and use a parser to parse the textual patterns you are looking for. One example of such a parser-based approach is the nllegalcit library, which was developed by Martijn Staal for his bachelor thesis to research the usage of parliamentary documents in legal practice (🇳🇱).
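To give an impression of what a grammar-and-parser approach looks like, here is a minimal hand-written sketch. It is not the API or the grammar of nllegalcit. It assumes one simplified citation form, such as "Kamerstukken II 2019/20, 35300, nr. 5", and all names in it (Citation, ParseError, tokenize, parse_citation) are made up for this example. Real citations come in many more variants.

```python
import re
from dataclasses import dataclass

# Simplified, assumed grammar (real citations have many more variants):
#   citation := "Kamerstukken" chamber years "," dossier "," "nr." document
#   chamber  := "I" | "II"
#   years    := DIGITS "/" DIGITS
TOKEN = re.compile(r"\s*(Kamerstukken|nr\.|II|I|\d+|/|,)")


@dataclass
class Citation:
    chamber: str
    years: str
    dossier: str
    document: str


class ParseError(ValueError):
    """Raised when the text does not follow the assumed grammar."""


def tokenize(text):
    """Split the text into the tokens that the grammar knows about."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ParseError(f"unexpected text at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_citation(text):
    """Parse one citation string into its parts, following the grammar."""
    tokens = tokenize(text)
    pos = 0

    def expect(pattern):
        # Consume the next token if it fits the pattern, otherwise fail.
        nonlocal pos
        if pos >= len(tokens) or not re.fullmatch(pattern, tokens[pos]):
            raise ParseError(f"expected {pattern!r} at token {pos}")
        pos += 1
        return tokens[pos - 1]

    expect(r"Kamerstukken")
    chamber = expect(r"II?")
    years = expect(r"\d+") + expect(r"/") + expect(r"\d+")
    expect(r",")
    dossier = expect(r"\d+")
    expect(r",")
    expect(r"nr\.")
    document = expect(r"\d+")
    if pos != len(tokens):
        raise ParseError("unexpected tokens after the citation")
    return Citation(chamber, years, dossier, document)


citation = parse_citation("Kamerstukken II 2019/20, 35300, nr. 5")
print(citation.dossier, citation.document)  # 35300 5
```

The advantage over a single large regular expression is that each rule of the grammar stays small and readable, and a failed parse tells you which part of the citation was unexpected. Note that this sketch only parses a string that is already a citation; finding citations inside running text requires more work.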

What to do with your search results

You extract textual patterns with a research goal in mind. These goals generally fall into two categories: extracting metadata from documents, or gathering statistics.

Extract metadata

Some textual data which you extract from a document can be valuable metadata, which you can then use in further processing or research. For example, you might extract the publication date or references to other documents.
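As a sketch of extracting references to other documents, the example below finds ECLI identifiers (the identifiers used for court decisions) in a text and splits them into named fields. The pattern is a simplified assumption rather than the full official ECLI format, and the sample sentence and the function name extract_references are made up for this example.

```python
import re

# Simplified ECLI pattern (an assumption for illustration, not the full
# official format): ECLI:<country>:<court>:<year>:<number>
ECLI = re.compile(
    r"ECLI:(?P<country>[A-Z]{2}):(?P<court>[A-Z0-9]+):"
    r"(?P<year>\d{4}):(?P<number>[A-Z0-9]+)"
)


def extract_references(text):
    """Return one dict of named fields for every ECLI found in the text."""
    return [match.groupdict() for match in ECLI.finditer(text)]


sample = "Zie ECLI:NL:HR:2019:2006 en ECLI:NL:RBDHA:2015:7145."
references = extract_references(sample)
for reference in references:
    print(reference["court"], reference["year"], reference["number"])
```

Named groups make the result usable as metadata straight away, for example to group documents by the court or the year they refer to.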

Statistical purposes

You can also compile the results of your search for textual patterns into numeric results. For example, how many ‘matches’ of your textual pattern did you find per document? This can help you to identify outlier documents, or to find trends or correlations (when compared with other data about your dataset).
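A minimal sketch of such a count per document, reusing the pattern from the regular expressions section, could look as follows. The documents and their identifiers are made up, and comparing with the mean is only one simple, assumed way to flag documents that stand out.

```python
import re
from statistics import mean

PATTERN = re.compile(r"\w*grondwet\w*|\w*constituti\w*", re.IGNORECASE)

# Hypothetical documents keyed by a made-up identifier.
documents = {
    "doc-1": "De Grondwet en de grondwettelijke toetsing van een constitutionele norm.",
    "doc-2": "Dit debat ging over de begroting.",
    "doc-3": "Is dit voorstel ongrondwettelijk?",
}

# Number of matches of the pattern per document.
counts = {doc_id: len(PATTERN.findall(text)) for doc_id, text in documents.items()}
print(counts)  # {'doc-1': 3, 'doc-2': 0, 'doc-3': 1}

# Flag documents with more matches than average as candidates for a closer look.
average = mean(counts.values())
above_average = [doc_id for doc_id, n in counts.items() if n > average]
print(above_average)  # ['doc-1']
```

With a real dataset you would typically also normalise for document length, since long documents naturally contain more matches.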

Last updated: 12-Nov-2024