Complex Geoscience Knowledge Graphs from Unstructured Text

The OpportunityFinder® algorithm automatically produces complex geoscience Knowledge Graph networks from unstructured text.

The algorithm uses ‘DNA profiling inspired’ techniques to populate the graph.

This enables interesting patterns and new knowledge to be surfaced that are beyond the reach of other approaches.

To find out more, or to arrange a presentation or pilot, please contact:

OpportunityFinder update – digitally transforming Geoscience Opportunity Generation workflows using Natural Language Processing (NLP)

OpportunityFinder Milestone: the geoscience Python algorithm is now trained on 25,000 terms and phrases chosen specifically to identify clues for source rock, maturation, migration, reservoir, hydrocarbon occurrence, trap and seal in unstructured text (reports, presentations and papers).

Its one-of-a-kind pattern-based discovery method uses these clues to assist the Geoscientist and surface possible leads, opportunities, analogues and plays that may have been overlooked.

OpportunityFinder®: A codebreaker for geoscience unstructured text


Image: Bletchley Park Bombe (replica of the original Bombe). Credit: Antoine Taveneaux, CC BY-SA 3.0.

Over the past few years, geoscience and data science knowledge was used to label over one million diverse geoscience sentences from public domain Internet sources (papers, reports, presentations etc.).

The purpose was to identify clues for source rock, maturation, migration, hydrocarbon occurrence, reservoir, trap and seal as mentioned in unstructured text, in such a way that they could be used for automated inference. This included obvious, explicit terms and phrases as well as more subtle, non-obvious textual clues.
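As a rough illustration of the explicit-term side of this labelling, the sketch below tags a sentence with play elements by matching it against a small lexicon. It is a minimal sketch and not the OpportunityFinder® implementation: the lexicon entries, the function names `compile_lexicon` and `tag_sentence`, and the example sentence are all assumptions made for illustration, and simple matching like this does not capture the subtler, non-obvious clues described above.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: a few illustrative terms per play element.
# A real lexicon would be far larger; these entries are assumptions.
LEXICON = {
    "source_rock": ["organic-rich shale", "source rock", "kerogen"],
    "maturation": ["oil window", "vitrinite reflectance"],
    "migration": ["migration pathway", "gas chimney"],
    "reservoir": ["porous sandstone", "reef limestone"],
    "hydrocarbon_occurrence": ["oil seep", "gas show", "pockmarks"],
    "trap": ["anticline", "fault block"],
    "seal": ["thick salt", "cap rock"],
}


def compile_lexicon(lexicon):
    """Build one case-insensitive, whole-word regex per play element."""
    compiled = {}
    for element, terms in lexicon.items():
        # Longest terms first so longer phrases are preferred over shorter ones.
        ordered = sorted(terms, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
        compiled[element] = re.compile(pattern, re.IGNORECASE)
    return compiled


def tag_sentence(sentence, compiled):
    """Return {play element: [matched terms]} for one sentence."""
    hits = defaultdict(list)
    for element, regex in compiled.items():
        for match in regex.finditer(sentence):
            hits[element].append(match.group(0).lower())
    return dict(hits)


COMPILED = compile_lexicon(LEXICON)
example = "A gas show was recorded in reef limestone beneath thick salt."
tags = tag_sentence(example, COMPILED)
print(tags)
```

Each tagged sentence then becomes a small structured record (element plus supporting term) that later steps can count, combine and map.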

These data are used with Natural Language Processing, Machine Learning and a first-of-a-kind ‘DNA inspired’ method to create a predictive classifier. The algorithm (OpportunityFinder®) can surface non-obvious patterns of interest that may be useful to an Exploration Geoscientist.

These patterns may be contained within any repository of reports, documents or text that is too large for a person ever to read. The repository may include old hardcopy reports that have since been scanned or digitized, documents in different languages, and material both external and internal to an organization. The resulting patterns, which are surfaced from trillions of permutations, can be displayed in time and space to assist the Geoscientist with discovery and ideation.

The image below in Fig 1 is a simulation of data extracted from a large body of reports and geo-referenced using OpportunityFinder®. The pie-charts represent the differing elements that have been discovered in text (e.g. potential trap/seal clues in green).


Fig 1 – Thematic play elements from text (public domain WMS Basins data) 

The pie-charts allow the Geoscientist to drill down in more detail. This raw ‘DNA’ is used by the data-driven pattern algorithm in OpportunityFinder® to surface potential plays, leads and opportunities that may not be obvious. The geoscientist can browse these, stimulating lines of thought that might not have occurred without the algorithm.

This may provide a ‘fast start’ to organizations and aid companies with geoscience exploration. The algorithm (Python) can plug in to the search and discovery approaches an organization already uses, and an organization can also fork its own version of OpportunityFinder® should that be required.

There are also opportunities to target a variety of geological themes not currently addressed should that be of interest.

Update on OpportunityFinder – detecting petroleum exploration geoscience opportunities in text


The Python based algorithm has been further developed over the past 3 months with significant expansion in a number of areas:

  • Machine Learning: Refinement of the machine learning models using over 4,000 labelled sentences
  • Taxonomy: Expansion of the petroleum geoscience play element lexicon / taxonomy (source rock, maturation, migration, reservoir, trap, seal and hydrocarbon occurrence) to over 15,000 terms and phrases, the most extensive in the industry for this specific area.
  • Natural Language Processing (NLP): State-of-the-art inductive, data-driven geotagging for geoscience and geographical entities using NLP rules, plus state-of-the-art inductive, data-driven lithostratigraphy extraction using NLP rules.
  • Ranking: An 11-dimension ranking method for the ‘non-obviousness’ or ‘interestingness’ of potential opportunities, tested on Geoscientists. It is driven by the philosophy of showing Geoscientists ‘something they don’t already know’.
  • Patent Pending method: Further optimized in Python with a unique hierarchical hash-table approach. On one public domain petroleum geoscience collection of several hundred thousand sentences, the algorithm finished in just under 5 minutes on a cheap $500 i5 laptop after completing over 50 billion possible permutations/computations.
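The patent pending method itself is not described here, so the following is only a generic sketch of why hierarchical hash tables help with this kind of workload. It assumes tagged sentences carry a region, a stratigraphic unit and a play element (all hypothetical fields), and stores them in nested Python dictionaries so that candidate combinations are found by key lookup rather than by comparing every sentence with every other sentence.

```python
from collections import defaultdict


def build_index(records):
    """Nested hash table: region -> unit -> element -> set of sentence ids."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for sent_id, region, unit, element in records:
        index[region][unit][element].add(sent_id)
    return index


def units_with_elements(index, required):
    """Yield (region, unit) buckets that contain every required element."""
    required = set(required)
    for region, units in index.items():
        for unit, elements in units.items():
            # Iterating the inner dict yields its keys (the play elements).
            if required.issubset(elements):
                yield region, unit


# Hypothetical tagged sentences: (sentence id, region, unit, play element).
records = [
    (1, "Basin A", "Miocene", "reservoir"),
    (2, "Basin A", "Miocene", "seal"),
    (3, "Basin A", "Miocene", "trap"),
    (4, "Basin A", "Eocene", "reservoir"),
    (5, "Basin B", "Miocene", "seal"),
]

index = build_index(records)
matches = list(units_with_elements(index, ["reservoir", "trap", "seal"]))
print(matches)
```

Because each level of the hierarchy is a hash lookup, the work grows with the number of populated buckets rather than with the number of possible sentence combinations, which is one plausible reason an approach of this kind can run quickly on modest hardware.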

This enables any company to take a collection of reports and papers (any unstructured text) and almost immediately create high-quality structured data (potential knowledge) that can be displayed spatially on a map using existing applications (Fig 1). This complements the “long list” of document search results produced by a traditional search engine.


Fig 1 – Fast delivery of value: exploiting unstructured information.

The information above (Fig 1) has been created entirely from unstructured text in geoscience reports. Areas in red indicate hydrocarbon occurrence, black is source rock, greens are trap/seal, yellow is reservoir, and greys indicate evidence for migration/maturation. These data are one of three outputs that OpportunityFinder produces.
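To show what map-ready structured data of this kind might look like, here is a minimal sketch that converts extracted records into GeoJSON, a format most mapping applications can load. The output format actually used by OpportunityFinder is not stated, so GeoJSON is an assumption, as are the record fields, the coordinates and the function names. The colour mapping follows the legend described for Fig 1, simplified to one colour name per element.

```python
import json

# Colour per play element, following the Fig 1 legend in the text.
# Single colour names simplify the "greens" and "greys" mentioned there.
ELEMENT_COLOURS = {
    "hydrocarbon_occurrence": "red",
    "source_rock": "black",
    "trap": "green",
    "seal": "green",
    "reservoir": "yellow",
    "migration": "grey",
    "maturation": "grey",
}


def to_feature(record):
    """Convert one extracted record into a GeoJSON Point feature."""
    return {
        "type": "Feature",
        # GeoJSON positions are ordered [longitude, latitude].
        "geometry": {
            "type": "Point",
            "coordinates": [record["lon"], record["lat"]],
        },
        "properties": {
            "element": record["element"],
            "colour": ELEMENT_COLOURS[record["element"]],
            "evidence": record["evidence"],
        },
    }


def to_feature_collection(records):
    """Wrap all records in a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [to_feature(r) for r in records],
    }


# Hypothetical records; coordinates and evidence text are invented.
records = [
    {"lon": 10.0, "lat": 50.0, "element": "seal",
     "evidence": "thick salt caps the reef"},
    {"lon": 10.5, "lat": 50.2, "element": "reservoir",
     "evidence": "porous reef limestone"},
]

collection = to_feature_collection(records)
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Keeping the supporting sentence in the feature properties lets the geoscientist click a point on the map and see the textual evidence behind it.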

Whilst there is plenty of scope to build bespoke user interfaces, the advantage of OpportunityFinder is that it delivers data from unstructured text which is ready to load into existing company applications. It does not just extract data; it surfaces patterns that could lead to new business opportunities.

OpportunityFinder is available under a yearly license or a one-off perpetual license that allows a company to fork its own version, which it then owns. Components of OpportunityFinder can also be plugged into existing technologies a company may have licensed or developed, to supercharge initiatives to make search ‘intelligent’.

Quick Proofs of Concept (PoCs) can be done remotely for a ‘fast start’ to support a company’s existing digital transformation initiatives.

Next steps include creating a version of OpportunityFinder for the mining industry.

For more information please contact:

About Infoscience Technologies Ltd

The tech start-up was founded by Dr Paul Cleverley in Nov 2018 and is based in Oxford (UK), with a focus on Artificial Intelligence (AI) applied to unstructured text.

The company can also work as a remote R&D lab with existing in-house innovation teams, building proprietary algorithms and technologies.

Target sectors range from geological surveys, petroleum exploration, metals and mining through to hydrogeology, geohealth, renewables and space exploration.

The company’s unique selling point is its deep geoscience domain knowledge, combined with computer science, data science, business analysis and internationally recognised research.


Geoscience Digital Transformation – Text Mining Presentation 23rd March 2020, Geol Society of London


Looking forward to presenting at the Finding Petroleum conference on the 23rd March 2020 at the Geological Society of London.


Most of the published literature on text mining in exploration geoscience focuses on extraction of data or concepts, typically within the sentence or document ‘container’. There are no known approaches that look for combinatorial patterns within and across documents that may indicate an opportunity (such as a hydrocarbon play, lead or idea). A ‘DNA profiling’ inspired data science technique combining machine learning and natural language processing will be presented. These techniques may facilitate data-driven ideation for the geoscientist and stimulate a line of thought that would not otherwise have occurred without the algorithm.

Predicting Hydrocarbon Play Types from Unstructured Text

This post looks at predicting hydrocarbon plays from text using machine learning and natural language processing. I recently tested the OpportunityFinder algorithm on a selection of public domain geoscience literature. Only literature published between 1990 and 2010 was used, some time before a major gas discovery was made in the area. The question was whether the algorithm could surface the existence of a ‘play type’, with supporting evidence, well in advance of its discovery.

The algorithm surfaced the play type of [RESERVOIR]-[TRAP]-[SEAL]: Miocene shallow water limestones in atoll-like reef structures capped by thick salt in the area. This is similar in nature to what was subsequently found through exploration. Evidence of gas was picked up through seafloor pockmarks, present where the salt was absent or where faults provided a pathway. There were no pockmarks in the vicinity of the discovery, which is potentially evidence of a good seal with gas trapped below the salt. This perhaps hints at what may be achieved through these types of text mining techniques.
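The idea of a pattern that only emerges when several documents are read together can be sketched as follows. This is a simplified illustration under stated assumptions, not the method used in the test: the mention records, the area names and the function `cross_document_patterns` are hypothetical, and the only parts taken from the text are the 1990 to 2010 publication window and the [RESERVOIR]-[TRAP]-[SEAL] pattern.

```python
from collections import defaultdict

PATTERN = ("reservoir", "trap", "seal")


def cross_document_patterns(mentions, pattern=PATTERN, start=1990, end=2010):
    """Find areas where every pattern element is mentioned in the date window."""
    # area -> element -> set of document ids that mention it
    by_area = defaultdict(lambda: defaultdict(set))
    for m in mentions:
        if start <= m["year"] <= end:
            by_area[m["area"]][m["element"]].add(m["doc"])

    results = {}
    for area, elements in by_area.items():
        if all(e in elements for e in pattern):
            docs = set().union(*(elements[e] for e in pattern))
            # True if one document on its own already covers the whole pattern.
            single = any(all(d in elements[e] for e in pattern) for d in docs)
            results[area] = {
                "documents": sorted(docs),
                "cross_document": not single,
            }
    return results


# Hypothetical tagged mentions; documents, years and areas are invented.
mentions = [
    {"doc": "paper_1", "year": 1994, "area": "Area X", "element": "reservoir"},
    {"doc": "report_2", "year": 2001, "area": "Area X", "element": "seal"},
    {"doc": "paper_3", "year": 2008, "area": "Area X", "element": "trap"},
    {"doc": "paper_4", "year": 2015, "area": "Area Y", "element": "reservoir"},
    {"doc": "paper_4", "year": 2015, "area": "Area Y", "element": "trap"},
    {"doc": "paper_4", "year": 2015, "area": "Area Y", "element": "seal"},
]

found = cross_document_patterns(mentions)
print(found)
```

In this toy data, Area X is reported because three different documents each contribute one element of the play, while Area Y is excluded because its only document falls outside the publication window.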

No claims are made that algorithms can provide an ‘x’ marks the spot. AI is generally unimaginative and lacks the retroductive reasoning of a geoscientist. However, what algorithms can do is ‘read’ more reports and papers than a person could feasibly read in a lifetime, joining the dots to surface subtle, potentially interesting patterns that spark ideas. These suggestions may point the geoscientist towards a line of thinking they may not have had otherwise. We may be just scratching the surface of what is possible.