Infoscience Technologies Ltd is a tech start-up providing Text Analytics Research & Development to companies seeking to extract, discover and exploit geoscience knowledge from text.
Target industries range from geological surveys, petroleum exploration, economic mining through to geohealth and space exploration.
The company’s unique selling point is its deep geoscience domain knowledge, combined with computer science, data science, business analysis and international award-winning research.
The Company currently licenses two methods (algorithms), GEOSAPIEN and GEOCLASSIFIER, both related to predicting geoscience entities or classes in unstructured text.
GEOSAPIEN® Patent Pending.
A Method and System for Generating Geological Lithostratigraphical Analogues using Theory-Guided Machine Learning from Unstructured Text.
Over 60% of geoscientists believe the identification of global analogues reduces risk and improves decision making in areas of sparse data (Sun and Wan 2002). They also help to challenge bias (Rose 2016). This is applicable in sectors such as petroleum, economic mining, geohealth and space exploration.
Analogues can uncover subtle opportunities that may not be apparent from any other technique or technology, with the ability to make global comparisons “casting a wider net than just what is close by” identified as particularly useful. Traditional Information Retrieval (IR) search engines are not well suited to find similar cases or analogues (Mukherjee et al 2013).
Manually curated analogue databases are useful, but as put by Greenberg (2011), manually created taxonomies and knowledge representations can blind us to new information discoveries. With information in reports in volumes too vast for geoscientists to read (Big Data), data-driven techniques that can scan through vast amounts of text may surface differentiating insights.
Geoscience information has high dimensionality, geoscience bodies are not as clear-cut as objects in some other domains, rare events are of interest, and geoscience processes can show long-memory characteristics in time (Karpatne et al 2017). Some research (Cleverley 2017) points to issues where unsupervised machine learning from text overfits and ignores the wealth of knowledge we have in geoscience. Anchoring unsupervised machine learning frameworks with high-level knowledge of what geoscientists find most useful for analogues may be preferable to avoid such overfitting issues.
The GEOSAPIEN® method uses a number of Geoscience Theories to inform the automated manipulation of text using Natural Language Processing (NLP) techniques in company reports and/or external literature, ‘boosting the geoscience signal’ before input into a text embedding model (such as word2vec, GloVe, fastText, ELMo, BERT). These models effectively use complex word co-occurrence statistics and deep learning to define a word by its co-occurring words, the words that co-occur with those, and so forth, representing each word as a vector. A key aspect is that GEOSAPIEN does this without significantly biasing or blinding us to new discoveries.
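The idea of defining a word by its co-occurring words can be illustrated with a minimal sketch. It is not the GEOSAPIEN method and not any of the embedding models named above; it only counts neighbouring words within a small window. The toy sentences, the `cooccurrence_vectors` function and the `window` parameter are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy sentences invented for illustration (not from any real corpus).
sentences = [
    "red mudstone overlies grey sandstone",
    "grey sandstone overlies black shale",
    "red mudstone contains calcite nodules",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own neighbour
                    vectors[word][tokens[j]] += 1
    return vectors


vectors = cooccurrence_vectors(sentences)
# Each Counter is a sparse "vector": the word described by its neighbours.
print(vectors["mudstone"].most_common(3))
```

Real embedding models compress such co-occurrence information into dense vectors of a few hundred dimensions, which is what makes similarity comparisons between words practical.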
This allows a geoscientist to ask questions such as “What are good analogues for color changes in muddy contourites?” The system does not return lists of documents, but instead returns answers (in the form of geological lithostratigraphic units) ranked by similarity to the geoscientist’s question. It does this by learning from company/external papers and reports that most closely match this context using vectors (not keywords).
In another use case, a geoscientist may be working on a particular geological formation (such as the Tasraft Formation) and wish to know “What other formations around the world are good analogues for this formation? And I don’t want Palaeozoic ones.” The system can return the most similar lithostratigraphic units it can find, ranked by similarity to the Tasraft Formation vector (subtracting the vector for Palaeozoic).
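Ranking by similarity to a formation vector with an unwanted concept subtracted can be sketched as follows. This is a toy illustration under stated assumptions, not the licensed method: the three-dimensional vectors, the candidate names (`Formation A` to `Formation C`) and the helper functions `cosine` and `rank_analogues` are all hypothetical, and real embeddings have far more dimensions with no directly readable meaning.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_analogues(query, exclude, candidates):
    """Rank candidates by similarity to (query - exclude), best first."""
    target = [q - e for q, e in zip(query, exclude)]
    scored = [(name, cosine(target, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented vectors; the third dimension loosely stands for "Palaeozoic".
tasraft = [0.8, 0.6, 0.1]
palaeozoic = [0.0, 0.0, 1.0]
candidates = {
    "Formation A": [0.7, 0.7, 0.0],  # similar, not Palaeozoic
    "Formation B": [0.7, 0.6, 0.9],  # similar, but strongly Palaeozoic
    "Formation C": [0.1, 0.2, 0.1],  # weakly related
}

ranking = rank_analogues(tasraft, palaeozoic, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data, subtracting the Palaeozoic vector pushes the strongly Palaeozoic candidate to the bottom of the ranking even though it is otherwise close to the query.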
The more text put into the model, the more accurate it is likely to be, with the GEOSAPIEN method providing a geoscience framework for machine learning – producing more useful geoscience analogues than pure unsupervised methods.
There are a number of ways to use and exploit GEOSAPIEN:
- Companies can license the Intellectual Property (IP) method and build their own technology systems (models and user interfaces) in their pipelines (or incorporate it into what they already have) with advice from Infoscience Technologies if needed.
- Companies can ask Infoscience Technologies to supply the technology and apply it to their text (and/or 3rd party subscription text the company has the rights to perform text and data mining on), thereby licensing both the technology and the IP.
- Variants on the above.
GEOCLASSIFIER™ Patent Pending.
A method and system for using geoscience and commercial sentence classification patterns for text summarisation.
Over 30,000 sentences (1 million+ features) from public domain petroleum geoscience literature and presentations were painstakingly labelled in a novel way, using a bespoke ontology of geoscience and commercial labels developed by Infoscience Technologies Ltd. A Machine Learning (ML) classifier was built from these training data, enabling the prediction of the topic for any sentence in petroleum geoscience reports, from tectonostratigraphy to company deals and from source rocks to biostratigraphy.
This method is superior to applying rule-based taxonomies to ‘classify’ text, as critically it takes into account the immediate and wider context of the sentence. This context is left behind when taxonomies or thesauri are created. It also avoids the ‘smoothing’ applied by approaches that simply attempt to classify the whole document to one type.
The sentence classifications are used to create a contextual profile ‘DNA’ that takes into account prior and post sentence classifications. This applies not only to any document, but also to topics within any search result list or to a corpus as a whole. This enables the automation of text summaries for petroleum geoscience in a way that differs from existing approaches, being more specifically tuned to a geoscientist or commercial analyst and better able to surface subtle patterns and items of interest.
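One simple way to picture such a profile is to count how often each label occurs in a document and which label tends to follow which. The sketch below is an assumption about what a profile could contain, not the GEOCLASSIFIER implementation; the `contextual_profile` function and the example label sequence are hypothetical.

```python
from collections import Counter


def contextual_profile(labels):
    """Summarise a sequence of per-sentence labels.

    Returns the share of sentences per label and the counts of
    (previous label, next label) pairs for adjacent sentences.
    """
    label_counts = Counter(labels)
    total = len(labels)
    shares = {label: count / total for label, count in label_counts.items()}
    transitions = Counter(zip(labels, labels[1:]))
    return shares, transitions


# Hypothetical predicted labels for five consecutive sentences.
labels = [
    "source rock",
    "source rock",
    "biostratigraphy",
    "company deal",
    "source rock",
]

shares, transitions = contextual_profile(labels)
print(shares)
print(transitions.most_common())
```

Comparing such profiles across documents, or across the items in a search result list, is one way in which differences in topic mix and ordering could be surfaced.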
The binary classification model can be licensed and easily plugged into a company’s existing Natural Language Processing (NLP) pipeline.
Alternatively, the labelled data set itself can be licensed for transfer learning, allowing internal company labelled examples to be combined with this industry set for fine-tuning if required.
As text is parsed, any sentence can simply be fed into the binary ML model in one line of Python code, returning predicted classification(s). Combining and visualizing all these predictions can enable geoscientists to make better sense of what is being returned by search engines in lists too long to read. They also increase the propensity for a system to show a geoscientist something ‘they don’t already know’.
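The calling pattern described above might look like the sketch below. The licensed model and its actual interface are not shown in this text, so `KeywordSentenceClassifier` is a hypothetical stand-in that only matches keywords (exactly the kind of rule-based shortcut a trained, context-aware classifier is meant to improve on). The class name, its `predict` method, the keyword lists and the example sentences are all assumptions.

```python
from collections import Counter


class KeywordSentenceClassifier:
    """Hypothetical stand-in for a licensed model; matches keywords only."""

    def __init__(self, keywords_by_label):
        self.keywords_by_label = keywords_by_label

    def predict(self, sentence):
        """Return every label whose keywords appear in the sentence."""
        text = sentence.lower()
        return [
            label
            for label, keywords in self.keywords_by_label.items()
            if any(keyword in text for keyword in keywords)
        ]


model = KeywordSentenceClassifier({
    "source rock": ["source rock", "kerogen"],
    "biostratigraphy": ["foraminifera", "palynology"],
    "company deal": ["acquired", "farm-in"],
})

# Invented sentences standing in for parsed report text.
sentences = [
    "The kerogen-rich shale is the main source rock in the basin.",
    "Foraminifera indicate a Late Cretaceous age.",
    "The operator acquired a 40% interest in the block.",
    "Kerogen type II dominates the lower interval.",
]

# One call per sentence, as text is parsed.
predictions = [model.predict(sentence) for sentence in sentences]

# Combine the predictions into counts that could feed a chart or summary.
summary = Counter(label for labels in predictions for label in labels)
print(summary.most_common())
```

Swapping the stand-in for a real trained model would leave the last two steps unchanged, provided that model exposes a comparable per-sentence prediction call.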
For more information contact: firstname.lastname@example.org
Unlocking the hidden codes in geoscience unstructured text