The GeoClassifier(R) Python algorithm, launched in December 2020 (for petroleum, mining and renewables), can automatically read the ‘body text’ of geoscience, subsurface & wells documentation (PDF, PPT, Word, Excel etc.) and:
- Classify by document type
- Classify by document category
- Classify by chronostratigraphy
- Classify by lithostratigraphy
- Classify by well / borehole name
- Classify by prospect / lead name
- Classify by survey name
- Classify by deposit / orebody name
- Classify by reservoir / aquifer name
- Classify by field name
- Classify by block / license name
- Classify by play name
- Classify by basin / geobody name
- Classify by area / region name
- Classify by country and region
- Categorise by discipline/topics
- Extract dates, people’s names & company names (out-of-the-box models)
- Classify to machine learnt topics (custom geoscience model)
- Also extract all of the names above that occur in the document if required
- Extract drilling & operation problems
- Many more features
The resultant tags can be used to help organise records & document management and improve search & discovery of geoscience, subsurface and wells documents.
The GeoClassifier algorithm achieves this in a novel way by combining several techniques:
– Knowledge Engineering (a taxonomy with thousands of clues for document types and categories)
– Machine Learning (an ensemble model with 250,000 labelled topic examples, plus custom spaCy NER models)
– Natural Language Processing (NLP) for state-of-the-art geoscience name extraction
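To make the knowledge engineering idea concrete, here is a minimal sketch of clue-based document typing. It is not the GeoClassifier implementation: the `CLUES` taxonomy, its weights and the function name `classify_document_type` are all hypothetical, and a real taxonomy would hold thousands of clues rather than six.

```python
import re

# Hypothetical clue taxonomy: document type -> {clue phrase: weight}.
# Both the phrases and the weights are illustrative assumptions.
CLUES = {
    "final well report": {
        "final well report": 3.0,
        "total depth": 1.0,
        "spud date": 1.0,
    },
    "core analysis report": {
        "core analysis": 3.0,
        "porosity": 1.0,
        "permeability": 1.0,
    },
}


def classify_document_type(text, clues=CLUES):
    """Sum the weights of every clue phrase found in the text.

    Returns (best_label, scores); best_label is None when no clue matched.
    """
    lowered = text.lower()
    scores = {}
    for label, phrases in clues.items():
        total = 0.0
        for phrase, weight in phrases.items():
            # Whole-phrase, case-insensitive match; each occurrence counts.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            total += weight * len(re.findall(pattern, lowered))
        scores[label] = total
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


label, scores = classify_document_type(
    "Final Well Report. Spud date 1 May; total depth reached at 3000 m."
)
print(label, scores)
```

A weighted clue count like this is transparent and needs no training data, which is one plausible reason to pair it with machine-learnt topic models rather than replace it.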
Taking a taxonomy or thesaurus built for manual tagging of documents and then trying to apply it automatically to text brings many limitations and problems. Unlike traditional methods (and taxonomies), GeoClassifier(R) was built from the ‘get-go’ for automated, not manual, document tagging, supporting digital transformation.
The Python algorithm can be applied immediately to diverse documentation from any geographical location, without using prior lists of names. Lists of names (e.g. well names) can optionally be added to improve detection.
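As an illustration of how list-free detection and an optional name list can work together, the sketch below combines a simple pattern with a lookup. The pattern, the function name `find_well_names` and the well names are invented for this example; GeoClassifier's actual name extraction is not described here and is certainly more sophisticated.

```python
import re

# Illustrative pattern only: a capitalised word, a hyphen, then digits
# with an optional trailing letter, e.g. "Alpha-1" or "Alpha-12A".
WELL_PATTERN = re.compile(r"\b[A-Z][a-z]+-\d+[A-Z]?\b")


def find_well_names(text, known_names=()):
    """Return well-like names found by pattern, plus any known names present."""
    found = set(WELL_PATTERN.findall(text))
    lowered = text.lower()
    for name in known_names:
        # Case-insensitive whole-name match for names the pattern would miss.
        pattern = r"(?<!\w)" + re.escape(name.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(name)
    return sorted(found)


text = "Logs from Alpha-1 and the Bravo Deep 2 sidetrack were reviewed."
print(find_well_names(text))
print(find_well_names(text, known_names=["Bravo Deep 2"]))
```

The first call relies on the pattern alone; the second shows how a supplied list can recover a name that does not follow the assumed naming convention.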
The algorithm can run standalone against files on a filesystem, and/or a company can take parts of it and embed them in their existing tooling, which may be more closely integrated with SharePoint / EDMS and search systems.
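For the standalone case, the first step is simply finding candidate files. The sketch below walks a folder tree with the standard library; the extension list and the name `iter_documents` are assumptions for illustration, and text extraction and classification would follow as separate steps.

```python
from pathlib import Path

# Assumed set of extensions to pick up; adjust to the formats in scope.
SUPPORTED = {".pdf", ".ppt", ".pptx", ".doc", ".docx", ".xls", ".xlsx"}


def iter_documents(root):
    """Yield every supported document under root, in sorted path order."""
    for path in sorted(Path(root).rglob("*")):
        # Compare suffixes case-insensitively so "REPORT.PDF" is included.
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


if __name__ == "__main__":
    for document in iter_documents("."):
        print(document)
```

Keeping discovery separate from classification is one way to make the classifier easy to reuse inside other tooling, where files arrive from a document management system instead of a folder walk.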
The algorithm also uses an automatic document scoring system, based on a number of criteria, to identify the documents that tend to be ‘most important’ from a search and document & records retention point of view. This can aid file system migration projects, as well as acquisitions, divestments and mergers.
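The scoring criteria are not listed here, so the following is only a sketch of what a multi-criteria importance score could look like. Every criterion, weight and name in it (`DocumentFacts`, `TYPE_WEIGHTS`, `importance_score`) is a hypothetical stand-in, not GeoClassifier's actual scheme.

```python
from dataclasses import dataclass


@dataclass
class DocumentFacts:
    doc_type: str        # e.g. the output of a document type classifier
    named_entities: int  # number of geoscience names found in the text
    page_count: int
    is_draft: bool


# Hypothetical weights per document type; unknown types fall back to 0.2.
TYPE_WEIGHTS = {"final well report": 1.0, "presentation": 0.4, "email": 0.1}


def importance_score(doc):
    """Combine assumed criteria into a score between 0 and 1."""
    type_part = TYPE_WEIGHTS.get(doc.doc_type, 0.2)
    entity_part = min(doc.named_entities, 20) / 20  # capped at 20 names
    length_part = min(doc.page_count, 50) / 50      # capped at 50 pages
    score = 0.5 * type_part + 0.3 * entity_part + 0.2 * length_part
    if doc.is_draft:
        score *= 0.5  # assumed penalty for draft versions
    return round(score, 3)


report = DocumentFacts("final well report", 20, 50, False)
email = DocumentFacts("email", 0, 0, True)
print(importance_score(report), importance_score(email))
```

Ranking documents by a score of this kind would let a migration or divestment project review the highest-scoring files first.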
More: contact@infosciencetechnologies.com
Patented next generation algorithms: The GeoClassifier(R) algorithm disrupts traditional document classification and extraction whilst OpportunityFinder(R) disrupts traditional business ideation processes, targeting associative extraction of petroleum, mining and renewables concepts and opportunities.