Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
This research summary is just one of many that are distributed weekly on the AI scholar newsletter. To start receiving the weekly newsletter, sign up here.
AI has no doubt come a long way, but some of its subfields are still works in progress. Natural language processing (NLP), which applies computational methods to learn, understand, and produce human language content, is one of those subfields that still has a long way to go.
There have been advances, and the study of human languages with computational approaches has grown, driven by the availability of open-source NLP toolkits such as CoreNLP, FLAIR, spaCy, and UDPipe. However, these toolkits suffer from several limitations.
Challenges of Existing NLP Toolkits
- Existing NLP toolkits support only a handful of major languages, which significantly limits the community’s ability to process multilingual text.
- The toolkits are also sometimes under-optimized for accuracy, which can mislead downstream applications and the insights obtained from them.
- Furthermore, they assume input text has already been annotated with other tools, lacking the ability to process raw text within a unified framework. This limits their applicability to text from diverse sources.
Introducing Stanza NLP Toolkit
In a recently published paper, researchers introduce Stanza, an open-source Python natural language analysis package. Stanza contains tools, which can be used in a pipeline, to convert a string of human language text into lists of sentences and words, to generate the base forms of those words along with their parts of speech and morphological features, to give a syntactic dependency parse, and to recognize named entities.
Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
The researchers say they trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested.
Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionalities to cover other tasks such as coreference resolution and relation extraction.
Stanza’s neural pipeline is notable not only for its wide coverage of human languages but also for its accuracy on all tasks, thanks to its language-agnostic, fully neural architectural design.
Stanza: Features and Benefits
- From raw text to annotations — Stanza features a fully neural pipeline that takes raw text as input, and produces annotations including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
- Multilinguality — Stanza’s architectural design is language-agnostic and data-driven, which allowed the researchers to release models supporting 66 languages by training the pipeline on the Universal Dependencies (UD) treebanks and other multilingual corpora.
- State-of-the-art performance — evaluated on a total of 112 datasets, Stanza’s neural pipeline adapts well to the text of different genres, achieving state-of-the-art or competitive performance at each step of the pipeline.
- Native Python implementation requiring minimal effort to set up.
- Pretrained neural models supporting 66 (human) languages.
- A stable, officially maintained Python interface to CoreNLP.
Source code, documentation, and pre-trained models for 66 languages are publicly available here.
Read more: A Python Natural Language Processing Toolkit
Thank you for reading! I value your comments and shares and would love to connect on Twitter, LinkedIn, and Facebook. For updates on the most recent and interesting Machine Learning research papers out there, subscribe to AI Scholar Weekly. Please 👏 if you enjoyed this article. Cheers!