hughie

A tech-loving newbie, jotting down what I learn and think.

HanLP - a multilingual natural language processing toolkit for production environments

Introduction#

This article provides a brief introduction to HanLP.

HanLP is a multilingual natural language processing toolkit that supports tasks such as simplified/traditional Chinese conversion, pinyin conversion, text processing, and semantic analysis.



Main Content#

1. What is HanLP#

HanLP is a multilingual natural language processing toolkit for production environments. It supports a wide range of tasks, including Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, simplified-traditional Chinese conversion, pinyin conversion, and more. HanLP is known for its comprehensive functionality, high accuracy, efficiency, up-to-date corpora, clear architecture, and customizability.

2. HanLP Features#

HanLP is characterized by its comprehensive functionality, high accuracy, efficiency, up-to-date corpus, clear structure, and strong customizability. With the support of the world's largest multilingual corpus, HanLP 2.1 supports ten joint tasks and multiple single tasks for 130 languages, including Simplified Chinese, Traditional Chinese, English, Japanese, Russian, French, and German. HanLP has pre-trained dozens of models for more than a dozen tasks and continues to iterate on the corpus and models.

3. HanLP Usage and Training#

HanLP provides two types of APIs: RESTful and native, catering to lightweight and large-scale scenarios, respectively. Regardless of the API or programming language used, HanLP keeps the interface semantics consistent and the code open source. It runs on CPUs, though GPUs/TPUs are recommended. The native API is installed as the PyTorch-based Python package; its input unit is the sentence, so longer documents must first be split into sentences, either with a multilingual sentence segmentation model or with a rule-based sentence segmentation function. Since the semantics of the RESTful and native APIs are completely consistent, users can switch between them seamlessly.
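As a rough illustration of the rule-based option, a naive splitter that cuts after terminal punctuation might look like the sketch below. This is my own toy example, not HanLP's implementation; HanLP ships its own rule-based and model-based sentence segmenters.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive rule-based sentence splitter (toy sketch, not HanLP's own).

    Splits after Chinese or Western sentence-ending punctuation, keeping
    the punctuation attached to the preceding sentence. Note this will
    wrongly split on periods inside abbreviations, which is exactly why
    model-based segmentation exists.
    """
    parts = re.split(r'(?<=[。！？.!?])\s*', text)
    return [p for p in parts if p]

print(split_sentences('今天天气不错。我们去公园吧!'))
```

In practice one would only fall back to rules like these for clean, well-punctuated text; noisy input is better served by a trained segmentation model.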

The native API is essentially local Python execution:

# pip install hanlp
import hanlp

# Load a small Chinese multi-task model covering tokenization, POS tagging,
# NER, semantic role labeling, dependency/semantic parsing, and constituency parsing.
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)

# Choose the desired tasks, e.g. 'tok' for tokenization and 'ner' for named entity
# recognition. Each task offers multiple models corresponding to different annotation standards.
text = '商品和服务'  # sample input sentence
hanlp_outs = HanLP(text, tasks=['tok/fine', 'ner/msra'])
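The call above returns a dict-like Document keyed by task name, so each task's result can be read out like a dictionary entry. The stand-in below sketches that access pattern without downloading a model; the values are hand-written examples, not real model output.

```python
# Hand-written stand-in for the dict-like Document returned by HanLP;
# the values are illustrative only, not produced by an actual model run.
doc = {
    'tok/fine': ['商品', '和', '服务'],
    'ner/msra': [],  # no named entities in this toy sentence
}

# Each requested task's output is accessed by its task key.
tokens = doc['tok/fine']
print('/'.join(tokens))  # 商品/和/服务
```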

4. Conclusion#

Recently, while working on related subtasks, I researched HanLP and found that it has extensive functionality and performs well. However, it requires parameter adjustments based on specific projects.

That being said, traditional NLP projects have undoubtedly been shaken up since the release of ChatGPT. ChatGPT can handle most of what traditional NLP pipelines do, often more simply and with broader capabilities (which can be a headache for NLP practitioners).

Nevertheless, traditional NLP still has its uses, and for various reasons, not all projects can leverage ChatGPT or other LLM models.


Lastly#

References:

HanLP official website


Disclaimer#

This article is solely for personal learning purposes.

This article is synchronized with HBlog.
