August & September  2019, 12(4&5): 823-836. doi: 10.3934/dcdss.2019055

Uyghur morphological analysis using joint conditional random fields: Based on small scaled corpus

1. Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China

2. University of Chinese Academy of Sciences, Beijing 100049, China

3. Institute of Mathematics and Information of Hotan Teachers College, Hotan 848000, China

* Corresponding author: Ghalip Abdukerim

Received June 2017; Revised October 2017; Published November 2018

Uyghur morphological analysis is a fundamental task in natural language processing. Given a sentence, it determines the part of speech (POS) of each word, segments the word into morphemes (stem and affixes), and automatically annotates the grammatical function of each morpheme according to its context. Its output supplies information needed by other natural language processing tasks, including syntactic analysis, machine translation, automatic summarization, and semantic analysis. To improve the efficiency of morphological analysis, this paper proposes a hybrid approach that builds a statistical model for Uyghur morphological tagging from a small-scale corpus. Experimental results show that the approach achieves an overall accuracy of 92.58% with a limited training corpus.

Citation: Ghalip Abdukerim, Eziz Tursun, Yating Yang, Xiao Li. Uyghur morphological analysis using joint conditional random fields: Based on small scaled corpus. Discrete & Continuous Dynamical Systems - S, 2019, 12 (4&5) : 823-836. doi: 10.3934/dcdss.2019055

Figure 1.  The morphological analysis result and hierarchical relationship of a Uyghur sentence
Figure 2.  The Architecture of a semi-supervised morphological analysis based on the hybrid approach
Figure 3.  Morphological Tag Decoding Process of Words in the Sentence
Figure 4.  The Relationship between Parameter $\beta$ and Accuracy
Table 1.  Feature Template of POS Tagging Model

| Features | Description |
|---|---|
| $w_{i-2}pos_{i}$, $w_{i-1}pos_{i}$, $w_{i}pos_{i}$, $w_{i+1}pos_{i}$, $w_{i+2}pos_{i}$ | Unary context features of the word |
| $w_{i-2}w_{i-1}pos_{i}$, $w_{i-1}w_{i}pos_{i}$, $w_{i}w_{i+1}pos_{i}$, $w_{i+1}w_{i+2}pos_{i}$, $w_{i-1}w_{i+1}pos_{i}$ | Binary context features of the word |
| $h_1(w_i)pos_{i}$, $h_2(w_i)pos_{i}$, $h_3(w_i)pos_{i}$, $h_4(w_i)pos_{i}$, $h_5(w_i)pos_{i}$ | n characters selected from the beginning of the word |
| $t_1(w_i)pos_{i}$, $t_2(w_i)pos_{i}$, $t_3(w_i)pos_{i}$, $t_4(w_i)pos_{i}$, $t_5(w_i)pos_{i}$ | n characters selected from the end of the word |
| $pos_{i-1}pos_{i}$ | POS tag transition feature |

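
The observation features in Table 1 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the `<PAD>` symbol for out-of-range positions, the feature string format, and the choice to skip `h_n`/`t_n` when the word is shorter than n characters are all assumptions. The tag transition feature $pos_{i-1}pos_{i}$ and the pairing of each observation with $pos_{i}$ are typically handled by the CRF toolkit itself, so they are not generated here.

```python
from typing import List

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def word_at(words: List[str], j: int) -> str:
    """Return the word at position j, or PAD when j is out of range."""
    return words[j] if 0 <= j < len(words) else PAD


def pos_features(words: List[str], i: int, max_affix: int = 5) -> List[str]:
    """Build observation features for word i, following the Table 1 template."""
    feats = []
    # Unary context features: w[i-2] .. w[i+2]
    for k in (-2, -1, 0, 1, 2):
        feats.append(f"w[{k}]={word_at(words, i + k)}")
    # Binary context features: the five word pairs listed in Table 1
    for a, b in ((-2, -1), (-1, 0), (0, 1), (1, 2), (-1, 1)):
        feats.append(
            f"w[{a}]|w[{b}]={word_at(words, i + a)}|{word_at(words, i + b)}"
        )
    # h_n / t_n: first and last n characters of the current word, n = 1..5
    # (assumption: skipped when the word has fewer than n characters)
    word = words[i]
    for n in range(1, max_affix + 1):
        if len(word) >= n:
            feats.append(f"h{n}={word[:n]}")
            feats.append(f"t{n}={word[-n:]}")
    return feats


# Illustrative Latin-script tokens, not taken from the paper's corpus
sentence = ["men", "kitabni", "oqudum", "."]
features = pos_features(sentence, 1)
```

Each returned string is one observation that a CRF trainer would pair with the candidate tag $pos_{i}$.
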
Table 2.  Feature Template of the Morphological Tagging Model

| Features | Description |
|---|---|
| $m_{i-2}t_{i}$, $m_{i-1}t_{i}$, $m_{i}t_{i}$, $m_{i+1}t_{i}$, $m_{i+2}t_{i}$ | Unary context features of the morpheme |
| $m_{i-2}m_{i-1}t_{i}$, $m_{i-1}m_{i}t_{i}$, $m_{i}m_{i+1}t_{i}$, $m_{i+1}m_{i+2}t_{i}$, $m_{i-1}m_{i+1}t_{i}$ | Binary context features of the morpheme |
| $t_{i-1}t_{i}$ | Morphological tag transition feature |

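
The Table 2 template applies the same windowing idea to a morpheme sequence. The sketch below builds one feature dictionary per morpheme, a shape that several Python CRF wrappers accept. It is an assumption-laden illustration rather than the paper's code: the `<B>` boundary marker, the key names, and the example segmentation are hypothetical, and the tag transition feature $t_{i-1}t_{i}$ is again left to the CRF toolkit.

```python
from typing import Dict, List

BOUNDARY = "<B>"  # hypothetical marker for positions outside the sequence


def morpheme_features(morphemes: List[str]) -> List[Dict[str, str]]:
    """Return one feature dict per morpheme, following the Table 2 template."""

    def m(j: int) -> str:
        # Morpheme at position j, or the boundary marker when out of range
        return morphemes[j] if 0 <= j < len(morphemes) else BOUNDARY

    rows = []
    for i in range(len(morphemes)):
        feats = {}
        # Unary context features: m[i-2] .. m[i+2]
        for k in (-2, -1, 0, 1, 2):
            feats[f"m[{k}]"] = m(i + k)
        # Binary context features: the five morpheme pairs listed in Table 2
        for a, b in ((-2, -1), (-1, 0), (0, 1), (1, 2), (-1, 1)):
            feats[f"m[{a}]|m[{b}]"] = f"{m(i + a)}|{m(i + b)}"
        rows.append(feats)
    return rows


# Illustrative segmentation only, not a gold analysis from the paper
rows = morpheme_features(["kitab", "ni", "oqu", "dum"])
```
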
Table 3.  List of Morphological Tag Candidates of Words in the Sentence
Table 4.  Manually Tagged Corpus Format and Content Example
Table 5.  Details of Experimental Data

| Data set | Number of sentences | Number of words (including punctuation marks) | Number of Uyghur words |
|---|---|---|---|
| Training set | 1000 | 12433 | 10391 |
| Development set | 200 | 2564 | 2151 |
| Test set | 200 | 2492 | 2075 |

Table 6.  Experimental Results (accuracy, %)

| Method | Stemming | Morpheme segmentation | POS | Overall |
|---|---|---|---|---|
| Tag sequence Markov model | 90.18 | 83.25 | 86.17 | 75.13 |
| Joint CRF model | 91.98 | 85.79 | 92.7 | 77.95 |
| Tag sequence Markov model, $\alpha$=0.95 | 92.65 | 88.47 | 88.12 | 79.65 |
| Joint CRF model, $\alpha$=0.9 | 92.85 | 89.76 | 92.6 | 80.73 |

Table 7.  Analysis for the Influence of Filtering Rules on Morphological Tagging (accuracy, %)

| Method | Stemming | Morpheme segmentation | POS | Overall |
|---|---|---|---|---|
| Joint CRF model, $\alpha$=0.9, $\beta$=0.1, filtering rules not used | 92.85 | 89.76 | 92.6 | 80.73 |
| Joint CRF model, $\alpha$=0.9, $\beta$=0.1, filtering rules used | 97.4 | 94.58 | 96.35 | 92.58 |
| Tag sequence transition model, $\alpha$=0.95, filtering rules used | 94.35 | 93.22 | 94.78 | 91.81 |

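
To put the Table 7 figures in perspective, the short calculation below compares the overall accuracy of the joint CRF model without and with filtering rules. The helper `error_reduction` is a hypothetical name introduced for this illustration, and relative error reduction is a common way to read such tables rather than a metric reported in the source.

```python
def error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative error reduction (%) when accuracy rises from acc_before to acc_after."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before


# Overall accuracy of the joint CRF model, taken from Table 7
without_rules = 80.73
with_rules = 92.58

overall_gain = with_rules - without_rules
overall_rer = error_reduction(without_rules, with_rules)
print(f"{overall_gain:.2f} points gained, {overall_rer:.1f}% fewer errors")
```

Under these numbers the filtering rules add about 11.85 points of overall accuracy, which corresponds to removing roughly 61% of the remaining errors.
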