
Natural Language Processing (NLP) is a field of computer science concerned with the interaction between human languages and computers. Sindhi NLP deals with the Sindhi language and addresses its computational linguistics problems. Sindhi is one of the oldest languages of the world; it has an alphabet of fifty-two letters and room to adopt lexicons from several other languages. It is written, read and spoken all over the world. Sindhi is grammatically complex and morphologically rich. Its grammar differs from that of English and other languages, and even the meaning and sense of Sindhi lexicons differ. Diacritics change the meaning, number and gender of Sindhi lexicons.
Most research studies concentrate on English text parsing and sentiment analysis. A variety of reliable resources for English, including texts and tools for syntactic parsing and sentiment analysis, are available. However, online resources for languages other than English, such as Sindhi, remain limited even in this digital era. The style, grammatical structure and domain of Sindhi differ from those of other languages. Working on Sindhi text parsing, word tokenization, part-of-speech tagging, a Sindhi WordNet, a corpus and sentiment analysis is therefore not the same as working on English text. This difference raises research questions for work on Sindhi text, as Sindhi is one of the oldest and most morphologically rich languages of the world and is written in both Arabic and Devanagari scripts.
The main focus of this research is on understanding the challenges and issues in developing a Sindhi parser, statistical analysis, Sindhi and universal part-of-speech (SPOS and UPOS) tagging, and word tokenization. Because no Sindhi corpus or Sindhi WordNet is properly available online, developing a Sindhi WordNet, a Sindhi parser and a Sindhi corpus for text analysis is itself a challenging task. The Sindhi NLP website currently provides two tools for research purposes. Because these tools were developed only for research studies, their lexicon is limited, so not every type of word and sentence can be analyzed and described.
Using these tools, further research may be conducted on machine translation, information retrieval, and text and lexicon analysis. The tools may also provide a platform for connecting right-to-left written languages with English and other languages. In this digital era, more research is needed on all languages of the world in order to develop common UPOS and lexicon analysis tools. My research interests are computational linguistics, data mining, machine learning, deep learning and artificial intelligence.
Online Sindhi Parser: A language parser is a program that describes the structure of text in a language. It is a natural language processing tool that divides a sentence into segments according to its grammatical structure. This segmentation is called word tokenization. Sindhi online resources have been growing day by day on the internet since the development of a Unicode-based keyboard.
The presented Sindhi online syntactic parser is a program that describes Sindhi text through segmentation, tokenization, grammatical tagging, syntactic parsing, and statistical and morphological analysis. The parser uses UPOS (Universal Part of Speech) and SPOS (Sindhi Part of Speech) tags to tag and syntactically parse Sindhi text. Four algorithms were used to develop this tool. The tokenization algorithm splits the Sindhi text into independent tokens and assigns them sequence numbers. The tagging algorithm assigns UPOS and SPOS tags to the Sindhi text after segmentation. The syntactic parsing algorithm identifies Sindhi tokens, assigns them a phrase and a UPOS tag, and extends the parse tree hierarchically. The statistical algorithm measures the execution time, the number of tokens, and the frequencies of phrases, UPOS tags and morphological forms of Sindhi words. A Sindhi keyboard is available on the site to show the characters of the Sindhi language, and the text box accepts only Sindhi characters. The parser performs all of the described functions on the basis of the Sindhi words available in its database. As mentioned above, the words in the database are limited, so not every type of sentence can be analyzed.
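The tokenization, tagging and statistical steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the online parser: the function names, the tiny lexicon and the punctuation set are all assumptions, and unknown words are simply given the UPOS tag `X`. The SPOS label "ضمير" is taken from this page; SPOS labels for the nouns are left unfilled rather than guessed.

```python
import time
from collections import Counter

# Hypothetical toy lexicon: word -> (UPOS, SPOS).
# The real parser's database is not public; these entries are illustrative.
LEXICON = {
    "هي": ("PRON", "ضمير"),
    "پيار": ("NOUN", None),
    "محبت": ("NOUN", None),
}

# Assumed punctuation set (Arabic-script and Latin marks).
PUNCTUATION = "۔،؟!.,"


def tokenize(text):
    """Split on whitespace, strip punctuation, number tokens from 1."""
    words = (w.strip(PUNCTUATION) for w in text.split())
    return [(i, w) for i, w in enumerate((w for w in words if w), start=1)]


def tag(tokens):
    """Attach (UPOS, SPOS) from the lexicon; unknown words get UPOS 'X'."""
    return [(i, w, *LEXICON.get(w, ("X", None))) for i, w in tokens]


def statistics(tagged, elapsed):
    """Token count, UPOS frequencies and elapsed time in seconds."""
    return {
        "tokens": len(tagged),
        "upos_frequencies": Counter(upos for _, _, upos, _ in tagged),
        "seconds": elapsed,
    }


def analyse(text):
    """Run tokenization and tagging, then collect statistics."""
    start = time.perf_counter()
    tagged = tag(tokenize(text))
    return tagged, statistics(tagged, time.perf_counter() - start)
```

Calling `analyse` on a string of known words returns numbered, tagged tokens plus a small statistics dictionary. A lexicon lookup like this also shows why a limited database limits coverage: any word missing from `LEXICON` falls back to `X`.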
Sindhi WordNet: Sindhi WordNet is a lexical database of Sindhi nouns, adjectives, adverbs and verbs. It groups Sindhi words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. Stemming, hyponyms and lemmatization of Sindhi words are available in Sindhi WordNet. WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms, that is, strings of letters, but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than similarity of meaning.
The main relation among words in WordNet is synonymy, such as the relation between the words love and affection, or in Sindhi between پيار and محبت. Synonyms, words that denote the same concept and are interchangeable in many contexts, are grouped into unordered sets called synsets. Each synset of WordNet is linked to other synsets by means of a small number of "conceptual relations." Additionally, a synset contains a brief definition and, in most cases, one or more short sentences illustrating the use of the synset members. Word forms with several distinct meanings are represented in as many distinct synsets. Thus, each form-meaning pair in WordNet is unique. A hypernym is a word with a broad meaning that constitutes a category into which words with more specific meanings fall; it is also called a superordinate. The more specific words are its hyponyms. For example, colour is a hypernym of red, blue and others, and red is a hyponym of colour.
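As a rough illustration of the synset and hypernym relations described above, the following Python sketch stores a handful of synsets in memory. The `Synset` class, the synset names and the helper functions are hypothetical and do not reflect how Sindhi WordNet is actually stored; the example words are the ones used on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of interchangeable lemmas plus links to broader synsets."""
    name: str
    lemmas: list
    hypernyms: list = field(default_factory=list)  # names of broader synsets


# Illustrative data only; synset names such as "love.sd" are made up.
SYNSETS = {
    s.name: s
    for s in [
        Synset("love.en", ["love", "affection"]),
        Synset("love.sd", ["پيار", "محبت"]),
        Synset("colour", ["colour"]),
        Synset("red", ["red"], hypernyms=["colour"]),
        Synset("blue", ["blue"], hypernyms=["colour"]),
    ]
}


def synonyms(word):
    """All other lemmas that share a synset with the given word."""
    return sorted(
        lemma
        for synset in SYNSETS.values()
        if word in synset.lemmas
        for lemma in synset.lemmas
        if lemma != word
    )


def hyponyms(name):
    """Names of the synsets that list the given synset as a hypernym."""
    return sorted(s.name for s in SYNSETS.values() if name in s.hypernyms)
```

Here `synonyms("love")` yields `["affection"]` and `hyponyms("colour")` yields `["blue", "red"]`. Storing only the hypernym link and deriving hyponyms from it keeps the two directions of the relation consistent.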
Meronymy is the part-whole relation: for example, an arm is part of the human body. Parts are inherited from their superordinates: if the body has arms and hands are parts of arms, then hands are parts of the body as well. Parts are not inherited "upward", as they may be characteristic only of specific kinds of things rather than of the class as a whole.
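The body, arm and hand example can be sketched as a chain of part-of links. This minimal Python sketch reads the inheritance of parts as following those links transitively, which is an assumption about the intended meaning; the `PART_OF` table and the `wholes` function are hypothetical names.

```python
# part -> whole, using the example from the text above
PART_OF = {
    "arm": "human body",
    "hand": "arm",
}


def wholes(part):
    """Follow part-of links upward and list every whole containing `part`."""
    chain = []
    while part in PART_OF:
        part = PART_OF[part]
        chain.append(part)
    return chain
```

With this table, `wholes("hand")` returns `["arm", "human body"]`, while `wholes("human body")` returns an empty list because nothing above it is recorded.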
Sindhi Lemma: Computational linguistics defines lemmatisation as the process of determining the lemma of a word based on its intended meaning and its part of speech. A lemma is the basic, meaningful form of a word. Stemming differs from lemmatisation because a stemmer reduces a word without knowledge of its context. In many languages, words appear in several inflected forms. For example, in English, the verb 'to laugh' may appear as 'laugh', 'laughed', 'laughs' or 'laughing'. The base form, 'laugh', that one might look up in a dictionary, is called the lemma of the word. The association of the base form with a part of speech is often called a lexeme of the word. Lemmatisation of Sindhi words removes their inflection and shows the base form. For example, "هيءُ", "هيءَ" and "هي" share the base form "هي", which has the Sindhi part of speech "ضمير" (determiner / pronoun). Likewise, the verb "کلڻ" may appear as "کليو", "کلي" or "کل". The base form is "کل", and this lexeme is associated with the Sindhi part of speech "" (noun).
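The contrast between lemmatisation and stemming can be illustrated with a small Python sketch. The lemma table below contains only the forms quoted above, and the suffix-stripping stemmer is a deliberately naive stand-in for English; neither is the method used by Sindhi WordNet, and all names are hypothetical.

```python
# Inflected form -> lemma, limited to the examples given in the text.
LEMMAS = {
    "laugh": "laugh",
    "laughed": "laugh",
    "laughs": "laugh",
    "laughing": "laugh",
    "هيءُ": "هي",
    "هيءَ": "هي",
    "هي": "هي",
    "کليو": "کل",
    "کلي": "کل",
    "کل": "کل",
}


def lemmatize(word):
    """Dictionary lookup: return the known lemma, or the word unchanged."""
    return LEMMAS.get(word.lower(), word)


def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching English suffix, ignoring context."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Both functions map "Laughing" to "laugh", but they diverge on a word such as "news": the stemmer blindly strips the final "s" and returns "new", whereas the lemmatizer leaves the word alone because it has no entry for it.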

I am doing my doctoral research on Sindhi Natural Language Processing (NLP) at the Shaheed Zulfikar Ali Bhutto Institute of Science and Technology (SZABIST), Karachi campus, Sindh, Pakistan. I am working as an Assistant Professor in the Department of Computer Science and Information Technology, Benazir Bhutto Shaheed University, Karachi, Sindh, Pakistan. The presented tools are a basic requirement of my research study. They may be beneficial for anyone who wants to conduct research on the Sindhi language and Sindhi text. The online Sindhi Parser and WordNet fulfil the basic requirements of tokenization, tagging, syntactic parsing, word relations and stemming; however, more work is needed to upgrade these tools for research on universal dependencies and on semantic and sentiment analysis of Sindhi text.
My research areas are computational linguistics, data mining, machine learning, deep learning and artificial intelligence.

Engr. Dr. Asim Imdad Wagan is Associate Professor of Computer Science and Dean Academics at Mohammad Ali Jinnah University, Karachi, Sindh, Pakistan. He is a member of IEEE and of the Pakistan Engineering Council. He earned both his PhD and MS in Computer Science at the National Institute of Applied Sciences of Lyon (INSA de Lyon), France. He is an expert in various areas of computer science. His research interests include machine learning, data science, computer vision, image processing, soft computing and deep learning. His specialties are image, video and 3D processing and analysis, self-driving cars, image and video classification, and document processing.
Dr. Wagan is the supervisor of my doctoral research on computational linguistics issues of the Sindhi language at SZABIST, Karachi campus, Sindh, Pakistan. He has provided me with very good supervision, proper guidance, constant support, encouragement and a good environment for working on this research study.