Over the past few decades, machine learning theory and applications have been at the forefront of technology and artificial intelligence, spreading into commerce, production and service industries, and administration. Perhaps the most renowned areas of application for machine learning concepts are data mining and language recognition, with applications in financial prediction, adaptive industrial systems, the construction of web user profiles, and medical diagnosis. Machine learning has been growing in recognition due to its central role in a broad range of essential applications such as image recognition, natural language processing, and automated expert systems. Probing deeper into the applications of machine learning in natural language processing, this research focuses on the use of machine learning in voice, or speech, recognition.
Samuel (1959) defined machine learning as a field of study in computer science that gives computers the ability to learn without being explicitly programmed. Later, Michalski, Carbonell, and Mitchell (2013) refined this definition: a computer is said to learn from experience (E) with respect to a task (T) and a performance measure (P) if its performance on T, as measured by P, improves with experience E. Machine learning evolved from artificial intelligence through a series of studies on computational learning and pattern recognition theory (Samuel, 1959). It explores the construction and study of algorithms that can automatically learn from raw data and make predictions from it. Such algorithms can override static program instructions by making data-driven decisions and predictions (Samuel, 2000). These processes are made possible through the algorithmic construction of models from sample data.
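To make the (T, E, P) framing concrete, consider the following minimal sketch. It assumes scikit-learn and its bundled iris dataset purely for illustration; neither Samuel nor Mitchell ties the definition to any particular library or dataset.

```python
# A minimal illustration of the (T, E, P) definition of learning,
# assuming scikit-learn and its iris dataset as a stand-in example.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)             # E: labeled experience
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)     # T: classify flower species
model.fit(X_train, y_train)                   # learning from experience E

P = accuracy_score(y_test, model.predict(X_test))  # P: performance measure
print(f"Accuracy (P) on the task (T) after experience (E): {P:.2f}")
```

Here no rule for separating species is ever written by hand; the model's accuracy (P) on the classification task (T) improves solely through exposure to labeled examples (E), which is precisely what the definition requires.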
Voice Recognition and Its Problems in Machine Learning
Speech recognition now pervades day-to-day human life, as the technology is built into computers, mobile phones, smart watches, and game consoles. It has even enabled the automation of machines and homes. Owing to this widespread application and continuing technological improvement, voice recognition as a sub-field of machine learning has been at the center of several studies in computer science. Its definition, applications, and related problems as a segment of machine learning are elaborated in this section.
Voice recognition, also commonly known as automatic speech recognition (ASR) or simply speech recognition, involves the conversion of spoken words into a series of written, readable words (Chigier & PureSpeech, Inc., 1997). Strictly speaking, speaker recognition denotes the identification of the speaker rather than of the information conveyed. Recognizing the speaker can be essential in speech translation systems, which involve algorithmically training the system on a particular individual's voice. The same technology can also be useful for authenticating or verifying an individual's identity as part of a security strategy. According to Roberto (2011), voice recognition in machine learning is a subset of computational linguistics that aims to develop technologies and methodologies enabling computers to recognize, translate, and transcribe spoken speech into text. Alternatively, Junqua and Haton (2005) described automatic speech recognition as the process of decoding the acoustic signals of the human voice into a series of linguistic units carrying the message conveyed by the speaker. It draws on knowledge and skills from computer science, linguistics, and engineering. Several voice recognition systems employ training, or enrollment, in which the speaker reads isolated vocabulary or text aloud into the system. The system then analyzes the individual's unique voice and fine-tunes itself to recognize that person's speech more accurately. Computerized voice recognition systems that do not employ training are referred to as speaker-independent systems, while those that do are referred to as speaker-dependent systems.
From a technological perspective, voice recognition has a history spanning several decades, with several waves of groundbreaking innovation. In recent years, the technology behind speech recognition has gained relevance owing to advances in big data and deep learning (Witten & Frank, 2011). These advances are evident in the worldwide industrial adoption of deep learning approaches in the design and deployment of voice recognition systems (Roberto, 2011). Key industrial players in the advancement of speech recognition systems include Microsoft, Google, IBM, Apple, Baidu, Amazon, SoundHound, iFlytek, and Nuance, among others. These participants have publicly stated that deep learning is at the center of their voice recognition technologies.
The applications of speech recognition technologies are far-reaching. These include voice user interfaces such as (i) call routing, (ii) voice dialing, (iii) voice search, (iv) robotic appliance control, (v) structured document preparation, (vi) voice data entry, such as credit-card-number entry, (vii) aircraft guidance (direct voice input), and (viii) speech-to-text processing. In the car manufacturing industry, speech recognition has been essential for in-built car systems, including audio prompts, automated phone calls, radio operation, and music playback (Chigier & PureSpeech, Inc., 1997). Different car manufacturers implement speech recognition input systems that offer natural language voice recognition, alongside fixed commands issued through common phrases and complete sentences. In health care, speech recognition is now used for medical documentation and therapeutic support. Decades ago, the military and aircraft manufacturers were already using voice recognition technology in the control of high-performance fighter jets and helicopters and in the training of air-traffic controllers. Moreover, speech recognition techniques are currently used to ease the lives of people with disabilities, especially the deaf and hard of hearing. These individuals can use voice recognition software for the automatic generation of captions during conferences, religious services, and classroom lectures.
Regardless of these many applications and groundbreaking advances in voice recognition technologies, there are still challenges and problems associated with speech recognition in machine learning. One of the most challenging problems is that different people have different voices and speak at varying speeds. For instance, when saying the word 'hello', one individual may say it very quickly while another may prolong it as 'heellloooo!' Such longer pronunciations produce longer sound files, which in turn contain more data. In speech recognition, a spoken word or statement with the same meaning should be recognized uniformly, as exactly one text. However, automatically aligning audio files of the same word or statement that differ in length still proves a great challenge to voice recognition technology; one classical approach to this alignment problem is sketched below.
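The essay does not prescribe a specific technique, but dynamic time warping (DTW) is one classical method for comparing two utterances of the same word spoken at different speeds. The sketch below uses simplified one-dimensional feature sequences rather than real audio, purely for illustration.

```python
# A sketch of dynamic time warping (DTW), a classical technique for
# aligning two renditions of the same word spoken at different speeds.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Align feature sequences a (length n) and b (length m)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])           # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],     # stretch a
                                 cost[i, j - 1],     # stretch b
                                 cost[i - 1, j - 1]) # match frames
    return cost[n, m]

# A quick "hello" vs. a drawn-out "heellloooo": same overall shape,
# different lengths. The feature values here are made up for the demo.
short = np.array([1.0, 3.0, 4.0, 2.0])
long_ = np.array([1.0, 1.1, 3.0, 3.1, 4.0, 4.1, 2.0, 2.0])
print(dtw_distance(short, long_))   # small despite the length mismatch
```

Because DTW is allowed to stretch either sequence, the two renditions score as nearly identical even though one sound file is twice as long, which is exactly the property the alignment problem above demands.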
Another problem encountered in voice recognition technologies is the difficulty computerized devices have in interpreting body language. Unlike machines, human speakers do not only speak but also send bodily signals, such as eye movements, hand waving, and posture. These signals pose a challenge, since machines are unable to recognize or identify any form of body language (Chigier & PureSpeech, Inc., 1997). Moreover, human recognition, interpretation, and comprehension of speech involve more than merely listening with the ears. Humans draw on knowledge of the topic and of the speaker's message, such as an understanding of redundancy and grammatical structure. Machines, on the other hand, do not possess this ability to comprehend speech through knowledge about the speaker and the message conveyed, which poses a great problem for advances in speech recognition technology as part of machine learning (Witten & Frank, 2011). A further problem is the difficulty machines have in comprehending continuous speech: a machine cannot readily distinguish word or sentence boundaries, such as pauses and punctuation, and this remains a challenge for speech recognition.
How Voice Recognition Works in Machine Learning (Systems Using RNNs)
Voice recognition engines, or software, incorporate a built-in speech recognizer, which accepts or captures an audio stream as input and converts it into transcribed text. The speech recognition process has both a front end and a back end. At the front end, the audio stream is processed by isolating the sound segments that are part of the actual speech. This end of the recognizer converts the input audio stream into a sequence of numerical values that characterize the signal's vocal sounds (Chigier & PureSpeech, Inc., 1997). The back end, on the other hand, encompasses a specialized search engine that accepts the output of the front end and searches across its databases, commonly three: the lexicon, the acoustic model, and the language model.
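The front-end step can be sketched concretely. Mel-frequency cepstral coefficients (MFCCs) are one common choice of numerical characterization, though the essay itself names no specific feature; the librosa library and the file name "speech.wav" below are assumptions for illustration only.

```python
# A hedged sketch of the front end: turning an audio stream into a
# sequence of numerical feature vectors, here MFCCs via librosa.
import librosa

# "speech.wav" is a hypothetical input file used only for this demo.
signal, sample_rate = librosa.load("speech.wav", sr=16000)

# 13 MFCC coefficients per short frame: each column is the numerical
# characterization of one slice of the signal's vocal sound.
mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)
print(mfcc.shape)   # (13, number_of_frames) -- one vector per frame
```

The resulting matrix of frame-by-frame vectors is the "sequence of numerical values" that the back-end search engine then consumes.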
To elaborate on the three databases mentioned above: the language model is a database representing the manner in which words of the language are combined in spoken speech. Secondly, the lexicon database lists a large number of the recognized words of the language and provides information on how such words are best pronounced. The third database is the acoustic model, which represents the acoustic sounds of a language. This component can be trained to recognize the features of a particular individual's speech patterns, as well as the characteristics of the acoustic environment. As Alpaydin (2014) reveals, for any particular sound segment there are presumed to be several things the speaker could be saying, and hence several things the speech could potentially mean. Therefore, the recognizer's quality is determined by how well it can refine the search process, selecting the most likely and most appropriate matches while eliminating the least probable and inappropriate ones (Waibel, Hanazawa, Hinton, & Lang, 2009). Performance in this role depends on the quality of the recognizer's language and acoustic models, as well as on its algorithmic effectiveness, both in processing the sound and in searching across the entire model; a toy version of this search is sketched below.
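The back-end search described above amounts to picking the word sequence W that maximizes P(A|W)P(W), combining the acoustic model with the language model. The toy sketch below uses tiny hard-coded probability tables as hypothetical stand-ins for those databases; the candidate phrases are the classic "recognize speech" confusion pair, not anything from the essay.

```python
# A toy back-end search: score candidate transcriptions by combining
# acoustic-model and language-model probabilities,
# W* = argmax_W P(audio | W) * P(W).
import math

candidates = ["recognize speech", "wreck a nice beach"]

acoustic_score = {        # P(audio | words), from the acoustic model
    "recognize speech": 0.40,
    "wreck a nice beach": 0.45,   # acoustically almost as plausible
}
language_score = {        # P(words), from the language model
    "recognize speech": 0.010,
    "wreck a nice beach": 0.0001, # but far less likely as English
}

def log_score(words: str) -> float:
    return math.log(acoustic_score[words]) + math.log(language_score[words])

best = max(candidates, key=log_score)
print(best)   # -> "recognize speech"
```

This illustrates why the language model matters: the acoustics alone slightly favor the wrong phrase, and only the prior over word sequences lets the recognizer eliminate the improbable match.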
While the recognizer's built-in language model is programmed to characterize a very comprehensive domain of the language, such as commonly spoken English, a speech recognition application typically needs to process only a particular group of utterances with specific semantic meanings. Rather than employing general-purpose language models, many speech recognition applications are therefore programmed to use tailored, application-specific grammars that constrain the search to the limited set of utterances the application actually expects.
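Since the section heading references RNN-based systems, the following is a minimal, hypothetical PyTorch sketch of an RNN acoustic model trained with CTC loss, which allows variable-length audio to map to shorter text without frame-level alignment. All sizes, the random tensors, and the class name are illustrative assumptions, not the essay's own system.

```python
# A minimal RNN speech recognizer sketch: a bidirectional LSTM over
# acoustic feature frames, trained with CTC loss. Illustrative only.
import torch
import torch.nn as nn

class RNNRecognizer(nn.Module):
    def __init__(self, n_features=13, n_hidden=64, n_symbols=29):
        super().__init__()
        self.rnn = nn.LSTM(n_features, n_hidden, batch_first=True,
                           bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_symbols)  # chars + CTC blank

    def forward(self, x):                  # x: (batch, frames, features)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)

model = RNNRecognizer()
ctc = nn.CTCLoss(blank=0)

frames = torch.randn(2, 100, 13)            # 2 dummy utterances of MFCCs
log_probs = model(frames).permute(1, 0, 2)  # CTC wants (frames, batch, symbols)
targets = torch.randint(1, 29, (2, 20))     # dummy character labels
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
loss.backward()                             # one illustrative training step
print(float(loss))
```

The CTC objective is what resolves the alignment difficulty discussed earlier: 100 audio frames can map to a 20-character transcript without anyone specifying which frame belongs to which character.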