Chapter two: Background Review
To describe the text-speech corpora tool, it is first necessary to establish the significance of corpora tools and the domains in which they are used. Each domain plays a distinct role in the various speech technologies, and several types of speech corpora have been implemented accordingly. Although this chapter is devoted to demonstrating and investigating previous work related to this study, it also describes the factors by which speech data are classified according to the method of data collection.
This chapter is organised into several sections. The first section explains the meaning of the term corpora and defines related expressions used in this research, such as text corpora and speech corpora; it then discusses the structure of the text and speech corpora used in previous tools in speech technology and linguistic study. The second section discusses the classification of speech data and the factors that influence its collection. The third section explains the types of speech recorded during data collection. A further section focuses on methods of collecting data and the tools that have been implemented for this purpose. The penultimate sections point out the importance of text-speech corpora tools and identify the fields and technologies in which they have been used: for instance, according to (reference), in speech technology they are used to create acoustic models, while in linguistics they are used for transcription, phonetics, and conversation analysis. Some significant tools implemented previously are also reviewed. Finally, a closing section summarises the evaluations and findings of the chapter.
2.1 Definition and Terminology
This section discusses the word 'corpora' and related expressions such as 'text corpus' and 'speech corpus', since both are critical to the study of speech recognition and of linguistics in a particular language.
The word 'corpus' is the singular of 'corpora'. It is a Latin word meaning 'body'. 'Corpus' has several meanings, which change with the field in which it is used. According to Farlex, Inc. (2004), in literary and critical terminology it means "a collection or body of writings"; in linguistics it means "the body of data"; in the life sciences and anatomy it denotes the main part of an organ or structure; and in economics, accounting, and finance it indicates "a capital or principal sum, as contrasted with a derived income". The word is also used for several other purposes.
The word 'corpora' has many different scientific definitions, but all of them carry nearly the same meaning. According to Crystal (1992), the best-known definition in linguistics is "a collection of linguistic data, either compiled as written texts or as a transcription of recorded speech". The basic aim of a corpus is to test a hypothesis about language, for instance, to determine how the usage of a specific sound, word, or syntactic construction varies (McArthur, 1992).
2.1.1 Text Corpora
The meaning of the word 'text' is clearer: it denotes words, phrases, sentences, and paragraphs, and most linguists agree on this meaning. There is no sharp division between 'speech corpus' and 'text corpus', because the two complement each other. According to freetechexams.com (2005), "text corpus is the technique which is used in linguistics and mainly it is used for the purpose of referring to the texts which had been stored and processed with the help of some electronically held". A text corpus is thus a large and structured set of texts, usually stored and processed electronically. Both text and speech corpora support statistical analysis and hypothesis testing, and are useful for checking occurrences or validating linguistic rules within a particular universe of language. A corpus is also characterised by the number of languages of which its texts are composed: a corpus containing a single language is known as a 'monolingual corpus' (Bowker, 1998), whereas a corpus containing more than one language is called a 'multilingual corpus'. Figure 2.1 shows the types of corpus according to the number of languages.
Figure 2.1: Types of text corpus (adapted from Bowker, 1998)
Open questions remain in corpus design concerning how to select the texts used in a text-corpora tool and the purpose the texts serve within it. Some authors suggest that lexicographers, linguists, and researchers should select the texts themselves or obtain advice from English teachers, provided that the corpus developer or designer creates a process for presenting the text randomly (Aijmer and Altenberg, 1991). For instance, the recording tool at http://www.voxforge.org/home/read is a good example of a tool that determines the text to be read by participants. Indicating the method of text selection is crucial, because a speech-text corpus depends entirely on the chosen texts for pairing with the recorded sound in speech technology and linguistics. There is no doubt that every language has a particular structure, and each text is written according to the domain in which it was published; texts may be drawn from journals, magazines, books, leaflets (booklets), and letters (Stubbes, 1996). Every language's texts have their own structure, which varies with the situation in which the text was written (Douglas, 1989). For instance, the structure of an English text is shown in Figure 2.2.
Figure 2.2: The Structure of English Text Corpus
(Adapted from Holmes-Higgin et al., 2004)
A text is either fiction or non-fiction, but each of these types has further attributes, such as imaginative, educational (informative), inventive, instructional, or persuasive. Each attribute affects the results preferred for use in the tools, and these parameters show how the text structure of the corpus is composed. For example, the Lancaster-Oslo/Bergen (LOB) Corpus was classified into informative and imaginative texts. Figure 2.3 shows the structure of the LOB corpus (Holmes-Higgin, Sibte, Abidi et al., 2004).
Figure 2.3: The structure of the LOB Corpus (taken from Holmes-Higgin et al., 2004)
Beyond the significance of textual context in corpora tools, some fine-grained features also influence the type of tool and the researchers' results: for instance, the gender and age of the author, the period and location of the text's publication, the language variety, and so on (Biber, 1989: pp. 3-43).
2.1.2 Speech-Corpus
A speech corpus is also called a 'spoken corpus'; like the word 'corpus', it has several definitions. A speech corpus is a collection of speech preserved as captured audio. Such collections are helpful for performing linguistic studies and for developing speech software (wiseGEEK, 2003). Furthermore, Gibbon (1997) states that "any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organizations" is known as a spoken corpus. The term is also defined by Crystal (1991) as "a collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language". These definitions show that a speech corpus contains a database collecting speech audio files and text transcriptions; in many other references the definition is very broad. Generally, speech corpora are used in two crucial ways: first, they support speech recognition in speech technology; second, in linguistics, spoken corpora are used to study phonetics, conversation analysis, dialectology, and related fields (McEnery and Wilson, 1996).
2.2 Data Speech Classification
The previous section confirmed that text-speech corpora comprise both speech and text corpora, whose databases store text and captured audio. There are several techniques for collecting the data used to create corpora, and the type of corpus is classified according to the system used to accumulate the speech data and related information.
According to Vianaz et al. (1998), speech data collection is divided into several types, but this classification is generally shaped by factors relating to the method by which the data were collected. This section investigates some of the dimensions that affect data collection.
2.2.1 Data Speech Clarity or Visibility
Data clarity and visibility are undoubtedly significant considerations for collecting any type of data. In general, data clarity is a systematic approach to continuously improving and maintaining the health of data. This model renews the focus on data quality by going beyond commonly assumed approaches, into a structured synthesis of modules that builds a strong foundation for keeping data at acceptable quality levels. Data clarity thus concentrates on data quality, ensuring a high level of accuracy and integrity in order to enable meaningful data analysis (Cognizant, 2009).
Normally, the aspect of data clarity or visibility is directly related to data policy: whether the recipient has permission to access the data or the data are kept secret. For instance, NeoSpeech, Inc. recognises that privacy is important. Its privacy policy applies to a participant's use of the NeoSpeech Text-To-Speech Web Service, subject to the web service's terms of use; by using the NS TTS web service, the participant allows NeoSpeech to collect and use his or her personal information (NeoSpeech, 2011).
2.2.2 Data Speech Environment
Data collection is also shaped by the atmosphere in which the data are recorded: for example, in a laboratory, a studio, or over the telephone. The speech may be recorded in an anechoic room, in the participant's own home or office, or in some other chosen location. The location certainly affects the speech data (Vianaz et al., 1998).
The use of standard speech corpora for development and evaluation is one of the major factors behind progress in automatic speech processing, particularly in speech and speaker recognition. Possibly the major advantage of using common corpora is that they allow researchers to compare the performance of dissimilar techniques on the same data, making it easier to identify which mechanisms are most promising to pursue (Campbell and Reynolds, 2003). For this reason, it is crucial to find a suitable acoustic environment in which to record speech data, and experimenters should determine the best possible environment for collection. One key advantage of preparing such an environment is being able to reproduce exactly the same sound when editing participants' recordings. The recorded sound thus varies with the technique used to collect the data: data collected in a laboratory or studio differ greatly from data collected by telephone or in an ordinary setting, such as an office or a room not fitted out for this purpose (I'd Rather Be Writing, 2010).
Consequently, the equipment used to capture the audio, such as a high-quality microphone or a portable recorder, reflects the range of different recording environments (Ma, Milner and Smith, 2006). According to the website I'd Rather Be Writing (2010), to obtain accurate sound data the experimenter should find and prepare a suitable environment, known as an acoustic room. Such a room has certain properties for recording audio: cloth panelling on the walls, isolation from other people, a lockable door, and no windows.
2.2.3 Data Speech Control
This dimension concerns the way the audio is recorded, or the interaction between participant and tool: for instance, speech collected at random, through spontaneous dialogue, in interviews, or as read speech recordings.
Random and spontaneous speech clearly plays a great role in speech corpora for speech recognition. Improving tools for speech recognition depends significantly on raising recognition performance for spontaneous speech; accordingly, large spontaneous speech corpora must be built for establishing acoustic and language models. Experimenters therefore focus on various approaches to spontaneous speech and on methods for collecting random and spontaneous speech. Because spontaneous speech exhibits specific phenomena, such as frequent pausing, repairs, hesitations, and repetitions, it requires various new techniques for recognition. One example is automatic summarisation, which in practice includes indexing and a process that extracts the vital and dependable parts of the automatic transcription (Furui, 2005). See Section 3.3.2 for more on spontaneous speech.
2.2.4 Data Speech Monitoring and Validation
Monitoring of speech data is implemented both online and offline. Online monitoring helps to control the data during recording and to modify technical and phonetic characteristics while capture is in progress. Offline monitoring is performed after data collection, to check the data and separate the usable data from the useless (Vianaz et al., 1998).
Validation, on the other hand, is closely related to a posteriori evaluation of the recorded material. For instance, the speech of a read corpus may be recorded in a studio and monitored by someone outside the recording room who has audio contact with the speaker. This situation is very difficult to reproduce in a software system, because a software system cannot be controlled by a human monitor and the recording time is not limited; the phonetic characteristics that can be controlled are likewise limited to cases in which the pronunciation cannot be naturally assumed. The main difference arises with read sentences, where self-monitoring takes place: the participant listens to the sentence that has been recorded and tries to repeat it if any problem has occurred. In telephone recording and dialogue, validation is used instead of monitoring (Russell, Corley, and Lickley, 2011).
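The offline validation pass described above can be sketched in code. This is a minimal illustration; the function name, checks, and default thresholds (minimum duration, clipping level) are assumptions for the example, not a prescribed standard:

```python
def validate_recording(samples, sample_rate, min_seconds=1.0, clip_level=0.99):
    """Offline validation: flag recordings that are too short or clipped.

    `samples` is a list of floats in [-1.0, 1.0]. Both checks and the
    default thresholds are illustrative, not drawn from any cited standard.
    """
    issues = []
    if len(samples) / sample_rate < min_seconds:
        issues.append("too_short")          # participant should re-record
    if any(abs(s) >= clip_level for s in samples):
        issues.append("clipped")            # microphone gain was too high
    return issues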
2.2.5 Data Speech Channel
Generally, recording channels for speech data are divided into two main types: a single channel uses one microphone to record the speech, while a multiple channel uses more than one microphone to capture it (Vianaz et al., 1998). An audio channel is thus a single path of audio data, and multi-channel audio is any audio that uses more than one channel simultaneously, allowing the transmission of more audio data than single-channel audio. Driver support can be a problem, because the standard sound interfaces of many operating systems were designed before multi-channel recording became common and allow for only up to two channels of recording (Microsoft, 2012). Table 2.1 describes some of the corpora methods used for collecting data.
Table 2.1: Comparison of some corpora tools
(Adapted from Campbell and Reynolds, 2003)
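The single- versus multi-channel distinction above can be inspected programmatically. The sketch below uses Python's standard `wave` module; the file name, sample rate, and contents are illustrative:

```python
import wave

def describe_channels(path):
    """Report whether a WAV recording is single- or multi-channel."""
    with wave.open(path, "rb") as wav:
        n = wav.getnchannels()
    return "single channel" if n == 1 else "multiple channels (%d)" % n

# Write one second of 16 kHz, 16-bit mono silence, then inspect it.
with wave.open("demo.wav", "wb") as wav:
    wav.setnchannels(1)                    # one microphone -> one channel
    wav.setsampwidth(2)                    # 16-bit samples
    wav.setframerate(16000)                # a common rate for speech corpora
    wav.writeframes(b"\x00\x00" * 16000)
```

A corpus tool could run such a check on every submission to reject files whose channel count does not match the collection protocol.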
2.3 Speech Types
Typically, speech corpora are divided into two common types according to the method by which the data are obtained: the first is read speech, and the second is spontaneous speech (Helgason, 2010).
2.3.1 Read Speech
Read speech is used for collecting data in speech corpora. For this purpose, some speech corpora tools have been developed to present text for reading and to record the resulting speech. The aim of such tools is either to serve voice recognition or simply to be enjoyable to use. A read speech corpus includes professional, high-quality recordings of a speaker's voice (Alam et al., 2010).
This type of speech corpus includes:
- Book excerpts: based on a manuscript, usually a passage, quotation, or piece of text taken from books and journals.
- Broadcast news: read speech can also be collected from TV and radio broadcasts, in which case the text is not constructed particularly for the collection of data.
- Lists of words: some tools ask speakers to read lists of words chosen specifically for the experiment.
- Sequences of numbers: some tools ask the participant to read a sequence of numbers. This kind is usually used in telephone corpora tools.
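The number-sequence prompts used in telephone corpora can be generated automatically. This is a minimal sketch; the prompt length and the seeding mechanism are illustrative choices, not taken from any particular tool:

```python
import random

def digit_prompt(length=8, seed=None):
    """Generate a random digit-sequence prompt for a telephone-corpus session.

    Seeding makes the prompt reproducible, so the tool can log which
    sequence each participant was asked to read.
    """
    rng = random.Random(seed)
    return " ".join(str(rng.randint(0, 9)) for _ in range(length))
```

Because the prompt is generated by the tool, the reference transcription is known in advance, which simplifies later validation of the recording.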
2.3.2 Spontaneous Speech
Although speech in almost any situation is spontaneous, recognition of spontaneous speech is a domain that has only recently come within the range of automatic speech recognition. It is significant for the progress of speech recognition applications, because their progress depends crucially on raising recognition performance for spontaneous speech. Spontaneous speech should therefore be analysed and modelled using spontaneous speech databases, because spontaneous and read speech differ significantly (Furui et al., 2005).
Spontaneous speech, which may also be called non-careful speech, is currently a surging area of interest. It leads to a very different and more difficult understanding of how speech and communication work. Spontaneous speech often includes sequences with such strong reduction phenomena that one could never have predicted them, and one is rather surprised to see them when studying the spectrogram (Prins and Bastiaanse, 2004). Figure 2:4 shows spontaneous speech with multiple deletions, reduction of stops to fricatives, and changes to vowel qualities.
Figure 2:4: Spontaneous speech before and after deletion of stops
(Taken from Warner, 2009)
This type of speech corpus also includes (Richey, 2002):
- Dialogues and meetings: entirely free conversation between two or more speakers; in almost all cases this is impromptu speech (i.e. there is no manuscript).
- Narratives: speech from a single person narrating a story or an incident.
- Appointment tasks: speech between two or more people who are given individual schedules.
- Telephone conversation: no particular text is written or read, but the conversation can still be directed towards a specific subject.
For accuracy, linguists usually try to collect their data in a phonetics laboratory with high-quality recording equipment. It is, however, very hard to obtain the required number of participants there and to collect the actual samples necessary for the research.
For read speech corpora, on the other hand, the text is presented on paper or on a computer screen. As technology has progressed, most tools have come to use a web interface to present the text, which also helps the experimenter to collect data easily. For example, some previous tools, such as VoxForge (2006), Gruenstein (2009), and Schultz (2007), collect speech data remotely via web-based interfaces. In the VoxForge tool, the participant reads the text so that his or her speech can be recorded.
2.4 The Collection of Data Speech Corpora
Collecting and labelling spoken natural language can be time-consuming and expensive for the experimenter. Typically, recordings are made of subjects performing a task that involves speech, which must be transcribed later. This section presents some approaches to speech data collection online or via mobile phone, discusses the advantages and disadvantages of each collection technique, and considers some problems that arise for experimenters and participants during speech recording.
Linguists and researchers increasingly focus on the technique of data collection and on the type of data collected. Computers and electronic equipment are used to collect the data; as technology develops, corpora come to contain enormous amounts of stored data. Statistics (Durant & Smith, 2007) show that 87% of readers primarily depend on the Internet, which motivates developers to create online corpora tools for collecting data.
According to LumenVox (2011), building a speech corpus requires at least a thousand different speakers, split evenly between male and female. To ensure the greatest diversity of speaking styles, the speakers should represent a variety of ages, with the majority between twenty and forty years old. This shows that the conditions for creating a data corpus differ across speech technology, and each linguist has his or her own conditions for collecting data. The developer should therefore provide a tool capable of gathering personal information from each participant as well as recording his or her speech.
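The LumenVox guidelines quoted above can be expressed as a simple automated check over the speaker pool. The field names and the 10% gender-skew tolerance below are assumptions made for the illustration:

```python
def check_balance(speakers, min_total=1000, max_gender_skew=0.1):
    """Check a speaker pool against the guidelines quoted in the text:
    enough speakers, genders split roughly evenly, majority aged 20-40.

    `speakers` is a list of dicts with "gender" and "age" keys
    (an illustrative schema, not any tool's actual format).
    """
    total = len(speakers)
    males = sum(1 for s in speakers if s["gender"] == "male")
    in_range = sum(1 for s in speakers if 20 <= s["age"] <= 40)
    return {
        "total_ok": total >= min_total,
        "gender_ok": total > 0 and abs(males / total - 0.5) <= max_gender_skew,
        "majority_20_40": in_range > total / 2,
    }
```

Running such a check as recordings arrive lets the experimenter see which demographic groups are still under-represented before closing the collection.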
In linguistics and lexicography, moreover, a corpus is a body of texts, utterances, or other specimens considered more or less representative of a language, usually stored as an electronic database. Currently, computer corpora may store many millions of running words, whose features can be analysed by means of tagging (McArthur, 1992).
It is crucial to use computers and electronic equipment to recognise speech and determine phonemes when studying a language, since they offer a variety of benefits for handling text and speech. Researchers can manipulate data easily, rapidly searching and sorting it; process data accurately, consistently, reliably, and without human bias; and annotate data automatically. Electronic corpora tools are thus useful for collecting examples for linguists, as data resources for lexicographers, and as training material for natural language processing (NLP) applications (Dickinson, 2008).
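The searching and counting benefits described above are easy to demonstrate on a tiny text corpus. A minimal sketch, assuming naive whitespace tokenisation:

```python
from collections import Counter

def word_frequencies(texts):
    """Count word occurrences across a small text corpus.

    Tokenisation here is deliberately naive (lower-case, split on
    whitespace); real corpora tools would use a proper tokeniser.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

corpus = ["The cat sat on the mat", "The dog sat down"]
freq = word_frequencies(corpus)
```

The same pattern scales to millions of running words, which is exactly the kind of consistent, bias-free counting that manual corpus work cannot provide.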
On the other hand, data collection presents some drawbacks and problems for linguists. All captured audio must be kept, and, to support functional analysis of the speech data, as much situational information as possible must be retained; a detailed log is therefore required for every recording. Moreover, the sociolinguistic interview does not supply a perfect picture of natural speech interaction, although some researchers argue that interviews can accurately represent talk in action (Labov, 2008). Some researchers have made recordings of interviews for comparison, which show that, in all the data gathered, it is not easy to discover instances of some intuitively very frequent linguistic facts.
2.5 What is a Corpora Tool?
Speech corpora tools are significant because they allow linguists to collect data effortlessly. Speech corpora depend on these tools, which are generally used to create a speech corpus from audio recordings or text-based transcriptions; recordings may be made with sound storage technologies and stored to create a corpus. The previous sections discussed text and speech corpora and showed that they complement each other. Experts agree that speech corpora serve two significant fields: first, in speech technology, they are used, among other things, to build acoustic models; second, in linguistics, spoken corpora assist research on phonetics, conversation analysis, and dialectology, as well as the translation and transcription of auditory speech (VoxForge, 2006).
Clearly, to construct large speech corpora with linguistic and signal annotations, corpus builders use annotation recorder tools that reduce slow and time-consuming tasks. These tools help constructors produce large, linguistically annotated corpora and apply a set of functions accurately, for instance "linguistic processing and signal processing, grapheme-to-phoneme conversion, automatic phonetic alignment, and even language model generation". Earlier speech tools, however, addressed only the phonetic and signal levels, and later ones were built for just a single type of application (reference).
2.6 The Review of Previous Corpora Tools
The term 'corpus' appeared in linguistics near the beginning of the 1980s (Leech and Fligelstone, 1992). The history of the corpus goes back to the pre-Chomskyan period, when it was used by linguists such as Boas (1940) and linguists of the structuralist tradition, including Newman, Bloomfield, and Pike (Biber and Finegan, 1991). Although linguists then used shoeboxes filled with paper slips rather than computers, these 'corpora' were nevertheless simple collections of written or transcribed texts. McEnery and Wilson (2001) note that the fundamental corpus methodology appeared commonly in the early twentieth century.
With developments in technology, and especially the growth of ever more powerful computers offering increasing processing power and enormous storage at comparatively low cost, the exploitation of huge corpora became practical. Corpus-based studies have consequently increased dramatically, and many projects have been implemented and developed to collect data for massive corpora.
Each project has a particular method of collecting and analysing data. The main purpose of developing such tools is to help experimenters collect natural data and obtain it more easily; as discussed before, without developed technology it is hard for linguists or experimenters to collect large amounts of data. As technology has developed, corpus tools have multiplied, and the system for collecting data differs according to the tool and the chosen collection method. The tools collect and analyse data with various techniques. For instance, the CHILDES Project, developed by Brian MacWhinney at Carnegie Mellon University, is a tool for analysing talk; it was developed to exchange language data immediately, to make the data available to anyone who has produced or analysed it, and to support transcription. The tool can automate the process of data analysis, improve data through a consistent, fully documented transcription system, and supply more data for more children across a wider range of ages and languages (MacWhinney, 2012).
The present work focuses on projects that run on a server, since participants can access such a project easily and it is not hard for the researcher to obtain the data. Programmers develop the software according to the experimenter's requirements, and users can participate from anywhere in the world, helping the researcher to obtain a massive sample in a short time and without great cost. A client-server tool also collects all participant information and stores it in a database, because experimenters handle each recording according to the associated personal data.
For instance, VoxForge was established to collect transcribed speech for use with free and open-source speech recognition engines. The tool runs on a server and collects audio together with participant information, and it can be extended to accumulate several languages, such as English, French, Bulgarian, and Dutch. It generates text for the user to read at random, then plays back the recorded audio so that the user can confirm the recording before submitting it. An important feature of the tool is that it collects information about the user, for example gender, age range, pronunciation dialect, and microphone type. The user must select the microphone type from a drop-down box (as shown in Figure 2:5), but it is difficult for participants to distinguish types such as 'USB Desktop Boom microphone'; the developer should therefore guide the user more specifically in identifying them.
Figure 2:5: The drop-down box showing microphone types in VoxForge
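The participant information collected alongside each submission can be modelled as a simple record. The field names and values below are illustrative, not VoxForge's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ParticipantRecord:
    """Metadata a VoxForge-style tool might store with each submission.

    This schema is a hypothetical sketch for illustration.
    """
    gender: str          # e.g. "female"
    age_range: str       # e.g. "adult (20-40)"
    dialect: str         # pronunciation dialect, e.g. "American English"
    microphone: str      # free-text microphone type from the drop-down box
    audio_file: str      # path to the submitted recording

rec = ParticipantRecord("female", "adult (20-40)", "American English",
                        "USB headset", "submission_001.wav")
```

Storing the metadata next to each audio file lets the experimenter later filter recordings, for example by dialect or by microphone type.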
The Speech Resources Consortium (NII-SRC) was likewise created to collect and distribute speech data for research on speech recognition and speech synthesis. For the NII system to cope with the speech of different people in different locations and situations, the project must collect as many samples as possible, varied across different aspects of the data such as environment, age, and language. The NII system shows that speech corpora are used in speech technology, language education, and linguistics. All data used in the experiment depend on the collection of data, as explained in Figure 2:4 (Itahashi, 2005).
Figure 2:4: Several different aspects of the data
In general, most previous tools were established for a single type of application domain. As technology progressed, several tools were developed to collect and analyse data, but few of them run on a client-server architecture and were developed according to an expert's request. Each tool was built by a particular developer, organisation, or university. Another aspect on which tools can be compared is natural language processing, a field of computer science, machine learning, and linguistics concerned with interaction between computers and human languages (Charniak, 1985). Table 2:1 shows some previous tools and their developers, together with the method of data collection and the domains covered.
| Tool Name | Developer | Collecting data | Domain covered |
| --- | --- | --- | --- |
| VoxForge | VoxForge organization | Client-server; the user can participate from any location, reads sentences, records his/her speech, and then submits it | A Subversion repository containing VoxForge speech audio files, acoustic models, and scripts |
| The Speech Accent Archive (Weinberger, 2012) | Steven Weinberger (George Mason University) | Website; the user uploads the audio and then submits it; the data include user information and the previously recorded audio file | Native and non-native speakers of English read the same paragraph and are carefully transcribed; the archive is used by people who wish to compare and analyse the accents of different English speakers |
| A Self-Labeling Speech Corpus (McGraw, Gruenstein and Sutherland, 2009) | Ian McGraw, Alexander Gruenstein and Andrew Sutherland | Collecting and transcribing speech data using online educational games | Using AMT (Amazon Mechanical Turk) majority labels as a reference transcription |
| Gruenstein, 2009 | Alexander Gruenstein | Spontaneous speech is collected via a web-based memory game | Partially annotating and gathering spontaneous speech |
| NeoSpeech (NeoSpeech, 2011) | NeoSpeech | TTS Web Service; the server automatically records information and speech data | Speech-enabled solutions based on a suite of core capabilities in speech recognition, speech synthesis, speaker verification, and voice animation |

Table 2:1
As discussed above, a web system for the collection of text-sourced speech corpora has been provided by the VoxForge speech group: speech recordings for the main European languages, with speakers recording audio directly on the website. As described in Table 2:1, Gruenstein collected spontaneous speech through a web-based memory game. These two works can thus be compared as efforts devoted to collecting data for major languages.
2.7 Summary
This chapter discussed several different aspects of corpora and focused on works that have already been implemented. The literature reviewed concentrated mainly on what would be beneficial for providing the best possible tool in the future. The beginning of the chapter introduced the topic to the reader along with the advantages of the subject, and identified the domains in which such tools are used.
Moreover, it discussed the significant to collecting data classification which led to help the experiments to investigate the speech sound in speech technology and linguistics. As a result, several points were concluded to collect the audio sound by the best tools and different environment. It determined the method to collect the data as well as changing according to the investigation. At the end of the chapter, focusing on some best and worst situation of some works that developed before in that the developer would repeat the best points in the future works and amend poor state. Ultimately, this review was showed that developing text-speech corpora has the great role in the technology and language pronunciation. These tools were able to collect and thousand hours of speech as well as analyzing the collection of data speech.
Chapter three: Technical Background Review
The previous chapter discussed how data are collected and the types of tools that have been developed to create speech corpora. This chapter discusses the technical methods used to develop such tools. For this reason it evaluates the applications and programming languages used to build tools that provide data via an online server. The chapter includes several sections; the first technically discusses and evaluates two previous tools for collecting speech corpora. ….
3.1 The technical corpora tools overview
Several tools have been developed to gather and annotate corpus data, but each has been provided as a different program and written in a different programming language. This chapter concentrates on the tools that run in a client-server setting. For this purpose it is necessary to examine the techniques used to develop these tools: the programming language, the type of database used to save the data, and the type of server used to run the system.
A significant example is VoxForge, discussed in the earlier chapter, which collects speech data via web-based interfaces on a client-server architecture. It was established by Ken MacLean, the creator, maintainer and administrator of the VoxForge.org website. The website collects data and creates a repository of transcribed speech audio files, acoustic models and scripts for use with open-source speech recognition software; that is, VoxForge was built to gather transcribed data for use in open-source speech recognition engines. Users can submit their speech audio recordings to VoxForge for the creation of GPL speech corpora and acoustic models. GPL stands for GNU General Public License, a free software licence: the licences for most software and other practical works are designed to take away the freedom to share and change the works, whereas the GNU General Public License is intended to guarantee the freedom to share and change all versions of a program and to make sure it remains free software for all its users (Smith, 2007). Microsoft Research (2012) defines an acoustic model as “the process of establishing statistical representations for the feature vector sequences computed from the speech waveform”. The acoustic model assigns each of these statistical representations a label called a phoneme. The English language contains about forty distinct sounds, and thus forty different phonemes, which are useful and helpful for the speech recognition process in speech technology. The VoxForge.org system uses a Java applet to record speech (Figure 3:1 shows the Java applet in VoxForge).
To record speech and collect the participants' information, the developer provided the system with a Java applet, because an applet is easily integrated with a web server and can be embedded into any web environment to deliver high performance without a client installation. In addition, it is very easy to integrate with HTML files, since a developer can embed it in a page with a few lines of code. As a result, many developers depend on applets to establish such systems (Oracle, 2012).
Figure 3:1 The Java applet in VoxForge
The Speech Accent Archive, another web-based example tool discussed in the previous chapter, was developed by Steven Weinberger at George Mason University. This website is implemented in the PHP programming language. The Speech Accent Archive was also established to collect speech corpus data and transcribe it, but it differs from VoxForge when the two are compared: VoxForge collects data online through a Java applet, whereas the Speech Accent Archive uses a program called PolderbitS Sound Recorder, shown in Figure 3:2. The participant must download this program to record his or her speech, and after the recording is complete, the audio capture should be uploaded via the website. There are a number of further differences between the two systems; for instance, VoxForge gives the participant a random text, without letting the participant choose a specific text or know anything in advance about its type, as shown in Figure 3:1, whereas the Speech Accent Archive obliges the user to download the particular paragraph to be read by every participant.
Figure 3:2 The PolderbitS Sound Recorder interface for recording audio
(Weinberger, 2012)
3.2 The Programming Language for implementing corpora tool
To establish a speech corpus, speech must be collected together with the participants' information, so the tool needs to be able to record audio. Several programming languages can be used to build a system with this capability, for example Java, C, C++, C#, and MATLAB. Since most of the web-based tools discussed here were implemented in the Java programming language, this research focuses on Java and on how audio is recorded with it.
3.2.1 Recording sound in Java programming Language
To capture sound in Java, the programmer needs to be familiar with the methods on which sound capture depends. To begin with, the developer needs to understand the nature of the sound being recorded. Solution Inc (2009) argued that “From a human perspective, sound is the sensation that we experience when pressure waves impinge upon the small parts contained within our ears”, and that sound is the normal outcome of pressure waves transmitted through air. However, sound pressure waves are not limited to air; for example, when someone swims underwater, the water can carry sound pressure waves to the ear (Baldwin, 2003).
From the point of view of the Java Sound API, the word “sound” takes on a somewhat different meaning. Nonetheless, it is fair to say that the purpose of the Sound API is to help a developer write programs that cause sound pressure waves to impinge upon ears at particular times (Solution Inc, 2009).
Baldwin (2003), quoting Sun, describes the Java Sound API as “a low-level API for effecting and controlling input and output of audio media. It provides explicit control over the capabilities commonly required for audio input and output in a framework that promotes extensibility and flexibility.”
The Java Sound API supplies the lowest level of audio support on the Java platform. It gives application programs a high level of control over specific audio functionality, and it is extensible: for example, it provides mechanisms for installing, accessing and operating system resources such as audio mixers, MIDI devices, file readers and writers, and sound format converters. On the other hand, it does not include advanced sound editors or GUI tools, but it provides a set of capabilities upon which such applications can be built. It concentrates on low-level control beyond that normally expected by the end user, so users benefit from higher-level interfaces built on top of Java Sound (Oprea, 2005). There are several quite complex issues involved in using the Sound API; therefore the following parts of this section briefly introduce those issues that relate to capturing audio in Java.
i. Packages
To begin with, the program should import the particular Java packages used for sound. According to Baldwin (2003), two significantly different types of audio (or sound) data are covered by the API:
First: Sampled audio data
Sampled audio data consists of a series of digital values which represent the intensity of sound pressure waves over time. For instance, the graph in Figure 3:3 might represent a set of sampled audio data generated by a wide-band noise generator, such as the noise at an airport.
Figure 3:3 Sampled audio data
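To make the idea of sampled audio data concrete, the short Java sketch below digitizes one cycle of a pure tone: each array element is a snapshot of the pressure wave's amplitude at one sampling instant. The 8 kHz rate and 1 kHz tone are illustrative values chosen for this sketch, not taken from any of the tools discussed.

```java
public class SamplingSketch {
    // One cycle of a sine wave digitized at the given rate: each value
    // is a snapshot of the pressure wave's amplitude at that instant.
    static double[] sampleOneCycle(int sampleRate, int freqHz) {
        double[] samples = new double[sampleRate / freqHz];
        for (int n = 0; n < samples.length; n++) {
            samples[n] = Math.sin(2 * Math.PI * freqHz * n / sampleRate);
        }
        return samples;
    }

    public static void main(String[] args) {
        // A 1 kHz tone sampled at 8 kHz yields eight samples per cycle.
        for (double s : sampleOneCycle(8000, 1000)) {
            System.out.printf("%+.3f%n", s);
        }
    }
}
```

The same series of values, stored as 8- or 16-bit integers, is what the sampled-audio packages below read from and write to audio devices.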
Thus, this type of audio data is supported by the following two main Java packages:
· javax.sound.sampled
· javax.sound.sampled.spi
These two packages specify interfaces for the capture, mixing, and playback of digital (sampled) audio.
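As a minimal illustration of javax.sound.sampled, the sketch below builds an AudioFormat describing the desired recording and a DataLine.Info describing the kind of line wanted, then asks the AudioSystem whether such a capture line is available. The 16 kHz, 16-bit mono format is an illustrative choice for speech recording, not a requirement of any tool discussed here.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

public class CaptureLineQuery {
    // 16 kHz, 16-bit, mono, signed, little-endian PCM -- values chosen
    // for illustration; speech corpora often use formats like this.
    static AudioFormat speechFormat() {
        return new AudioFormat(16000.0f, 16, 1, true, false);
    }

    public static void main(String[] args) {
        AudioFormat format = speechFormat();

        // DataLine.Info describes the line we want: a capture line
        // (TargetDataLine) delivering audio in the format above.
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

        // Ask whether the installed mixers can supply such a line.
        System.out.println("Capture supported: " + AudioSystem.isLineSupported(info));
    }
}
```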
Second: MIDI data
MIDI stands for Musical Instrument Digital Interface. MIDI data may produce conventional musical sounds or special sound effects. This kind of audio data is covered by the following two Java packages:
· javax.sound.midi
· javax.sound.midi.spi
These two packages supply interfaces for MIDI synthesis, sequencing, and event transport.
Moreover, to permit service providers to create custom components that can be installed on the system, the provider should depend on the spi packages.
In the processing of sampled audio data, another term, Digital Signal Processing (DSP), should be mentioned, because DSP techniques are often used when processing sampled audio in Java. DSP takes real-world signals such as audio, video, pressure, or position that have been digitized and then mathematically manipulates them. A DSP is designed to perform mathematical functions such as addition, subtraction, multiplication and division very quickly (Analogue Devices, 1995).
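A tiny example of the kind of arithmetic a DSP performs on digitized audio is the root-mean-square (RMS) amplitude of a block of samples, a common measure of loudness. The sketch below is illustrative only and is not part of any tool discussed in this chapter.

```java
public class RmsSketch {
    // Root-mean-square amplitude of a block of 16-bit samples: square
    // each sample, average the squares, then take the square root.
    static double rms(short[] samples) {
        double sum = 0.0;
        for (short s : samples) {
            sum += (double) s * s;
        }
        return Math.sqrt(sum / samples.length);
    }

    public static void main(String[] args) {
        System.out.println(rms(new short[]{0, 0, 0, 0}));          // silence
        System.out.println(rms(new short[]{1000, -1000, 1000, -1000}));
    }
}
```

A capture tool might use a measure like this to detect whether a participant actually spoke during a recording.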
ii. Mixers and Lines
The Java Sound API is based on the concepts of mixers and lines. According to astralsound (2003), “An audio mixer is a device that mixes two or more separate signals. Mixers range from a couple of variable resistors with knobs to the big and complicated-looking consoles used in the largest multi-performance events.” In effect, a mixer combines audio from multiple input lines into at least one output line. The former are often instances of classes that implement SourceDataLine, and the latter of TargetDataLine; Port objects can be either source lines or target lines (Pfisterer and Bomers, 2005).
Thus, a mixer is really a traffic manager: a signal is joined to an input, and the mixer directs it to one of several possible outputs. Some mixers implement several mixing stages, where inputs are mixed into sub-mixes, or groups, and the groups are then further mixed down to a stereo output.
Moreover, Petrauskas (2005) defined a line as “an element of the digital audio ‘pipeline,’ such as an audio input or output port, a mixer, or an audio data path into or out of a mixer. The audio data flowing through a line can be mono or multichannel (for example, stereo). … A line can have controls, such as gain, pan, and reverb.”
A simple audio input system uses these terms as shown in Figure 3:4, in which a Mixer object is assembled from one or more ports, some controls, and a TargetDataLine object.
Figure 3:4 an audio input system
Thus, data flows into the mixer from one or more input ports, usually the microphone or the line-in jack. Controls such as gain and pan are applied, and the mixer delivers the captured data to an application program via the mixer's target data line. A target data line is an output of the mixer, containing the mixture of the streamed input sounds. The simplest mixer has only one target data line, but some mixers can deliver captured data to multiple target data lines concurrently (Wang, 2001). The data provided by the TargetDataLine object can be pushed into some other program construct in real time, and its actual destination can be any of a variety of places, such as an audio file, a network connection, or a buffer in memory. TargetDataLine is a sub-interface of DataLine, which in turn is a sub-interface of Line; Line thus has several types, defined by sub-interfaces of the basic Line interface. The interface hierarchy is shown in Figure 3:5, which further explains the relation between DataLine and TargetDataLine.
Figure 3:5 The Line interface hierarchy
The Java Sound API does not assume a particular audio hardware configuration; it is designed to allow different kinds of audio components to be installed on a system and accessed through the API. The API supports common functionality such as input and output from a sound card (for example, for recording and playback of sound files) as well as the mixing of multiple streams of audio. Figure 3:6 shows an example of a typical audio architecture.
Figure 3:6 A Typical Audio Architecture
In the example shown in Figure 3:6, a device such as a sound card has various input and output ports, and mixing is supplied in software. The mixer might receive data that has been read from a file, streamed over a network, generated on the fly by an application program, or produced by a MIDI synthesizer. The mixer combines all its audio inputs into a single stream, which can be sent to an output device for rendering (Oracle, 1995).
To conclude, a TargetDataLine obtains audio data from a mixer. As mentioned above, a mixer usually captures the audio data from a port such as a microphone; it may process or mix the captured audio before placing the data in the target data line's buffer. The TargetDataLine interface supplies methods for reading the data from the target data line's buffer and for determining how much data is currently available for reading.
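The reading behaviour described above can be sketched as a minimal capture loop. This is an illustration, not any tool's actual code: the 16 kHz mono format, the one-fifth-second read buffer, and the three-second recording duration are all assumptions, and actually capturing audio requires a machine with a microphone.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;
import java.io.ByteArrayOutputStream;

public class CaptureSketch {
    // A read-buffer size covering roughly 1/fraction of a second of
    // audio, rounded down to a whole number of frames.
    static int bufferSize(int sampleRate, int frameSize, int fraction) {
        int size = sampleRate * frameSize / fraction;
        return size - (size % frameSize);
    }

    public static void main(String[] args) throws Exception {
        AudioFormat fmt = new AudioFormat(16000.0f, 16, 1, true, false);
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, fmt);
        if (!AudioSystem.isLineSupported(info)) {
            System.out.println("No capture line available on this machine.");
            return;
        }

        // Obtain, open and start the capture line (requires a microphone).
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
        line.open(fmt);
        line.start();

        // Read about three seconds of audio into memory, one buffer at a time.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[bufferSize(16000, fmt.getFrameSize(), 5)];
        long stopAt = System.currentTimeMillis() + 3000;
        while (System.currentTimeMillis() < stopAt) {
            int n = line.read(buf, 0, buf.length);   // blocks until data arrives
            if (n > 0) out.write(buf, 0, n);
        }
        line.stop();
        line.close();
        System.out.println("Captured " + out.size() + " bytes");
    }
}
```

The ByteArrayOutputStream here plays the role of the in-memory destination mentioned above; a real tool would instead write the bytes to a file or a network connection.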
Furthermore, a SourceDataLine receives audio data for playback, and must be available in every simple sound system. In contrast to TargetDataLine, it provides methods for writing data to the source data line's buffer for playback, and for determining how much data the line is ready to receive without blocking.
Ultimately, a Clip is a data line into which audio data can be loaded prior to playback. The clip's duration is known before playback, and users can choose any starting point in the media, because the data is loaded completely rather than streamed. Clips can also be looped, meaning that upon playback all the data between two specified loop positions will be repeated a particular number of times, or indefinitely (Oracle Java Technology, 1993).
iii. GUI
To control the Java Sound API, a GUI should be used. GUI stands for Graphical User Interface and is pronounced “gooey”. It refers to the graphical interface of a computer, which permits users to click and drag objects, press buttons, enter text into text boxes and so on, instead of entering text at a command line (Gladden, 2000).
Baldwin (2003) pointed out that in a simple Java Sound API program the GUI can be managed by three buttons, as shown in Figure 3:7: clicking the Capture button captures input data from a microphone and saves it in a ByteArrayOutputStream object; data capture stops when the user clicks the Stop button; and playback of the captured data begins when the Playback button is clicked.
Figure 3:7 Simple GUI example for controlling Java Sound API
3.2.2 Java Applet
The Java applet and the GUI are closely related, because a Java applet's interface is designed with GUI components. Many references have discussed and defined Java applets; there are some differences in expression, but all of them have the same goal and focus on applets as programs that run in a web browser and are written in the Java programming language.
Bishop (2006) identified Java applets as programs written in the Java programming language that can be embedded into web pages. A Java applet is a Java program capable of doing more complex tasks than a JavaScript; the applet still needs to be run in a web browser, but does not have full access to the machine in the way that a stand-alone Java program does.
Similarly, Oracle Corporation, on the Sun Developer Network (2010), indicated that “An applet is a program written in the Java programming language that can be included in an HTML page, much in the same way an image is included in a page. When you use a Java technology-enabled browser to view a page that contains an applet, the applet's code is transferred to your system and executed by the browser's Java Virtual Machine (JVM).”
These descriptions show that applets have their own advantages and disadvantages. A significant feature of Java applets is strong security, because Java addresses this issue by restricting applets to Java's execution environment and prohibiting access to system resources (Janalta Interactive Inc, 2012). Moreover, Java applets offer high performance and are cross-platform, being able to run on Windows, Mac OS and Linux and to work with all versions of the Java Plug-in. Most web browsers, such as Firefox, Internet Explorer, Google Chrome and Safari, support applets, and the user can also grant an applet full access to the machine if desired. Nonetheless, applets have some problems that experts have not yet been able to tackle. For example, running applets requires the Java Plug-in to be installed, and because an applet requires a JVM, it takes significant start-up time the first time it runs. It is also tricky for developers to design and build a nice user interface in applets compared with HTML technology (Rose India, 2007).
For instance, the Java applet interface depends on GUI components, with which it is difficult to design an attractive interface; this is clear in the VoxForge interface, which shows a lack of interface design.
A Java applet is very helpful for a web developer who wants to provide voice recording on the web and allow participants to capture their voice from a website. The applet compresses the voice and sends it to the web server via HTTP; moreover, the recorded voice can be played back from the server, either with an embedded voice-streaming player or with a separate voice-streaming applet. The functionality of a recording applet starts by capturing voice from the sound card; for example, the sampling frequency may be 8000 Hz, and the voice may be compressed to 4800 bps (about 36 KB per minute). The applet then uploads the voice file to the web server via HTTP, and some kind of server script — running under Apache, Apache Tomcat servlets, or similar — receives the voice file on the server. The captured sound can then be saved and played back from the web server. These general steps occur in every web application devoted to capturing speech with a Java applet (VIMAS Technologies, 2007).
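The last two of the steps above — wrapping the captured bytes in a file format and uploading them over HTTP — can be sketched as below. This is an illustrative sketch, not the VIMAS applet itself: the one second of silence stands in for a real recording, and the upload URL is an assumption supplied on the command line rather than a real server.

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class UploadSketch {
    // Wrap raw PCM bytes in a WAV container, entirely in memory.
    static byte[] toWav(byte[] pcm, float rate, int bits, int channels) {
        AudioFormat fmt = new AudioFormat(rate, bits, channels, true, false);
        AudioInputStream ais = new AudioInputStream(
                new ByteArrayInputStream(pcm), fmt, pcm.length / fmt.getFrameSize());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            AudioSystem.write(ais, AudioFileFormat.Type.WAVE, out);
        } catch (IOException e) {
            throw new RuntimeException(e);   // not expected for in-memory streams
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // One second of 8 kHz, 16-bit mono silence stands in for real audio.
        byte[] wav = toWav(new byte[8000 * 2], 8000.0f, 16, 1);
        System.out.println("WAV size: " + wav.length + " bytes");

        // The upload runs only when a server URL is supplied on the
        // command line, e.g. "java UploadSketch http://example.org/upload".
        if (args.length == 0) return;
        HttpURLConnection conn = (HttpURLConnection) new URL(args[0]).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "audio/wav");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(wav);
        }
        System.out.println("Server replied: " + conn.getResponseCode());
    }
}
```

On the server side, a script or servlet would read the request body and store the file, completing the workflow described above.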
3.3 The client-server architecture for running corpora tools
It is obvious that client-server technology is directly related to networking: it is used to share and distribute a computing system in which the tasks and computing power are split between servers and clients. Usually, the servers store and process data common to the clients across the organisation, and these data can be accessed by any user. Generally, a server also helps to run the systems and applications that have been developed for a specific purpose. In this kind of networking, requests are made by different clients to the server; the server then processes the request and provides the desired result to the client. The client-server architecture is multilateral, supports GUIs (such as Java applets) and has a modular infrastructure. The technology is illustrated as a technology f