JAR Library
Tutorial for BioTex Jar Library
Follow the instructions below
- Before using BioTex, you must install TreeTagger, a part-of-speech tagger that assigns a grammatical category to each word.
  Install TreeTagger for French, English and Spanish by following the instructions in: TreeTagger Installation (Unix or Windows)
- Download BioTex for Unix or Windows.
  You can use it for 3 languages (EN, FR, ES).
- After unzipping the file, you will find:
  - dataSetReference folder = contains three dictionaries (EN, FR, ES), which are used to automatically validate the extracted terms.
  - patterns folder = contains the linguistic patterns used to form the terms.
  - stopWords folder = contains the list of stop words used to filter the extracted terms.
  - JarBioTexExterne.jar file = the BioTex JAR library containing the measures used for term extraction.
  - Principal.java file = the main class, which calls the Java library and sets the parameters. For more details on the parameters, see the Tutorial for BioTex Web Application.
- OUTPUTS: the application produces 5 output files:
  - ALL_gram.txt = a complete list of all extracted terms.
  - t1gram.txt = a list of 1-gram terms (e.g., aspirin).
  - t2gram.txt = a list of 2-gram terms (e.g., clinical cases).
  - t3gram.txt = a list of 3-gram terms (e.g., extended surgical procedure).
  - t4gram.txt = a list of terms of 4 or more grams (e.g., early onset familial alzheimer disease).
- Each output file has 3 columns, separated by ";".
  - First column = the extracted term.
  - Second column = 1 if the term is already in a biomedical thesaurus, 0 otherwise.
  - Third column = the importance of the term in the dataset, i.e., the value given by the selected measure.
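
The three-column layout described above can be read back with a few lines of Java. The following is only a minimal sketch, not part of BioTex: the class `BioTexOutputReader` and its methods are hypothetical names, and it assumes that the files are UTF-8 encoded, have no header row, never contain ";" inside a term, and write the score with a decimal point.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not shipped with BioTex) for reading an output file.
public class BioTexOutputReader {

    /** One row of an output file: term;validated;score */
    public static final class TermEntry {
        public final String term;
        public final boolean inThesaurus;
        public final double score;

        TermEntry(String term, boolean inThesaurus, double score) {
            this.term = term;
            this.inThesaurus = inThesaurus;
            this.score = score;
        }
    }

    /** Parses a single line of the form "term;0|1;score". */
    public static TermEntry parseLine(String line) {
        String[] cols = line.split(";");
        if (cols.length != 3) {
            throw new IllegalArgumentException(
                    "Expected 3 columns separated by ';': " + line);
        }
        String term = cols[0].trim();
        boolean inThesaurus = cols[1].trim().equals("1");
        // Assumption: the score uses a decimal point (e.g. 3.25).
        double score = Double.parseDouble(cols[2].trim());
        return new TermEntry(term, inThesaurus, score);
    }

    /** Reads every non-empty line of an output file (assumed UTF-8). */
    public static List<TermEntry> readFile(Path file) throws IOException {
        List<TermEntry> entries = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (!line.trim().isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BioTexOutputReader <output-file>");
            return;
        }
        // Print only the terms already present in the reference thesaurus.
        for (TermEntry e : readFile(Paths.get(args[0]))) {
            if (e.inThesaurus) {
                System.out.println(e.term + "\t" + e.score);
            }
        }
    }
}
```

For example, `java BioTexOutputReader t2gram.txt` would list the validated 2-gram terms with their scores, provided the file matches the assumptions above.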
Web Application
Tutorial for BioTex Web Application
Follow the instructions below
- Set the number of linguistic patterns to use
  The default is 200 for English, French, and Spanish. You can change it: if you decrease the number, the precision of the extracted terms (the share that are true biomedical terms) will increase.
- Select the type of terms to extract
  If you choose All Terms, BioTex will extract both single-word terms and multi-word terms. If you choose Multi Terms, BioTex will extract only multi-word terms.
  For example:
  - Aspirin (single-word term)
  - extended-release capsules (multi-word term)
  - attention deficit hyperactivity disorder (multi-word term)
- Choose the measure to rank the candidate terms
  You can select any measure. The only difference is that the L-value and C-value measures work on a single large document, whereas LIDF-value and the other measures work only on a set of documents.
- Select whether your corpus is a single document or a set of documents
  If you want to use LIDF-value, the AKE measures or the new combined measures, you have to put the set of documents into a single file, separating them with ##########END##########
  Example of a set of documents in a single file:
text of document 1
text of document 1
...
text of document 1
##########END##########
text of document 2
text of document 2
...
text of document 2
##########END##########
- Choose your text file and the language
  Choose your text file and the language of your text (English, French or Spanish).
- Click the button "Extract Terms"
  When you click the "Extract Terms" button, the application returns the biomedical candidate terms.
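
For the measures that need a set of documents, the single-file format with the ##########END########## separator can be produced programmatically. The following is a minimal sketch, assuming that each document is already available as a string and that every document, including the last one, is followed by the separator on its own line, as in the example above. The class `CorpusMerger` is a hypothetical name and is not part of BioTex.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not shipped with BioTex) that builds the
// multi-document input file described in this tutorial.
public class CorpusMerger {

    /** Separator line expected between documents. */
    public static final String SEPARATOR = "##########END##########";

    /** Concatenates the documents, each followed by the separator line. */
    public static String merge(List<String> documents) {
        StringBuilder sb = new StringBuilder();
        for (String doc : documents) {
            sb.append(doc.trim()).append('\n');
            sb.append(SEPARATOR).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "text of document 1",
                "text of document 2");
        // Write this string to a text file and upload it as the corpus.
        System.out.print(merge(docs));
    }
}
```

The resulting string can be saved to a plain text file and selected in the "Choose your text file" step.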