
Information about the Hellenic National Corpus (HNC)

User guide

Corpus information

List of sources

User guide

This page contains information as well as instructions concerning the use of the HNC.

System requirements

The only system requirements are an Internet connection and a web browser (preferably Microsoft Internet Explorer or Mozilla Firefox).

Users
There are two kinds of users.

Guest: As a guest, you do not need a username or a password.
You can access the HNC with the following restrictions:

1. You won’t be able to select a sub-corpus. Your queries will be restricted within the HNC as a whole.
2. Your query results cannot exceed a total of 5 sentences.
3. "Statistics" will show elements of two words maximum.

Subscriber: In order to have full access to the HNC, you must be a subscriber.
When you subscribe, you will be given a username and a password, which are strictly personal. As a subscriber you have the following rights:

1. You can make queries within the whole corpus, as well as within any sub-corpus you have selected.
2. You can make queries based on words, lemmas or parts of speech.
3. Your query results can reach a total of 2,000 sentences.
4. You can retrieve any sub-corpora you have used in the past.

Subscription Information
Subscriptions to the HNC last for 6 months and can be renewed automatically upon request.
Types of subscription include:

1. Private Persons (1 user) - for research purposes only
1.1 Standard fee
1.2 Student fee

2. Organisations - for research purposes only
2.1 1-5 users
2.2 6-10 users
2.3 11-30 users
2.4 more than 30 users

3. Companies - for research/commercial use
3.1 1-5 users
3.2 6-10 users
3.3 11-30 users
3.4 more than 30 users

For further information concerning subscriptions (process, costs etc.), please contact hnc@ilsp.gr.


Text Selection
If you are a guest, any sub-corpus you select cannot be used in queries.
If you are a subscriber, you can make queries within the whole corpus, as well as within any sub-corpus you have selected according to certain parameters.

Every time you select a sub-corpus, your selection is automatically saved. You can thus go back and use the same sub-corpus at a later time.

While selecting texts, you can choose more than one item by holding down
- the Shift key while clicking (to select contiguous items) or
- the Ctrl key while clicking (to select non-contiguous items).

Once you have defined your search items on a page, you must go on to the next step for your selection to be saved.
Repeat the procedure for as many steps as you wish. You can stop the text selection process at any stage by clicking on the end of the text selection. Any selection made on the page where you stop will not be saved.
Having completed the text selection process, you can start making Queries.

Queries
The system works as follows: you make specific queries within the HNC and the system returns sentences according to the parameters you have set.

1. Queries can be made for: words – lemmas – parts of speech (or combinations of the three).
2. You must type accents, but queries are not case-sensitive: it makes no difference whether you enter words in upper or lower case.
3. You can define the maximum distance between words, lemmas or parts of speech. “Distance” refers to the maximum number of words between two search items within a sentence, e.g. pronoun [ distance 3 ] κάνω [ distance 2 ] noun (Pic. 1).


Pic. 1


With [distance 3] set between the first and the second item, the system will return sentences in which the distance between these two query items is from 1 to 3 words (Pic. 2).



Pic. 2



4. You can define the maximum number of sentences that will be returned by the system as well as the maximum number of sentences per page.

5. You can use wildcards to replace one or more characters in a word.

To replace exactly one character, use the "_" (underscore) wildcard; to replace more than one character, use the "%" (per cent) wildcard.

Examples:
- ακαδημαϊκο_ (all words consisting of "ακαδημαϊκο" plus exactly one more character),
- ακαδη% (all words beginning with "ακαδη"),
- %μαϊκός (all words ending with "μαϊκός"),
- %δημα% (all words containing "δημα").

6. For every parameter, either lemma or word, you can use the logical operators OR (|) and NOT (|^).

Examples:

- αυγό|αβγό (returns sentences containing either of these two words)
- πέν_|^πένα|^πέντ (returns sentences containing four-letter words whose first three letters are "πέν", excluding the words "πένα" and "πέντ")
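
The wildcard and operator rules in items 5 and 6 can be illustrated with a minimal Python sketch. It is not the HNC's actual implementation: the function names are hypothetical, and it assumes that "%" may also stand for zero characters (as in SQL LIKE) and that a term prefixed with "^" excludes a word, neither of which the guide spells out.

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate one wildcard pattern into a regular expression string."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")        # "_" stands for exactly one character
        elif ch == "%":
            parts.append(".*")       # assumption: "%" may also match zero characters
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def parse_query(query: str):
    """Split a query on "|" into wanted patterns and "^"-prefixed exclusions."""
    include, exclude = [], []
    for term in query.split("|"):
        if term.startswith("^"):
            exclude.append(term[1:])
        elif term:
            include.append(term)
    return include, exclude

def word_matches(word: str, query: str) -> bool:
    """True if the whole word matches a wanted pattern and no excluded one."""
    include, exclude = parse_query(query)

    def hit(pattern: str) -> bool:
        # IGNORECASE mirrors the guide's note that letter case does not matter.
        return re.fullmatch(wildcard_to_regex(pattern), word, re.IGNORECASE) is not None

    return any(hit(p) for p in include) and not any(hit(p) for p in exclude)
```

Under these assumptions, `word_matches("αβγό", "αυγό|αβγό")` is true, while `word_matches("πένα", "πέν_|^πένα|^πέντ")` is false because "πένα" is explicitly excluded.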

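The distance parameter described in item 3 can be sketched in Python as follows. This is an illustration only, not the HNC's own code: the token layout, the function name and the hand-made tags are hypothetical, and the sketch assumes that distance is measured as the difference in word positions (so adjacent words are at distance 1), which is one way to read the "from 1 to 3 words" remark above.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical token layout for this sketch: (word, lemma, part of speech).
Token = Tuple[str, str, str]
Item = Callable[[Token], bool]

def find_matches(tokens: Sequence[Token], items: Sequence[Item],
                 max_distances: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return position tuples where the items occur in order within the limits."""
    if len(max_distances) != len(items) - 1:
        raise ValueError("need one distance per pair of consecutive items")
    results: List[Tuple[int, ...]] = []

    def extend(found: List[int]) -> None:
        k = len(found)
        if k == len(items):
            results.append(tuple(found))
            return
        if k == 0:
            candidates = range(len(tokens))
        else:
            # Assumption: distance = difference in word positions (1 = adjacent).
            first = found[-1] + 1
            last = min(len(tokens) - 1, found[-1] + max_distances[k - 1])
            candidates = range(first, last + 1)
        for i in candidates:
            if items[k](tokens[i]):
                extend(found + [i])

    extend([])
    return results

# Hand-annotated toy sentence (the tags are illustrative, not HNC output).
sentence = [
    ("Αυτός", "αυτός", "pronoun"),
    ("δεν", "δεν", "particle"),
    ("κάνει", "κάνω", "verb"),
    ("καμία", "κανένας", "pronoun"),
    ("προσπάθεια", "προσπάθεια", "noun"),
]
query = [
    lambda t: t[2] == "pronoun",   # any pronoun
    lambda t: t[1] == "κάνω",      # any form of the lemma κάνω
    lambda t: t[2] == "noun",      # any noun
]
print(find_matches(sentence, query, [3, 2]))
```

With the limits `[3, 2]` the toy sentence matches at positions 0, 2 and 4; tightening the second limit to 1 leaves no match, because the noun is two positions after the verb.
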
Query results
Query results for guests cannot exceed a total of 5 sentences.
Query results for subscribers can reach a total of 2,000 sentences.

1. Results consist of a series of sentences in no specific order.

2. To save these sentences to your local disk, select them, copy them and then paste them into an Excel (or Notepad or Word) file.

3. Every result sentence is identified by a number. This number is always the same for the specific sentence. By clicking on this number you can view information concerning the text this sentence belongs to (Pic. 3).
To return to the query results page click on your browser’s Back button.




Pic. 3

Concordances
Users can create concordances from the query results. When doing so, they can choose which of the three parameters is used to sort the words, as well as the number of characters shown in the results.




Statistics
Users can access certain statistical data on the contents of the HNC by clicking on the "Statistics" link in the menu. Statistical information includes word, lemma and part-of-speech frequencies, as well as the list of the 100 most frequent words (and lemmas) in the HNC. The wildcards described above can also be used when requesting statistical data. Because the possible lemma of every word is found using the Morphological Dictionary, the results for lemmas are not precise; therefore only the relative frequencies (per thousand) of the words will be returned.
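
As a rough illustration of what a per-thousand word frequency and a most-frequent-words list mean, here is a minimal Python sketch. It is not how the HNC computes its statistics: the function names are hypothetical, and lower-casing the words before counting is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def per_thousand(words: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each word form, expressed per 1,000 words."""
    counts = Counter(w.lower() for w in words)   # assumption: case is ignored
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: 1000 * c / total for w, c in counts.items()}

def most_frequent(words: Iterable[str], n: int = 100) -> List[Tuple[str, int]]:
    """The n most frequent word forms with their absolute counts."""
    return Counter(w.lower() for w in words).most_common(n)

sample = ["το", "αυγό", "και", "το", "αβγό"]
print(per_thousand(sample)["το"])
print(most_frequent(sample, 1))
```

In the five-word sample, "το" occurs twice, so its relative frequency is 400 per thousand.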


The ILSP Corpus


The ILSP Corpus has been developed by the Institute of Language and Speech Processing. It currently contains about 47,000,000 words and is constantly being updated. All texts have been selected so as to present a realistic picture of modern language use.

Written language

The ILSP Corpus contains samples of written language exclusively. Oral samples have not been incorporated in this version of the Corpus.

Modern Greek language

Texts in the ILSP Corpus represent modern Greek language use, most of them having been written after 1990. Texts written in highly idiomatic language have been excluded from the corpus. Most texts have been selected for their wide readership (high-circulation newspapers, best-selling books etc.).

Variety of sources

In order to include different types of language, texts from several media, belonging to different genres and dealing with various topics have been selected.

These texts have been provided to ILSP for this purpose by their copyright holders and are available for research purposes only.

For all HNC texts users can have access to the following information:

Bibliographic data

All the information that is necessary for identifying each text, such as its title, author, publisher, translator (in case the text is a translation), date of publication.

Classification data

All the information that is necessary for classifying each text into specific categories, based on its
a) medium,
b) genre and
c) topic.

Alongside these, there are further sub-categories (detailed genre, detailed topic), which are “open” and are gradually enriched as new texts are included in the HNC.

Classification of HNC texts
Classification according to medium
HNC texts are classified according to medium into the following categories:
Books: any kind of book
Internet: texts taken from the internet
Newspapers: published daily or weekly
Magazines: published every week, fortnight, month etc.
Miscellaneous: any kind of text that does not belong to any of the above categories, such as:
       leaflets, pamphlets, brochures, flyers, prospectuses etc.
       typed material, including all kinds of reports and documentation

HNC texts currently break down by publication medium as follows:

MEDIUM | PERCENTAGE
Book | 9.41%
Internet | 0.32%
Newspaper | 61.29%
Magazine | 5.89%
Miscellaneous | 23.08%


Classification according to genre

HNC texts are classified according to genre into the following categories:

GENRE | DESCRIPTION | EXAMPLE
NON-fiction | biography, including obituaries and autobiographies; sermons; school and student essays; etc. | «Μάης 36: Αναμνήσεις ενός πρωταγωνιστή»
FEAture | article in newspaper etc. which does not belong to INF or another, more specific genre; reviews, radio/TV magazines, etc. | «Υπολογιστές στην εκπαίδευση: πώς και γιατί»
ADVertising | various advertising texts, leaflets, spots etc. | «Το Ίδρυμα Ελληνικού Πολιτισμού εξορμά σε Αμερική και Ευρώπη»
OFFicial | laws, government circulars, official announcements, business correspondence | «Σύνταγμα της Ελλάδας»
PRIvate | various private texts, such as diaries and private letters | «Μονόλογος οργής και απόγνωσης»
FICtion | fiction and comic strips, entertainment, children’s and youth pages, jokes, games, drama film manuscripts, poems, song lyrics etc. | «Η μητέρα του σκύλου»
INFormation | news articles, folders and leaflets from the authorities, posters, signs etc. | «Ταχύπλοα: Διασκέδαση με κανόνες»
DIScussion | discussion, debate and conversation, including interviews, parliamentary speeches, letters to the editor etc. | «Η ιστορική συνέντευξη στο ABC»
N/A | non-applicable / mixed / unknown / unidentified / miscellaneous | -


Classification according to topic

HNC texts are classified according to topic into the following categories:

TOPIC | DESCRIPTION | EXAMPLE
LEIsure | sport, television, food, car, motorbike, shopping, home, astrology, fashion etc. | «Μπράβο Σπόρτινγκ!»
GEOgraphy | geography and travel, anthropology, folklore, cities etc. | «Οι παγίδες στα λιμάνια του Αιγαίου»
SCIence | science and technology, including mathematics, environment, space etc. | «Η Ανθρακική Πλατφόρμα Παρνασσού κατά το ανώτερο Ιουρασικό-κατώτερο Κρητιδικό: Στρωματογραφική διάρθρωση και Παλαιογεωγραφική εξέλιξη»
BUSiness | business and economy including advertising | «Πονοκέφαλος ύψους 1,5 τρισ.»
HIStory | history, archaeology, history of art, biographies etc. | «Ένα ταξίδι στην ιστορία που καταξιώνει το μύθο»
SOCiety | politics, sociology, law, defence, European Union etc. | «Διαλύεται 1 στους 3 γάμους στην Ε.Ε.»
HUManities | humanities, literature, philosophy, religion, fine arts, education etc. | «Αυτός που έκανε το κόμικς τέχνη»
HEAlth | health, medicine, psychology etc. | «Έμφραγμα: Μεγάλος κίνδυνος οι μικρές βλάβες»
N/A | non-applicable / mixed / unknown / unidentified / miscellaneous | «Διηγήσεις παραφυσικών φαινομένων»

Uses of the HNC
The HNC offers language researchers both the language material and the tools needed to handle it, so that they can obtain the information they require. These tools allow users to make queries and retrieve language samples according to certain pre-set parameters. Users can thus look for:

- specific word forms: by entering the word «παίζω» they can get every sentence containing this word form,
- lemmas (a lemma is the base form of a word, as it usually appears in dictionaries, and covers every inflected form of that word): by entering the lemma «παίζω» they can get every sentence containing any inflected form of this lemma found in HNC texts, such as «παίζει», «παίξω», «παίζοντας»,
- parts of speech (as well as morphological features): by entering «noun» they can get every sentence containing a noun, and
- combinations of the three (e.g. word-lemma, word-part of speech, word-lemma-part of speech).
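
The difference between searching by word form, by lemma, by part of speech, or by a combination can be sketched in Python as below. This is only an illustration under stated assumptions: the `Token` class, the `search` function and the hand-made annotations are hypothetical and do not reflect the HNC's internal data model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Token:
    word: str    # the form as it appears in the text
    lemma: str   # dictionary form (hand-annotated here)
    pos: str     # part of speech (hand-annotated here)

def token_matches(token: Token, word: Optional[str] = None,
                  lemma: Optional[str] = None, pos: Optional[str] = None) -> bool:
    """A token matches when every criterion that was given is satisfied."""
    return ((word is None or token.word.lower() == word.lower())
            and (lemma is None or token.lemma.lower() == lemma.lower())
            and (pos is None or token.pos == pos))

def search(sentences: List[List[Token]], **criteria) -> List[List[Token]]:
    """Return the sentences containing at least one matching token."""
    return [s for s in sentences if any(token_matches(t, **criteria) for t in s)]

s1 = [Token("Το", "το", "article"), Token("παιδί", "παιδί", "noun"),
      Token("παίζει", "παίζω", "verb")]
s2 = [Token("Θέλω", "θέλω", "verb"), Token("να", "να", "particle"),
      Token("παίξω", "παίζω", "verb")]
s3 = [Token("Παίζω", "παίζω", "verb"), Token("μπάλα", "μπάλα", "noun")]
corpus = [s1, s2, s3]

print(len(search(corpus, word="παίζω")))    # word form only
print(len(search(corpus, lemma="παίζω")))   # every inflected form
```

In this toy corpus, the word-form query finds only the sentence containing «Παίζω» itself, whereas the lemma query also finds the sentences containing «παίζει» and «παίξω».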

Table of sources


Publishing companies / Organizations

Γκοβόστης Εκδοτική Α.Ε.
Εκδόσεις GUTENBERG
Εκδόσεις Αθανασόπουλου - Παπαδάμη
Εκδόσεις Γαρταγάνη
Εκδόσεις Καστανιώτη
Εκδόσεις Κριτική
Εκδόσεις ΝΕΑ ΣΥΝΟΡΑ - Α. Α. Λιβάνη
Εκδόσεις Νεφέλη
Ιατρικές Εκδόσεις Λίτσας
Χ. Κ. Τεγόπουλος Εκδόσεις Α.Ε. (ΕΛΕΥΘΕΡΟΤΥΠΙΑ)
Δημοσιογραφικός Οργανισμός Λαμπράκη (ΤΟ ΒΗΜΑ, RAM)
Εκδόσεις Νέων Τεχνολογιών
Εκδοτική ΑΛΦΑ ΕΠΕ
Εκδόσεις Ψυχογιός
Βιβλιοσυνεργατική
MOTOTECH ΑΒΕΕ. (ΜΟΤΟ)
Mουσικοεκδοτική ABEE 2000 (Δίφωνο)
ΛΟΓΙΣΤΗΣ
Εκπαιδευτήρια Δούκα
NEMECIS
Διεθνής Αμνηστία
Εκδόσεις Δεσπ. Μαυρομάτη
Κέντρο Διεθνούς και Ευρωπαϊκού Οικονομικού Δικαίου-Δικηγορικός Σύλλογος Θεσσαλονίκης
ΚΥΒΕΡΝΟΓΡΑΦΟΙ
Μελέτες για την Ελληνική Γλώσσα
ΠΑΡΑΤΗΡΗΤΗΣ
ΤΟΠΙΚΑ (Εταιρεία Μελέτης Επιστημών του Ανθρώπου)
Εκδοτικός Οίκος ΣΑΚΚΟΥΛΑ ΟΕ, Θεσσαλονίκη
ΘΕΤΙΣ AUTHENTICS ΕΠΕ
Εκδόσεις Εξάντας

Καθημερινή
ΠΑΕ Καλλιθέα

Public foundations

Βουλή των Ελλήνων
Γραφείο Πρωθυπουργού
Κέντρο Βυζαντινών Ερευνών
Κέντρο Έρευνας ΑΣΟΕΕ
Υπηρεσία Δημοσιευμάτων Α.Π.Θ.
Υπουργείο Δικαιοσύνης
Υπουργείο Εμπορίου (νυν Υπουργείο Ανάπτυξης)
Υπουργείο Εξωτερικών
Εθνικό Μετσόβιο Πολυτεχνείο
Ινστιτούτο Γεωλογικών και Μεταλλευτικών Ερευνών (ΙΓΜΕ)
Κέντρο Έρευνας για Θέματα Ισότητας, Κ.Ε.Θ.Ι.
Υπουργείο Γεωργίας
Υπουργείο Εθνικής Άμυνας
Υπουργείο Τύπου & Μ.Μ.Ε.
Γενική Γραμματεία Απόδημου Ελληνισμού
Εθνικό Ινστιτούτο Εργασίας
Παν/μιο Αθηνών
Εθνικό Κέντρο Τεκμηρίωσης
