HNC Information
The HNC corpus of the Institute for Language and Speech Processing (ILSP) has been developed over many years. It currently comprises over 97,000,000 words and is continuously being enriched. All HNC documents are carefully selected to reflect the actual usage of Modern Greek.
►Written Language
The HNC corpus includes samples of written language exclusively from printed resources or the Internet. The current corpus version does not comprise any texts of spoken discourse.
►Modern Greek
The documents included in the HNC corpus have been selected as cross sections of the Modern Greek language, and most of them date from 1990 onward. Documents with dialectal or other peculiarities were excluded, and widely read documents (high-circulation newspapers, best-selling books) were preferred.
►Source variety
In order to demonstrate the full range of usage in a balanced way, texts were selected from a variety of sources, covering many genres and topics.
These documents have been legally provided to ILSP by their owners only for research purposes.
For every document included in the HNC corpus, the following information is provided:
- Bibliographic data
All important elements for identifying a document, such as the title, the author, the publisher, the translator (for translated documents) and the publication date.
- Classification data
Required information for the classification of texts:
- their medium
- their genre and
- their topic/content.
In addition, there are further subcategories, such as detailed topic and detailed genre, which will be enriched gradually as texts are added to the corpus collection.
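As an illustration only, the bibliographic and classification data described above could be modelled as a simple record. This is a minimal sketch: the class name, the field names and the sample values are hypothetical and are not taken from how ILSP actually stores HNC metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentRecord:
    """Hypothetical record for one corpus document (all names are illustrative)."""
    # Bibliographic data
    title: str
    author: str
    publisher: str
    publication_date: str
    translator: Optional[str] = None  # only set for translated documents
    # Classification data
    medium: str = "Other"
    genre: str = "Not classified"
    topic: str = "Not classified"
    # Optional finer-grained subcategories
    detailed_genre: Optional[str] = None
    detailed_topic: Optional[str] = None


# Example with made-up bibliographic values
example = DocumentRecord(
    title="Sample title",
    author="Sample author",
    publisher="Sample publisher",
    publication_date="1995",
    medium="Book",
    genre="Literature",
    topic="Arts",
)
```

Keeping the subcategories optional mirrors the note that detailed topic and detailed genre are filled in gradually.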
►Classification by publication medium
HNC documents are classified by their publication medium into the following categories:
- Book: every kind of book
- Internet: documents taken from the Internet
- Newspapers: daily or weekly newspapers
- Magazines: weekly, bi-monthly, monthly etc. publications
- Other: every type of document not included in the above, like:
- informative or advertising leaflets and brochures
- reports, applications, legal documents, proceedings, announcements
The distribution of HNC texts in the current version by publication medium is as follows:
Medium | Number of texts | Percentage |
---|---|---|
Book | 252 | 0.22% |
Internet | 63,813 | 55.74% |
Newspapers | 46,649 | 40.75% |
Magazines | 2,127 | 1.86% |
Other | 1,647 | 1.44% |
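The percentages in the table above follow directly from the counts. The short sketch below reproduces them, assuming the middle column counts texts per medium; the function name `shares` is illustrative and the category names are given in English.

```python
# Counts per publication medium, copied from the table above
counts = {
    "Book": 252,
    "Internet": 63_813,
    "Newspapers": 46_649,
    "Magazines": 2_127,
    "Other": 1_647,
}


def shares(counts):
    """Return each category's share of the total as a percentage, rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


percentages = shares(counts)
for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

The counts sum to 114,488, and each rounded share matches the percentage listed for its medium.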
►Classification by genre
The classification of HNC documents based on their genre includes the following:
Genre | Description | Example |
---|---|---|
Biography | texts related to personal and everyday life, such as biographies and autobiographies | «Μάης 36: Αναμνήσεις ενός πρωταγωνιστή» |
Opinion | basic articles of the press, reviews, permanent columns, essays, scientific publications, dissertations, scientific books, columns with subjective comments, humorous or chronographic content, quoting articles in other publications and generally texts that express some subjective point of view | «Υπολογιστές στην εκπαίδευση: πώς και γιατί» |
Advertisement | various promotional texts, brochures, spots, etc., as well as any text that announces events | «Το Ίδρυμα Ελληνικού Πολιτισμού εξορμά σε Αμερική και Ευρώπη» |
Official documents | legal texts, administrative reports, assessments, proceedings of the Greek Parliament, extracts from the Government Gazette, applications, official letters | «Σύνταγμα της Ελλάδας» |
Private documents | personal letters, diaries | «Μονόλογος οργής και απόγνωσης» |
Literature | literary works, scripts, fairy tales, lyrics | «Η μητέρα του σκύλου» |
Informing | informative texts such as news, reports, responses, questionnaires, weather / news reports, polls, official references, textbooks, tourist guides, bibliography, encyclopedias, educational books | «Ταχύπλοα: Διασκέδαση με κανόνες» |
Conversation | conversations, speeches, interviews, letters or articles appearing in the form of a letter (all in written form) | «Η ιστορική συνέντευξη στο ABC» |
Not classified | texts that do not fit into any of the above categories | |
►Classification by topic/content
HNC documents are classified by topic/content into the following categories:
Topic/Content | Description | Example |
---|---|---|
Occupation | Leisure, Sports, TV, Automotive, Motorcycle, Shopping, Housing, Astrology, Fashion, etc. | «Μπράβο Σπόρτινγκ!» |
Geography | Geography, Travel, Cities, Anthropology, Folklore, etc. | «Οι παγίδες στα λιμάνια του Αιγαίου» |
Science | Science, Technology, Mathematics, Environment-Ecology, Space etc. | «Η Ανθρακική Πλατφόρμα Παρνασσού κατά το ανώτερο Ιουρασικό-κατώτερο Κρητιδικό: Στρωματογραφική διάρθρωση και Παλαιογεωγραφική εξέλιξη» |
Business | Business, Economy, Advertising | «Πονοκέφαλος ύψους 1,5 τρισ.» |
History | History, Archeology, Art History, Biographies, etc. | «Ένα ταξίδι στην ιστορία που καταξιώνει το μύθο» |
Society | Politics, Sociology, Law, Defense, European Union, etc. | «Διαλύεται 1 στους 3 γάμους στην Ε.Ε.» |
Arts | Human Sciences, Book-Letters, Philosophy, Religion, Archeology, Art, Education, etc. | «Αυτός που έκανε το κόμικς τέχνη» |
Health | Health, Medicine, Psychology, Pedagogy, Veterinary, etc. | «Έμφραγμα: Μεγάλος κίνδυνος οι μικρές βλάβες» |
Not classified | texts that do not fit into any of the above categories | |
►Why is the HNC corpus of ILSP useful?
The integrated environment that has been developed offers language learners both the linguistic material and the computational tools needed to process it, so that they can extract the information they need. These tools process the texts and retrieve language samples according to the defined search criteria. For further details see Help.