Natural Language Processing (NLP) is the field of science concerned with modelling, processing and interpreting text written in natural language so that a computer can understand it.
We focus on extracting concepts such as diseases, treatments, drugs and examinations from medical texts.
We work with the Italian language, even though most current research is done on English.
For this task, called Named Entity Recognition (NER), we use state-of-the-art Machine Learning tools written in Python (spaCy and gensim), with both pre-trained language models and newly developed ones.
Our initial challenge is to create a labelled dataset in Italian for training new models, so that algorithms can infer concepts from the linguistic structure of the text.
The dataset is created with open-source, web-based software that is easy to use and requires no prior knowledge.
The new dataset will be integrated with the semantic database DBpedia and released as open source to the scientific community, to foster research and enable a higher level of accuracy in medical text processing.
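To make the idea of a labelled NER dataset more concrete, here is a minimal sketch of how one annotated sentence might be turned into per-token training labels. It is an illustration under stated assumptions, not the project's actual pipeline: the example sentence, the label names `DRUG` and `DISEASE`, the helper names `span_of` and `spans_to_bio`, and the simple regex tokenizer are all hypothetical. It uses only the Python standard library and the common BIO scheme (B = beginning of an entity, I = inside, O = outside).

```python
import re


def span_of(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def spans_to_bio(text, spans):
    """Convert character-offset entity spans to per-token BIO tags.

    text  : the raw sentence
    spans : non-overlapping (start, end, label) tuples, end exclusive
    """
    # Naive tokenizer (an assumption): runs of word characters,
    # or single punctuation marks.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tagged = []
    for token, tok_start, tok_end in tokens:
        tag = "O"
        for start, end, label in spans:
            # A token belongs to an entity if it lies fully inside the span.
            if tok_start >= start and tok_end <= end:
                tag = ("B-" if tok_start == start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


# Hypothetical annotated sentence ("Amoxicillin prescribed for bacterial pneumonia.")
text = "Prescritta amoxicillina per la polmonite batterica."
spans = [
    span_of(text, "amoxicillina", "DRUG"),
    span_of(text, "polmonite batterica", "DISEASE"),
]

for token, tag in spans_to_bio(text, spans):
    print(f"{token}\t{tag}")
```

Character offsets plus a label are roughly what web-based annotation tools tend to export; converting them to token-level tags is one common way to prepare such annotations for training a sequence-labelling model.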