Introduction

The MANIFESTO Babel Project uses the categories of the Manifesto Project codebook to identify the policy area or ideological stance of a text. These categories range from foreign relations and economic policies to social issues, cultural values, and demographic considerations.

The codebook groups its categories into seven domains: external relations, freedom and democracy, political system, economy, welfare and quality of life, fabric of society, and social groups.

The language models behind the MANIFESTO Babel Machine were fine-tuned on training data containing only the Main-Categories of the coding scheme (the three-digit variables), excluding the Sub-Categories (the four-digit variables and the four-digit variables with an underscore). We use the label 0 to indicate that the given text contains no meaningful information.

We have language-specific models for the following languages: Armenian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Georgian, German, Greek, Hebrew, Hungarian, Icelandic, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, and Turkish. We also encourage you to submit datasets in languages not on this list, as the results may still be useful for additional languages given the nature of large language models.

You can upload your datasets here for automated MANIFESTO-coding. If you wish to submit several datasets one after another, please wait 5-10 minutes between submissions. There are two upload options: pre-coded datasets and non-coded datasets. An explanation of the form and the dataset requirements is available here.

The upload requires you to fill in the following form with metadata about the dataset. Please upload your dataset and, in the case of a pre-coded dataset, attach the codebook used, if available.

The non-coded datasets should contain an id and a text column. The column names must be in row 1. You are free to add supplementary variables to the dataset beyond the compulsory ones in the columns following them.
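As an illustration of the layout described above, the following minimal sketch writes a non-coded dataset with the column names in row 1. It assumes the CSV format with UTF-8 encoding that the upload form asks for. The function name, the sample rows, and the file name in the comment are hypothetical and are not part of the Babel Machine itself.

```python
import csv
import io

def write_noncoded_csv(rows, fileobj):
    """Write (id, text) pairs as CSV, with the column names in row 1."""
    writer = csv.writer(fileobj)
    writer.writerow(["id", "text"])  # compulsory columns come first
    for row_id, text in rows:
        writer.writerow([row_id, text])

# Hypothetical sample rows, for illustration only.
sample = [
    (1, "We will lower taxes for small businesses."),
    (2, "Az oktatás mindenkié."),
]

# Write to an in-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO()
write_noncoded_csv(sample, buffer)
csv_bytes = buffer.getvalue().encode("utf-8")

# To write a real file instead (file name is hypothetical):
# with open("noncoded.csv", "w", encoding="utf-8", newline="") as f:
#     write_noncoded_csv(sample, f)
```

Supplementary variables, if any, would be appended as further columns after id and text.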

Pre-coded datasets must contain the following columns: id, text, cmp_code. The column names must be in row 1. Uploading a pre-coded sample is optional, but it can help us calculate performance metrics and fine-tune the language model behind the MANIFESTO Babel Machine. The detailed validation rules are available here. The cmp_code column must be numeric. You are free to add supplementary variables to the dataset beyond the compulsory ones in the columns following them. Automatic processing requires that these rules be followed.
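Before uploading, the column requirements above can be checked locally. The sketch below is only an approximation under stated assumptions: it tests that the compulsory column names appear in row 1 and, for pre-coded data, that every cmp_code value parses as a number. The function name and sample data are hypothetical, the codes in the samples are arbitrary, and the authoritative validation rules are the ones linked from this page.

```python
import csv
import io

REQUIRED_NONCODED = ("id", "text")
REQUIRED_PRECODED = ("id", "text", "cmp_code")

def check_dataset(fileobj, precoded=False):
    """Return a list of problems; an empty list means these checks passed."""
    reader = csv.reader(fileobj)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    required = REQUIRED_PRECODED if precoded else REQUIRED_NONCODED
    missing = [name for name in required if name not in header]
    if missing:
        return ["missing column(s) in row 1: " + ", ".join(missing)]

    problems = []
    if precoded:
        code_idx = header.index("cmp_code")
        # Data rows start on line 2 because the header occupies row 1.
        for line_no, row in enumerate(reader, start=2):
            value = row[code_idx].strip() if code_idx < len(row) else ""
            try:
                float(value)  # assumption: "numeric" means parseable as a number
            except ValueError:
                problems.append(f"row {line_no}: cmp_code {value!r} is not numeric")
    return problems

# Hypothetical samples, for illustration only.
good = "id,text,cmp_code\n1,First sentence.,401\n2,Second sentence.,0\n"
bad = "id,text,cmp_code\n1,First sentence.,abc\n"

good_result = check_dataset(io.StringIO(good), precoded=True)
bad_result = check_dataset(io.StringIO(bad), precoded=True)
```

For a file on disk, opening it with encoding="utf-8" and newline="" before passing it to the function would also surface encoding problems as a decoding error.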

After you upload your dataset and your file is successfully processed, you will receive the MANIFESTO-coded dataset and a file (both in CSV format) that includes the three highest probability category predictions by the MANIFESTO Babel model and the corresponding probability (softmax) scores assigned to each label. Please be aware that interpreting softmax scores as absolute model confidences could lead to false assumptions about model performance.
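To make the caveat about softmax scores concrete, here is a minimal sketch of how a top-three list can be derived from raw model outputs. The label subset and the logit values are made up for illustration and do not come from the MANIFESTO Babel model; the function names are hypothetical as well.

```python
import math

def softmax(logits):
    """Convert raw scores to values that are positive and sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, scores, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical label subset and logits, for illustration only.
labels = ["0", "104", "401", "503", "701"]
logits = [0.2, 1.0, 3.1, 2.9, -0.5]

scores = softmax(logits)
top3 = top_k(labels, scores)

for label, score in top3:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalized to sum to 1 across all labels, each value is relative to the other labels for that text. A high top score therefore shows only that one label outscored the rest, which is one reason it should not be read as an absolute confidence.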

If the files you would like to upload are larger than 1 GB, please send us a download link (for example, Dropbox or Google Drive) through our contact form.

If you have any questions or feedback regarding the MANIFESTO Babel Machine, please let us know using our contact form. Please keep in mind that we can only get back to you on Hungarian business days.

Submit a dataset:

For files that are small in size and/or datasets where prediction speed isn't a critical priority, we recommend using the CPU option.

All datasets must be uploaded in a CSV file format with UTF-8 encoding.

    The research was supported by the Ministry of Innovation and Technology NRDI Office and the European Union, in the framework of the RRF-2.3.1-21-2022-00004 Artificial Intelligence National Laboratory project.

    The research leading to these results has received funding from the European Union Horizon 2020 Framework Programme (H2020) under grant agreement no 101008468.

    On behalf of the Babel Machine project we are grateful for the possibility to use HUN-REN Cloud (see Héder et al. 2022; https://science-cloud.hu).