Unveiling Corrupt Practices in Public Procurement Across Europe: A Text Mining Approach
        
            
                
                        Methods
                        Quantitative
                        Corruption
                 
             
        
            
            
            
            
        
        
        
        Abstract
        It is widely recognized that corruption is hard to measure, especially the forms involving political and business elites. Public procurement has become one of the most promising areas of government activity, and measurement there has advanced significantly over the last decade. Still, large gaps remain: most corruption measurements and approximations rely on structured data describing visible features of the public procurement process. While this makes measurement more tractable, it risks missing more intricate and subtle forms of corruption embedded in tender texts. Because these forms are harder to track, they may be the preferred mode of corrupt contracting.
To address this gap and advance corruption measurement, this paper employs text mining techniques to explore hidden barriers to open competition and uncover corrupt restrictions in public procurement across countries.
The analysis uses tender documentation from official government publications in Hungary, Italy, and France. By focusing on detailed tender-level textual information, the research seeks to predict limitations to open competition associated with corruption, offering insights beyond established indicators. Following a text-as-data approach, it incorporates textual information describing the purchased goods and services, as well as conditions for tenderers, such as prior experience requirements. Building on qualitative case studies, the study demonstrates that corruption often manifests through nuanced conditions embedded in lengthy tendering documents, which serve to eliminate competitors.
Alongside Natural Language Processing techniques, the research replicates past studies that predict single bidding on otherwise competitive tenders, a crucial indicator of corruption. The models, based on Logistic Regression and Random Forest algorithms trained on word n-grams, outperform baseline models. The analysis includes control variables such as year, product code division, bid price, location, and buyer type. The research extends its scope by dissecting the components of the text separately, including tender requirements, award criteria, and product descriptions. We find that product descriptions are the most impactful in predicting corruption risks. The research goes beyond prediction, aiming to interpret the models and deepen our understanding of corrupt practices.
Our multi-country approach enhances measurement validity and reliability, enabling a comparative examination of corrupt practices across different contexts.
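
The word n-gram classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses only the Python standard library, a hand-written logistic regression fitted by stochastic gradient descent, and four invented tender descriptions with invented labels (1 = single bid, 0 = several bids). The function names, the toy texts, and the hyperparameters are all assumptions made for illustration, and the control variables and Random Forest models mentioned above are left out.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a lowercased text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def featurize(text, vocab, n_max=2):
    """Map a text to a sparse {feature_index: count} dict; unseen n-grams are dropped."""
    counts = Counter(g for g in word_ngrams(text, n_max) if g in vocab)
    return {vocab[g]: c for g, c in counts.items()}


def predict_proba(x, weights, bias):
    """Predicted probability of the positive class (single bidding)."""
    z = bias + sum(weights[j] * v for j, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, n_features, epochs=200, lr=0.1):
    """Fit an unregularized logistic regression with plain SGD."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = predict_proba(x, weights, bias) - y  # gradient of log-loss w.r.t. z
            bias -= lr * err
            for j, v in x.items():
                weights[j] -= lr * err * v
    return weights, bias


# Hypothetical toy data, invented for illustration only.
texts = [
    "supplier must hold brand x certified maintenance licence",
    "only brand x certified spare parts accepted",
    "office paper standard a4 any manufacturer",
    "standard office chairs any manufacturer accepted",
]
labels = [1, 1, 0, 0]

# Build the n-gram vocabulary from the training texts.
vocab = {}
for t in texts:
    for g in word_ngrams(t):
        vocab.setdefault(g, len(vocab))

rows = [featurize(t, vocab) for t in texts]
weights, bias = train_logistic(rows, labels, len(vocab))

risky = predict_proba(featurize("brand x certified printer toner", vocab), weights, bias)
open_ = predict_proba(featurize("standard toner any manufacturer", vocab), weights, bias)
print(f"restrictive wording: {risky:.2f}, generic wording: {open_:.2f}")

# Inspecting the largest weights is one simple route to interpretation.
index_to_gram = {i: g for g, i in vocab.items()}
top = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:5]
print([index_to_gram[j] for j in top])
```

In this toy setting the model scores brand-specific wording as higher risk than generic wording, and the highest-weighted n-grams show which phrases drive the prediction. A real analysis would need held-out evaluation, regularization, and the control variables listed in the abstract.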