Master of Science in International Management and Engineering, Hamburg University of Technology, Germany, 2019 - 2021
Working Student, Hamburg University of Technology, Germany, 2020 - 2021
Project Manager, Dr. Ing. h. c. F. Porsche AG, Germany, 2018 - 2019
Internship in the automotive industry, Dr. Ing. h. c. F. Porsche AG, Germany, 2017 - 2018
Bachelor of Science in Electrical Engineering and Management, Kiel University, Germany, 2014 - 2018
Working Student, Kiel University, Germany, 2016 - 2017
Selected Publications
The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment
2026, Working Paper
Jonas Wilinski
Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally-deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources - Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref - via DOI normalization while preserving source-level schemas. The resource comprises approximately 960 GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics (≥ 0.65 threshold) with F1 = 0.77 at the recommended ≥ 0.85 operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson r = 0.76 - 0.87), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
How Founders Evaluate VCs: A GPT-Based Extraction of Value-Criteria from Online VC Reviews
Academy of Management Proceedings 2025(1), 18846
2025, Conference Paper
Olaf Specht, Jan H. Wilinski, Julius C. Thiesen, Christoph Ihl
Venture Capital (VC) investments positively impact startup success, enhancing operational performance through factors like collaboration and value-added services. While research on investment decisions primarily focuses on investors’ selection criteria and decision-making processes, our study addresses the gap concerning the founders’ perspective. Using Generative Pre-Trained Transformers (GPT) for text classification on a dataset of 8,561 online VC reviews, we extract 9,229 unique value-criteria from the founders’ perspective. A text-embedding clustering method groups these criteria into 26 categories. By analyzing additional startup lifecycle data, we determine which value-criteria are crucial at different startup stages. Our findings reveal that investors’ “general social skills” are the most important value-criterion across all startup stages, while more mature startups prioritize more self-serving criteria focused on growth and long-term relationships. Additionally, we observe that founders mostly consider their value-criteria fulfilled by investors, with “general advice” being particularly well-executed.