1st SMASHING WORKSHOP

Europe/Ljubljana
University of Nova Gorica, Lanthieri mansion, Vipava, Slovenia

Description

This workshop is the first network meeting of the SMASH project. SMASH is a multidisciplinary program centered on developing cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) applications for science and the humanities, including climate science, precision medicine, fundamental physics and linguistics. It is co-funded by the European Union via the Marie Skłodowska-Curie COFUND action and connects scholars from five top-level institutions in Slovenia with 25 associated partners: Slovenian businesses and academic institutions worldwide.

The First SMASHing workshop will gather scientists working in the SMASH research areas, with the aim of creating a multidisciplinary environment that fosters knowledge exchange between different fields and between academia and industry, thereby building the SMASH community. The workshop will be structured around discussions of different classes of ML/AI techniques, followed by examples of their successful application in the SMASH research areas. More specifically, the workshop will focus on:

  • Large language models/graphs/transformers and time series
  • Computer vision/generative models
  • Statistical approaches (simulation-based inference, etc.)
  • Reinforcement/unsupervised learning

In addition to invited talks, the workshop will feature discussion sessions, with plenty of time to exchange ideas and build the community.

Speakers include:

  • Michelangelo Ceci (Department of Computer Science, University of Bari) - Semi-supervised learning, with applications to biological data analysis
  • Todor Ganchev (Head of the Artificial Intelligence Lab, Technical University of Varna, SMASH supervisor) - ML and animal communication
  • Roger Guimera (Universitat Rovira i Virgili, Barcelona, SMASH supervisor) - Computational scientific discovery
  • Lukas Heinrich (Technical University, Munich), TBC - Foundation models & scientific discovery in particle physics
  • Mario Jurić (Un. of Washington, SMASH supervisor) - ML and survey science in astro
  • Florian List (Un. of Vienna) - SBI applications to astrophysics and cosmology
  • Tilman Plehn (Heidelberg University) - ML/AI in particle physics
  • Matthew Purver (Queen Mary Un. of London, SMASH supervisor) - Large language models and application to science
  • Baseerat Romshoo (Leibniz Institute for Tropospheric Research, TROPOS), TBC - Multi-layer perceptron in atmospheric physics
  • Lucia Sacchi (Un. of Pavia) - ML in biomedical research
  • Tanja Samardžić (Un. of Zurich, Language and Space Lab, SMASH GB member) - Is computational language modelling linguistics?
  • Uroš Seljak (UC Berkeley, SMASH supervisor) - Scientific discovery and ML in cosmology
  • Berend Snijder (ETH Zurich, Institute of Molecular Systems Biology) - ML in personalized medicine
  • Nenad Tomasev (DeepMind) - AI for good: Science and Healthcare
  • Mark Winands (Maastricht University, Netherlands) - Adaptive-Monte Carlo Search and its Application to Science 

Soft skill lecturers:

  • Laura Busato (SISSA Medialab) - Social media for scientists
  • Brigita Jurisic (Business and Strategic Relations Officer, INL, Portugal) - Social and environmental value assessment of AI/ML technologies
  • Paul McGuiness - How to write grant applications

    • Registration: Main hall of the Lanthieri mansion

    • 1
      Computational language modelling for cognitive and social science
      Speaker: Prof. Matthew Purver (Queen Mary Un. of London)
    • 2
      Semi-supervised learning, with applications to biological data analysis
      Speaker: Prof. Michelangelo Ceci (Department of Computer Science, University of Bari)
    • 10:30
      Coffee break
    • A word about SMASH from our partner and funding institutions (including Q&A with the press)
      • 3
        Introduction
        Speaker: Gabrijela Zaharijas
      • 4
        Welcome by the Rector of the UNG
        Speaker: Prof. Boštjan Golob (UNG)
      • 5
        A word from the Ministry of Higher Education, Science and Innovation
        Speaker: Dr Jure Gašparič (MVZI)
      • 6
        Short presentations from our partners - Jožef Štefan Institute
        Speaker: Prof. Boštjan Zalar (JSI)
      • 7
        Short presentations from our partners - University of Ljubljana FRI
        Speaker: Prof. Mojca Ciglarič (UL FRI)
      • 8
        Short presentations from our partners - ARSO
        Speaker: Prof. Mojca Dolinar (ARSO)
      • 9
        Concluding remarks by the beneficiary
        Speaker: Prof. Gabrijela Zaharijas (UNG)
      • 10
        Q&A with Press
    • 12:30
      Lunch
    • 11
      Social and environmental value assessment of AI/ML technologies (soft-skill lecture)
      Speaker: Brigita Jurisic (INL, Portugal)
    • 12
      Is computational language modelling linguistics?
      Speaker: Prof. Tanja Samardžić (Un. of Zurich, Language and Space Lab)
    • 15:45
      Coffee break
    • 13
      Is ChatGPT Transforming Academics' Writing Style?

      Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in their abstracts by means of a statistical analysis of word frequency changes. Our model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis. We find that ChatGPT is having an increasing impact on arXiv abstracts, especially in the field of computer science, where the fraction of ChatGPT style abstracts is estimated to be approximately 35%, if we take the output of one of the simplest prompts, "revise the following sentences", as a baseline. We conclude with an analysis of both positive and negative aspects of the penetration of ChatGPT into academics' writing style.

      Speaker: Mingmeng Geng (SISSA)
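The word-frequency-change signal described in this abstract can be sketched minimally as follows; this toy example uses invented sentences rather than arXiv abstracts, and the marker word "delve" is an assumed ChatGPT-style marker, not one taken from the talk:

```python
from collections import Counter

# Toy "corpora": invented sentences standing in for pre- and post-ChatGPT abstracts.
before = "we study the model and we test the model".split()
after = "we delve into the model and we explore the model".split()

def relative_freq(words):
    """Map each word to its relative frequency in the corpus."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

f_before = relative_freq(before)
f_after = relative_freq(after)

# A rise in a marker word's relative frequency is the basic style-shift signal.
change = f_after.get("delve", 0.0) - f_before.get("delve", 0.0)
print(f"frequency change for 'delve': {change:+.3f}")
```

The paper's actual estimator calibrates such frequency shifts against simulated ChatGPT-revised abstracts after a noise analysis; this sketch only shows the raw signal.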
    • 14
      Upgrading Particulate matter Source apportIonment through Data sciencE

      Particulate Matter (PM) has different severe impacts on human health and climate depending on its size and composition (Yang et al. (2018), Daellenbach et al. (2020)). Source apportionment (SA) is the process of identification of ambient air pollution sources and the quantification of their contribution to pollution levels, and is usually conducted through receptor models (RM). Their usual approach is to decompose the measurements into products of fingerprints (or profiles) and time series, based, respectively, on the chemical composition and time evolution of each of the sources of PM.

      The most widely used RM is the Positive Matrix Factorisation (PMF) algorithm (Paatero and Tapper, 1994), although new methodologies are being developed. For instance, the novel Bayesian auto-correlated matrix factorisation method (BAMF, Rusanen et al. 2024) integrates an auto-correlation term emulating real-world pollutant sources time evolution, producing higher accuracy results than PMF. However, both PMF and BAMF struggle to provide well-separated profiles, leading, in turn, to mixed time series contributions. The UPSIDE project (Upgrading Particulate matter Source apportIonment through Data sciencE) aims to reduce profile separation difficulties on RMs through data science techniques.

      For profile improvement, a sparsity-handling algorithm named horseshoe regularisation (Piironen and Vehtari, 2017) will be applied to BAMF to improve profile determination. The horseshoe prior encourages some parameters to be close to zero but allows others to take large values. This method reduces the dimensionality of the problem by scaling down the non-significant species in each profile. In this way, profiles are expected to be less noisy and thus to better portray the nature of the atmospheric pollution sources.

      With the aim of testing BAMF capabilities, BAMF with horseshoe (BAMFh) will first be applied to different types of synthetic data. Aerosol synthetic datasets for source apportionment evaluation are limited and often too simplistic to mimic the actual patterns of atmospheric sources. The second aim of this project is therefore to generate synthetic datasets with machine learning, by merging real-world, chamber, and modelled atmospheric source time series through the Rotation-Based Iterative Gaussianisation (RBIG) machine learning technique (Laparra et al., 2011). After its training phase, RBIG will provide replicable time series of the input sources even if they stem from different databases.

      Subsequently, BAMF with the regularised horseshoe will be applied to the generated synthetic datasets, and thorough parameter tuning will be performed in search of optimal performance. The outcomes of BAMFh, BAMF, and PMF will then be compared to assess the performance of these RMs. The last phase of the project consists of applying BAMFh to real-world data. The improved determination of air pollution sources is intended to serve as input for source-dependent health and climate studies.

      References

      Daellenbach, K. R. et al. (2020), Nature, 587(7834), 414-419.
      Laparra, V. et al. (2011), IEEE Transactions on Neural Networks, 22(4), 537-549.
      Paatero, P. and Tapper, U. (1994), Environmetrics, 5(2), 111-126.
      Piironen, J. and Vehtari, A. (2017), Electron. J. Statist., 11(2), 5018-5051.
      Rusanen, A. et al. (2024), Atmos. Meas. Tech. Disc., 1-2828.
      Yang, M. et al. (2018), Environ. Int., 120, 516-524.

      Acknowledgements

      This work was supported by the SMASH project (No. 101081355), funded by the European Union's Horizon Europe research and innovation programme under the Marie Skłodowska-Curie grant, and by ARIS programmes I0-0033 and P1-0385.

      Speaker: Marta Via Gonzalez (UNG-CAR)
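To illustrate the horseshoe prior mentioned in this abstract (a self-contained numerical sketch under assumed parameter values, not part of the UPSIDE code): drawing coefficients with half-Cauchy local scales yields the characteristic shape of the prior, with most mass concentrated near zero yet heavy tails that let a few coefficients stay large. The global scale tau = 0.1 is an arbitrary choice for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
tau = 0.1                                # global shrinkage scale (assumed for the demo)
lam = np.abs(rng.standard_cauchy(n))     # local scales: half-Cauchy(0, 1)
beta = rng.normal(0.0, tau * lam)        # horseshoe-distributed coefficients

# Most coefficients are shrunk close to zero ...
frac_small = np.mean(np.abs(beta) < 0.05)
# ... but the heavy Cauchy tails leave room for genuinely large signals.
frac_large = np.mean(np.abs(beta) > 1.0)
print(f"near zero: {frac_small:.2f}, large: {frac_large:.3f}")
```

This shrink-but-don't-crush behaviour is what is expected to suppress non-significant species in each profile while preserving the dominant ones.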
    • 15
      Simulation of particle physics silicon detectors using transformers

      Simulation of physics processes and detector response is a vital part of high energy physics research, but it also represents a large fraction of the computing cost. Generative machine learning is successfully complementing full (standard, Geant4-based) simulation as part of fast simulation setups, improving performance compared to classical approaches.

      A lot of attention has been given to calorimeters, the slowest part of the full simulation, but their speed becomes comparable with that of silicon semiconductor detectors once fast simulation is used. This makes silicon detectors the next candidate for speed-up, especially with the growing number of channels in future detectors.

      This work studies the use of transformer architectures for fast simulation of silicon tracking detectors. The OpenDataDetector is used as a benchmark detector. Physics performance is estimated by comparing tracks reconstructed with the ACTS tracking framework between the full simulation and the machine learning one.

      Speaker: Tadej Novak (Jozef Stefan Institute)
    • 16
      Foundation models and scientific discovery in particle physics
      Speaker: Prof. Lukas Heinrich (Technical University, Munich)
    • 17
      Scientific discovery and ML in cosmology
      Speaker: Prof. Uroš Seljak (UC Berkeley)
    • 18
      AI for good: Science and Healthcare
      Speaker: Dr Nenad Tomasev (DeepMind)
    • 10:30
      Coffee break
    • 19
      Testing gravity with cosmology

      In this talk I will first introduce the main motivations for using cosmological data to test the laws of gravity. I shall focus on the established methods and main results. Finally, I will describe how machine learning can help us understand the fundamental laws of nature.

      Speaker: Emilio Bellini (UNG)
    • 20
      Characterizing the Fermi-LAT high-latitude sky with simulation-based inference

      The GeV gamma-ray sky, as observed by the Fermi Large Area Telescope (Fermi LAT), harbors a plethora of localized point-like sources. At high latitudes ($|b|>30^{\circ}$), most of these sources are of extragalactic origin. The source-count distribution as a function of their flux, $\mathrm{d}N/\mathrm{d}S$, is a well-established quantity to summarize this population. We employ sequential simulation-based inference using the truncated marginal neural ratio estimation (TMNRE) algorithm on 12 years of Fermi-LAT data to infer the parameters of the $\mathrm{d}N/\mathrm{d}S$ distribution in this part of the sky for energies between 1 GeV and 10 GeV. While our approach allows us to cross-validate existing results in the literature, we demonstrate that we can go further than mere parameter inference. We derive a source catalogue of detected sources at high latitudes in terms of position and flux obtained from a self-consistently determined detection threshold based on the LAT's instrument response functions and utilized gamma-ray background models.

      Speaker: Christopher Eckner (Center for Astrophysics and Cosmology, University of Nova Gorica)
    • 21
      Automatic feature selection and weighting: tree biodiversity estimators explained by other variables

      In any large database, most of the features defining a data point are redundant, irrelevant, or affected by large noise, and have to be discarded. To do this, one needs to answer: What is the best dimensionality of a reduced feature space in order to retain maximum information? How can one correct for different units of measure? What is the optimal scaling of importance between features? We use a statistical method, Information Imbalance, to select the most informative feature sets among many possible ones. In an example from the Amazon rainforest, we find sets of biotic and abiotic features to predict tree biodiversity and species richness, and compare common biodiversity estimators for their information content. The differentiable version of this statistic can automatically weight features relative to each other, accounting for units of measure and importance. Other use cases include variable selection in molecular dynamics simulations, clinical data sets, and neural network potentials.

      Speaker: Romina Wild (Scuola Internazionale Superiore di Studi Avanzati (SISSA) Trieste)
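A brute-force sketch of the Information Imbalance statistic described in this abstract, written for this summary rather than taken from the speaker's code: Δ(A→B) is 2/N times the average rank, in space B, of each point's nearest neighbour found in space A. Values near 2/N mean space A fully predicts space B; values near 1 mean A carries no information about B.

```python
import numpy as np

def information_imbalance(A, B):
    """Delta(A -> B): 2/N times the mean rank in space B of each point's
    nearest neighbour in space A (ranks are 1-based, self excluded)."""
    N = len(A)
    dA = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    dB = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=-1)
    np.fill_diagonal(dA, np.inf)   # exclude self-distances
    np.fill_diagonal(dB, np.inf)
    nn_A = dA.argmin(axis=1)                       # nearest neighbour index in A
    ranks_B = dB.argsort(axis=1).argsort(axis=1)   # 0 = nearest point in B
    r = ranks_B[np.arange(N), nn_A] + 1            # its 1-based rank in B
    return 2.0 * r.mean() / N

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
noise = rng.normal(size=(200, 3))
print(information_imbalance(X, X))      # near 0: X fully predicts itself
print(information_imbalance(X, noise))  # near 1: X says nothing about noise
```

Comparing Δ in both directions between candidate feature sets is what allows ranking them by information content.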
    • 22
      Simulation Based Inference applications to astrophysics and cosmology
      Speaker: Dr Florian List (Un. of Vienna)
    • Group photo
    • 12:40
      Lunch
    • Presentation of local companies: ARCTUR, Cosylab, Genialis
      • 23
        Role of SMEs in SMASH
        Speaker: Gabrijela Zaharijas
      • 24
        AI research in ARCTUR
        Speaker: Tomi Iljas
      • 25
        Cosylab in AI for real time adaptive treatment
        Speaker: Ado Janse Van Rensburg
      • 26
        AI research in Genialis

        About Genialis

        Genialis is the RNA biomarker company. They develop and validate clinically actionable biomarkers to help pharmaceutical and diagnostic partners predict patient responses and guide treatment decisions in the field of oncology. Genialis does that by using biology-informed machine learning that models dozens of biological processes from gene expression of cancer patients’ tumors. Their algorithms are trained on one of the world’s most ethno-geographically diverse cancer data sets, with the mission of creating a world where healthcare delivers the best possible outcomes for patients, their families, and their communities.

        About Dr. Luka Ausec

        Dr. Luka Ausec is the Chief Discovery Officer at Genialis. He holds a Ph.D. in molecular biology and biotechnology from the University of Ljubljana. With a strong background in both biology and computational disciplines, Luka is skilled at innovating solutions where these fields meet. At Genialis, he directs internal R&D and external partner projects, all with the shared goal of transforming medicine through data.

        Speaker: Luka Ausec
    • 27
      Data-driven contextual anomaly search: finding infrared-excess in stars

      "Anomaly detection" covers a broad range of problems and settings. In some instances, it is seen as finding "rare objects", i.e. objects lying in a low-density region of the feature space. However, this task can quickly become difficult, particularly for higher dimensional, noisy or complex (non-rectangular) data where reliable density estimation is non-trivial.
      Additionally, not all low-density points are necessarily interesting anomalies (and vice versa). In many cases, we are interested in specific types of anomalies: points that diverge, in some aspects, from our expectations (if we have reliable models) or from otherwise similar data points, along specific axes (e.g. in a given region of the feature space). In other words, we might be looking for objects that seem normal in every other regard but are "weird" (in some direction) contextually to their other features. This, however, requires some model to tell us what is "normal" conditioned on the context features. If we do not have such models (or if they are computationally expensive), we can take advantage of supervised Machine Learning methods to perform such a search in a data-driven way, without requiring however supervised anomalous examples.
      We present a contextual anomaly detection pipeline for mid-infrared excess in FGK stars: mid-IR excess is a tracer of events such as planetary collisions and Extreme Debris Disks, which appear to be relatively rare. To identify outlier candidates, we use a combination of prediction errors from a set of Random Forests, and statistics using prediction errors of similar neighbouring points. Our pipeline bypasses the need for accurate stellar modelling while providing a high detection sensitivity crucial in the mid-IR. This allows us to scale our search to an unprecedented data set of 4.9 million stars, where we identify 53 mid-IR excess candidates.

      Speaker: Gabriella Contardo (SISSA)
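The core idea of the pipeline in this abstract — flagging points whose target feature is badly predicted from their context features — can be sketched with a single Random Forest on synthetic data. This is a hypothetical minimal example, not the authors' multi-forest, neighbour-based pipeline, and the features are invented stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 500
context = rng.normal(size=(n, 3))                  # stand-in for stellar context features
target = context @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
target[0] += 10.0                                  # planted contextual anomaly ("excess")

# Out-of-fold predictions keep each point out of its own training data.
model = RandomForestRegressor(n_estimators=100, random_state=0)
pred = cross_val_predict(model, context, target, cv=5)

# Robust z-score of the prediction error (median/MAD instead of mean/std).
resid = target - pred
mad = np.median(np.abs(resid - np.median(resid)))
score = np.abs(resid - np.median(resid)) / (1.4826 * mad)
print(f"top anomaly index: {np.argmax(score)}, score: {score.max():.1f}")
```

The actual search additionally combines errors from a set of forests and statistics over neighbouring points to keep detection sensitivity high in the mid-IR.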
    • 28
      Modelling the collapse of complex societies

      The talk will cover some key modelling issues that come up when considering the long-term development of societies. Of particular importance is the topic of societal collapse as the archaeological record has numerous instances of the phenomenon. I will discuss some of the general modelling philosophy, relevant literature, my own work on ancient societies (Easter Island, the Maya, Roman Empire and Chinese dynasties) and implications for modern society. There are several modelling considerations unique to modern society that will be highlighted.

      As a specific example we propose a simplified model of a socio-environmental system that accounts for population, resources, and wealth, with a quadratic population contribution in the resource extraction term. Given its structure, an analytical treatment of attractors and bifurcations is possible. In particular, a Hopf bifurcation from a stable fixed point to a limit cycle emerges above a critical value of the extraction rate parameter. The stable fixed-point attractor can be interpreted as a sustainable regime, and a large-amplitude limit cycle as an unsustainable regime. The model is generalized to multiple interacting systems, with chaotic dynamics emerging for small non-uniformities in the interaction matrix. In contrast to systems where a specific parameter choice or a high number of dimensions is necessary for chaos to emerge, chaotic dynamics here appears as a generic feature of the system. In addition, we show that diffusion can stabilize networks of sustainable and unsustainable societies, and thus, interconnection could be a way of increasing resilience in global networked systems. Overall, the multi-systems model provides a timescale of predictability (300-1000 years) for societal dynamics comparable to results from other studies, while indicating that the emergent dynamics of networks of interacting societies over longer time spans is likely chaotic and hence unpredictable.

      Speaker: Sabin Roman (Jožef Stefan Institute)
    • 29
      HF-SCANNER: High frequency sea-level oscillations modeling in the Mediterranean using machine learning

      We will present the HF-SCANNER project, detailing the datasets used, the project's goals, and the preliminary results. The HF-SCANNER project aims to develop a fast and accurate forecasting system for high-frequency sea level oscillations (HFOs) and meteotsunamis in the Mediterranean using deep learning and data from both simulations (ECMWF) and observations (sea level and air pressure). Intense HFOs in the Mediterranean region, sometimes leading to destructive meteotsunamis, occur due to specific and spatially limited meteorological conditions. Although their physical dynamics are understood, current forecasting systems based on hydrodynamic models are unreliable and computationally expensive. To address this problem, the HF-SCANNER project aims to build deep learning models of HFOs at locations with sufficiently long measurement records and transfer them to locations with limited data. We will also address the challenges posed by the datasets, such as shorter recording intervals, data gaps and outliers at certain locations, and how these challenges were overcome. We will present results obtained using a deep convolutional neural network trained on simulated data (mean sea-level pressure, geopotential heights, specific humidity, wind speed and air temperature) and measurements (1-min sea levels). Preliminary results show that the initial model developed for the Bakar tide-gauge station can predict the highest expected HFO amplitudes for the next three days with a reasonable level of accuracy.

      Speaker: Iva Međugorac (University of Nova Gorica)
    • 30
      Meta-learning in evolutionary reinforcement learning: some paths forward

      Recent years have seen a surge of interest in evolutionary reinforcement learning (evoRL), where evolutionary computation techniques are used to tackle reinforcement learning (RL) tasks. Naturally, many of the existing ideas from meta-RL can also be applied in this context. This is particularly important when handling dynamic (non-stationary) RL environments, where agents need to respond swiftly to changes (shifts) in the environment. We will discuss several research paths aimed at integrating meta-RL with evoRL, particularly those that leverage less orthodox principles, such as evolvability and higher-order meta-learning through meta-mutation rates. The incorporation of these principles is expected to lead to greater sample efficiency when dealing with dynamic and/or noisy RL environments, which are typical of most real-life RL applications.

      Speaker: Bruno Gašperov (University of Ljubljana, Faculty of Computer and Information Science)
    • 16:30
      Coffee break
    • Public event: Impact of AI on Science and Society
    • 18:30
      Refreshments
    • 32
      ML and animal communication
      Speaker: Prof. Todor Ganchev (Varna Un.)
    • 33
      HIDRA3: a robust deep-learning model for multi-point ensemble sea level forecasting

      Accurate modeling of sea level and storm surge dynamics with several day-long temporal horizons is essential for effective coastal flood response and the protection of coastal communities and economies. The classical approach to this challenge involves computationally intensive ocean models that typically calculate sea levels relative to the geoid, which must then be correlated with local tide gauge observations of sea surface heights (SSH). A recently proposed deep learning model, HIDRA2, avoids numerical simulations while delivering competitive forecasts. Its forecast accuracy depends on the availability of a sufficiently large history of recorded SSH observations used in training. This makes HIDRA2 less reliable for locations with less abundant SSH training data. Furthermore, since the inference requires immediate past SSH measurements at input, forecasts cannot be made during temporary tide gauge failures. We address the aforementioned issues with a new architecture, HIDRA3, that considers observations from multiple locations, shares the geophysical encoder across the locations, and constructs a joint latent state, which is decoded into forecasts at individual locations. The new architecture brings several benefits: (i) it improves training at locations with scarce historical SSH data, (ii) it enables predictions even at locations with sensor failures, and (iii) it reliably estimates prediction uncertainties. HIDRA3 is evaluated by jointly training on eleven tide gauge locations along the Adriatic. Results show that HIDRA3 outperforms HIDRA2 and the standard numerical model NEMO by ~15% and ~13% MAE reduction at high SSH values, respectively, setting a solid new state-of-the-art. Furthermore, HIDRA3 shows remarkable performance at substantially smaller amounts of training data compared with HIDRA2, making it appropriate for sea level forecasting in basins with large regional variability in the available tide gauge data.

      Speaker: Marko Rus (Slovenian Environment Agency, Office for Meteorology, Hydrology and Oceanography, Ljubljana, Slovenia)
    • 34
      Coarse reconstruction with iterative refinement network for sparse spatio-temporal satellite data

      Sea surface temperature (SST) is critical for weather forecasting and climate modeling, however remotely sensed SST data often suffer from incomplete coverage due to cloud obstruction and limited satellite swath width. While deep learning approaches have shown promise in reconstructing missing data, existing methods struggle to accurately recover fine-grained details, which, however are crucial for many down-stream geophysical processing and prediction problems. We propose CRITER (Coarse Reconstruction with Iterative Refinement network), a novel two-stage approach comprising: (i) a transformer-based Coarse Reconstruction Module (CRM) that estimates low-frequency SST components by leveraging global spatio-temporal correlations in available observations, and (ii) an Iterative Refinement Module (IRM) for recovering high-frequency details absent from the initial CRM estimate. Extensive experiments across Mediterranean, Adriatic, and Atlantic sea datasets reveal CRITER's superior performance over the state-of-the-art DINCAE2 model. CRITER achieves substantial reconstruction error reductions in both missing and observed regions: $20\%$ and $89\%$ for the Mediterranean, $44\%$ and $80\%$ for the Adriatic, and $1\%$ and $88\%$ for the Atlantic dataset, respectively. These results mark a significant advancement in the field of sparse geophysical data reconstruction.

      Speaker: Matjaž Zupančič Muc (Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia)
    • 35
      A neural network-based observation operator for weather radar data assimilation

      One of the most challenging tasks in Numerical Weather Prediction (NWP) is forecasting convective storms. Data Assimilation (DA) methods improve the initial condition and subsequent forecasts by combining observations and previous model forecast (background). Weather radar provides a dense source of observations in storm monitoring. Therefore, assimilating radar data should significantly improve storm forecasting skills. However, extrapolation of rainfall patterns (nowcasting) from radar data is often better than numerical-model-based forecasting with DA in the first 2 or 3 hours (Fabry and Meunier, 2020). This is related to the fact that the radar data only provides information on the precipitation pattern and intensity in the area affected by the storm. Furthermore, it does not directly provide information on other variables that are strongly linked with the storm, such as temperature, wind, and humidity, either within the precipitation region or in the areas far from the storm.  
      One potential solution to this problem could be to use machine learning (ML) techniques to construct the DA observation operator, generating a model-equivalent of the radar data. In this approach, NWP model fields (temperature, wind components, relative humidity, precipitation) would serve as input and radar observations as output of an encoder-decoder neural network. The constructed observation operator would describe a non-linear relationship between the NWP model's storm-related variables and radar observations, allowing radar information to be spread to other variables and potentially enhancing storm forecasting skill.

      Speaker: Marco Stefanelli (University of Ljubljana)
    • 11:15
      Coffee break and light lunch
    • 36
      Unlocking European-level HPC-support

      The EPICURE project aims to enhance support for European supercomputer users, particularly within the EuroHPC network. It covers several key areas: code enablement, performance analysis, benchmarking, refactoring and optimization. Each aspect involves porting and refining applications to ensure they scale efficiently across larger node counts in high-performance computing environments. By providing specialized Application Support Teams (ASTs), EPICURE focuses on improving the scalability, performance, and optimization of user codes across various EuroHPC systems. The project also plans to create a comprehensive portal for user support and training, fostering collaboration and knowledge-sharing among users from academia and industry. Funded by the EuroHPC Joint Undertaking, EPICURE's goal is to boost European research and innovation through better utilization of high-performance computing resources, which are also used to develop machine learning applications for science and humanities. By using EPICURE resources, we envision collecting and sharing the know-how from the EuroHPC hosting facilities and helping our users achieve their goals.

      Speaker: Žiga Zebec (IZUM)
    • 37
      On the representation landscape of large transformer models

      Large transformers have been successfully applied to self-supervised data analysis across various data types, including protein sequences, images, and text. However, the understanding of their inner workings is still limited. We discuss how, by applying unsupervised learning techniques, we can describe several geometric properties of the representation landscape of these models and how they evolve across their layers. This geometric perspective allows us to point out an explicit strategy to identify the layers that maximize semantic content and to uncover the diverse computational strategies that transformers develop to solve specific tasks. Our findings have several applications, from improving protein homology searches to increasing factual recall in language models, and they offer insight into novel strategies combining in-context learning and fine-tuning to solve question answering tasks.

      References:
      (1) L. Valeriani, D. Doimo, F. Cuturello, A. Laio, A. Ansuini, A. Cazzaniga, "The geometry of hidden representations of large transformer models", Advances in Neural Information Processing Systems 36 (2023)
      (2) F. Ortu, Z. Jin, Diego Doimo, M. Sachan, A. Cazzaniga, B. Schölkopf, "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals", Annual Meeting of the Association for Computational Linguistics 62 (2024)
      (3) D. Doimo, A. Serra, A. Ansuini, A. Cazzaniga, "The Representation Landscape of Few-Shot Learning and Fine-Tuning in Large Language Models", to appear

      Speaker: Alberto Cazzaniga (Area Science Park)
    • 38
      Interpreting Hidden Representations of Transformer Models Through Topological Data Analysis

      This talk addresses the challenge of interpreting the high-dimensional hidden representations in Transformer models, a critical issue given their widespread use in sequential data tasks. We propose using Topological Data Analysis (TDA), a powerful mathematical approach that allows us to understand the shape and structure of complex data. Using TDA, we develop a framework that follows the evolution of representations across layers of the transformer, treating it as a dynamical system that evolves in time. The framework allows us to measure the change in the degree of similarity of relations among representations across the model's depth, providing insights into how the models organize information by moving representations in high-dimensional space.

      Speaker: Matteo Biagetti (Area Science Park)
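      A much-simplified sketch of the idea behind tracking relations among representations across depth: instead of full TDA, the toy below compares pairwise-distance matrices of synthetic "layer" representations via Pearson correlation. All arrays and function names are illustrative assumptions, not part of the speaker's framework.

```python
import numpy as np

def relation_matrix(X):
    """Pairwise Euclidean distances among representations (rows of X)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

def layer_similarity(Xa, Xb):
    """Pearson correlation of the upper-triangular distance entries:
    how similar are the *relations* among points in two layers."""
    Da, Db = relation_matrix(Xa), relation_matrix(Xb)
    iu = np.triu_indices_from(Da, k=1)
    return np.corrcoef(Da[iu], Db[iu])[0, 1]

rng = np.random.default_rng(0)
layers = [rng.normal(size=(50, 16))]        # 50 tokens, 16-dim embedding
for _ in range(5):                          # each "layer" perturbs the previous one
    layers.append(layers[-1] + 0.3 * rng.normal(size=(50, 16)))

# similarity of relations between consecutive layers
sims = [layer_similarity(layers[i], layers[i + 1]) for i in range(5)]
```

A real analysis would replace the Euclidean distance matrices with persistence diagrams or other TDA summaries, but the "compare relational structure layer by layer" loop stays the same.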
    • 39
      ML in personalized medicine
      Speaker: Prof. Berend Snijder (ETH Zurich, Institute of Molecular Systems Biology)
    • Excursion and conference dinner
    • 40
      Computational scientific discovery
      Speaker: Roger Guimera (Universitat Rovira i Virgili, Barcelona)
    • 41
      ML/AI in particle physics
      Speaker: Prof. Tilman Plehn (Heidelberg University)
    • 10:30
      Coffee break
    • 42
      Using machine learning to search for scalar lepton partners at the LHC

      We discuss LHC searches for simplified models in which a singlet Majorana dark matter candidate couples to Standard Model leptons through interactions mediated by scalar lepton partners. We summarize the dark matter production mechanisms in these scenarios, highlighting the parameter space which can both satisfy the relic density and account for muon g-2. We focus on the case of intermediate mass splitting (~30 GeV) between the dark matter and the scalar, for which the LHC has made little improvement over LEP due to large electroweak backgrounds. We find that the use of machine learning techniques can push the LHC, at an integrated luminosity of 300 fb$^{-1}$, well past discovery sensitivity for a benchmark model with a lepton partner mass of ∼110 GeV and towards the exclusion of models with a lepton partner mass as large as ∼160 GeV.

      Speaker: Patrick Stengel (Jozef Stefan Institute)
    • 43
      Probing the nature of axion-like particles

      Axion-like particles (ALPs) are promising candidates from theories beyond the Standard Model, possibly linked to dark matter. When subjected to external magnetic fields, ALPs can convert to photons and vice versa, rendering them observable. The ALP-photon mixing distorts gamma-ray blazar spectra with measurable, albeit tiny, effects. The description of blazar jets varies per target and demands multi-wavelength data, but a global study of the jet in the ALP scenario is still missing. The primary part of my project aims to address this issue by performing a structural study of blazar jets with open-source modeling tools, including polarization measurements. Additionally, we plan to harness the power of supervised Machine Learning (ML) to investigate the interplay between photons and ALPs, leveraging gamma-ray data from the upcoming Cherenkov Telescope Array (CTA) and the Large-Sized Telescope (LST). By training the ML model with a large set of simulated CTA and LST data for a diverse set of magnetic fields and coupling constants, we aim to derive robust constraints on ALP-photon interactions and make substantial advancements in this field.

      Speaker: Pooja Bhattacharjee (University of Nova Gorica)
    • 44
      Leveraging Machine Learning to Detect Dark Matter Subhalos in the Milky Way

      Dark matter remains a crucial missing piece in our understanding of the Universe. Since the late 1970s, the astrophysics community has widely accepted that visible galaxies lie at the centre of large dark matter halos. Significant progress has been made in understanding the halo that hosts our own Milky Way galaxy, including its overall mass and density distribution. However, the halos themselves contain smaller dark matter subhalos, and although the most massive of these subhalos, which host dwarf galaxies, can be observed, the smaller subhalos that do not form stars remain undetected. The detection of these non-luminous, dark subhalos would provide valuable insights into the nature of dark matter. In this talk, we will explore a novel method for finding dark subhalos by detecting stellar wakes: perturbations in the positions and velocities of stars caused by interactions with orbiting dark subhalos. We will discuss the feasibility of using supervised and unsupervised Machine Learning techniques to detect these stellar wakes, given the tremendous increase in high-precision data from current and future astronomical surveys.

      Speaker: María Benito (Tartu observatory, University of Tartu)
    • 45
      ML in biomedical research
      Speaker: Lucia Sacchi (Un. of Pavia)
    • 12:30
      Lunch
    • 46
      How to write grant applications (soft skill lecture)
      Speaker: Paul McGuiness
    • 47
      Machine Learning-Guided Nanobody Discovery for Enhanced Biomarker Research

      The integration of machine learning (ML) with advanced biomarker discovery techniques offers new opportunities for pathology, particularly in personalized medicine. This research will focus on using nanobodies—small, stable, and highly specific single-domain antibodies derived from camelids—as versatile tools for advancing biomarker research. We plan to utilize ML to explore a diverse, naïve nanobody library, with the aim of mapping the "hidden human epitopome"—an unexplored landscape of potential antigens and epitopes.

      To achieve this, we will combine high-throughput sequencing and state-of-the-art ML-driven 3D structural modeling with deep-learning-based molecular docking. This methodology will allow us to systematically predict and validate nanobody interactions with a wide range of human protein structures, ultimately creating a comprehensive catalog of nanobody-binder pairs. We will validate this approach through biophysical and biochemical assays, complemented by structural biology techniques.

      The goal of this research is to uncover novel biomarker-epitope interactions, enhance our understanding of disease mechanisms, and contribute to the development of new diagnostic tools and personalized therapies. By bridging advanced computational methods with the biological potential of nanobodies, we aim to redefine biomarker discovery and provide a robust platform for future therapeutic interventions.

      Speaker: Klara Kropivsek (Laboratory for Environmental and Life Sciences, University of Nova Gorica)
    • 48
      Normalizing flows for evidence estimation

      I introduce floZ, an improved method based on normalizing flows for estimating the Bayesian evidence (and its numerical uncertainty) from samples drawn from the unnormalized posterior distribution. I validate it on distributions whose evidence is known analytically, up to 15 parameter-space dimensions, and demonstrate its accuracy for up to 200 dimensions with $10^5$ posterior samples. I compare it with nested sampling (which computes the evidence as its main target).
      Provided representative samples from the target posterior are available, this method is more robust to posterior distributions with sharp features, especially in higher dimensions. I introduce a convergence test to determine when the normalizing flow has identified the final distribution. Finally, I show the flow's adaptability in the context of transfer learning.

      Speaker: Rahul Srinivasan (SISSA, Italy)
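      A heavily simplified sketch of the evidence-from-samples idea: if a density q fitted to posterior samples were perfect, log p̃(x) − log q(x) would be constant and equal to log Z. Below, a multivariate Gaussian fit stands in for the trained normalizing flow; everything here is an illustrative assumption, not the floZ implementation.

```python
import numpy as np

def estimate_evidence(samples, log_ptilde):
    """Fit a density q to posterior samples, then estimate
    log Z = log p~(x) - log q(x). A Gaussian stands in for the flow."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    d = samples.shape[1]
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = samples - mu
    log_q = -0.5 * (np.einsum("ij,jk,ik->i", diff, inv, diff)
                    + d * np.log(2 * np.pi) + logdet)
    # For a perfectly trained density, log p~ - log q is constant = log Z;
    # the median is robust to samples in poorly fitted regions.
    return np.median(log_ptilde(samples) - log_q)

d = 5
rng = np.random.default_rng(1)
samples = rng.normal(size=(20000, d))               # posterior samples
log_ptilde = lambda x: -0.5 * np.sum(x**2, axis=1)  # unnormalized log-posterior
log_z = estimate_evidence(samples, log_ptilde)      # true log Z = (d/2) log(2*pi)
```

The actual method trains a flow so that q tracks posteriors with sharp, non-Gaussian features; the Gaussian fit here only works because the toy target is itself Gaussian.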
    • 49
      Detecting faint VHE gamma-ray sources using Deep Neural Networks

      The detection of faint γ-ray sources is a long-standing challenge for the very-high-energy astrophysics community. The standard approaches to identifying sources rely on likelihood analyses. However, our limited knowledge of background uncertainties can introduce strong biases in the results and hinder a detection. The field of machine learning (ML) has advanced dramatically over the past decade and its capability has been proven in a multitude of different applications. Previous work (Panes et al. 16) has shown the potential of a Convolutional Neural Network (CNN) pipeline based on U-Net algorithms, called AutoSourceID, to detect point sources in simulated Fermi-LAT data. In this presentation, we will discuss the next natural steps in developing this pipeline, which consist of implementing the detection and characterization of extended and overlapping sources. We will discuss the application of Mask R-CNN (Tabernik & Skocaj 2019) and sparse-shot learning algorithms to the AutoSourceID pipeline to aid the detection of extended and overlapping sources in the context of future Cherenkov Telescope Array Observatory (CTAO) data. This novel ML-based pipeline will serve as a benchmark for future CTAO analyses and, owing to its inherent adaptability, will also help solve other γ-ray puzzles related to faint sources, such as source identification in the Galactic plane or the detection of galaxy clusters.

      Speaker: Judit Pérez Romero (CAC/UNG)
    • 50
      Looking for anomalous variability: wandering unsupervised

      Astronomical datasets include millions, sometimes billions, of records, and to handle such volumes of data astronomers have, over the last 20 years, actively used ML methods for various classification and characterization tasks. However, most of these applications rely on supervised ML, which requires large pre-existing training samples. Obtaining those training samples is a complicated task, and it often introduces poorly accounted-for biases.
      One tantalizing possibility is to use unsupervised ML to develop data-driven classifications and, most interestingly, to search for extremely rare or even previously unseen types of objects and phenomena. While this possibility has been discussed for decades, the achievements in this area are much less prominent than in applications of supervised ML. In this talk, I describe the current state of the field, outline the prospects of using unsupervised ML for anomaly and novelty detection in large-scale astronomical surveys, in particular the upcoming LSST, discuss the problems arising on this path and possible solutions to them, and highlight some related issues that must be addressed by the astronomical community as a whole.

      Speaker: Oleksandra Razim (Ruder Boscovic Institute)
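      As a minimal illustration of the unsupervised anomaly-detection idea (not any specific survey pipeline), the sketch below scores objects by their mean distance to their k nearest neighbours, so injected outliers are recovered without any training labels. All data and names are synthetic.

```python
import numpy as np

def knn_anomaly_score(X, k=10):
    """Mean distance to the k nearest neighbours; larger = more anomalous."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)             # exclude self-distances
    part = np.partition(np.sqrt(d2), k, axis=1)[:, :k]
    return part.mean(axis=1)

rng = np.random.default_rng(2)
normal = rng.normal(0.0, 1.0, size=(500, 4))   # bulk of the "survey" objects
anomalies = rng.normal(8.0, 0.5, size=(5, 4))  # rare, unusual objects
X = np.vstack([normal, anomalies])

scores = knn_anomaly_score(X)
flagged = np.argsort(scores)[-5:]              # five most anomalous objects
```

Real surveys would use features extracted from light curves or spectra and scalable neighbour searches, but the labels-free "score and rank" logic is the same.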
    • 16:00
      Coffee break
    • 51
      ML and survey science in astro
      Speaker: Mario Jurić (Un. of Washington)
    • 52
      From Blurry to Brilliant: How Generative AI is Transforming the Way We See Images
      Speaker: Nikki Martinel (Un. of Udine)
    • 53
      PCA analysis for absorption measurement correction schemes – A test study

      Light-absorbing carbonaceous (LAC) aerosols contribute positive forcing to the Earth's radiative budget, which results in atmospheric warming. To determine the actual contribution of LAC aerosols, measurements from across the globe are incorporated into climate models. The most widely used approach for this measurement is filter photometers (FPs), which measure the attenuation of light through a filter on which aerosols are deposited. Of all the possible in-situ surface measurements, the most widely deployed instrument across measurement networks is the Aethalometer AE33 (Aerosol d.o.o.; Drinovec et al., 2015). More sophisticated FP instruments, such as the Multi-Angle Absorption Photometer (MAAP, Thermo Scientific Inc.; Petzold and Schönlinner, 2004), are also deployed at some stations, acting as pseudo-reference instruments.

      The use of FPs results in several artifacts, with cross-sensitivity to scattering being the most important at high single scattering albedo, where the error can exceed 100%. This cross-sensitivity to scattering has been found to have site-to-site variability (e.g., Bernardoni et al., 2021; Yus-Díez et al., 2021). A compensation scheme was proposed in Yus-Díez et al. (2021); however, it requires scattering coefficient measurements, which are less common across networks, as well as a direct measurement of the aerosol absorption coefficient to serve as the reference for the AE33 measurement. In Yus-Díez et al. (2021), a MAAP and a laboratory-based FP, the PP_UniMi, were used as reference instruments; however, these are also subject to the same artifacts, although to a lesser degree.

      Here we present measurements performed during the AGORA 2023 summer campaign in Granada, using the AE33 and MAAP filter photometers alongside a novel reference absorption instrument, the dual-wavelength photo-thermal interferometer (PTAAM-2λ, Haze Instruments d.o.o.; Drinovec et al., 2022). In addition, we measured scattering coefficients, particle size distributions and other variables that describe the properties of the aerosol particles.

      Here we present the results of applying the correction scheme described in Yus-Díez et al. (2021) using the PTAAM-2λ as reference, which results in a perfect correction. However, the required instrumentation for this correction is not available at most stations where filter-photometer measurements are taken. Since we cannot rely on scattering measurements and/or deploy a PTAAM-2λ everywhere, we resort to Machine Learning (ML) models.

      As a proof of concept, we applied a gradient boosting regressor model, splitting the data into 80% for training and 20% for prediction. We obtained an excellent correlation (slope = 0.98, R² = 0.96) between the MAAP measurements processed with the ML algorithm and the reference PTAAM's absorption coefficient (interpolated to the MAAP wavelength). As future work, we plan to further improve the model while also extending it to AE33 measurements.

      References:
      - Petzold, A. and Schönlinner, M.: The Multi-angle Absorption Photometer – a new method for the measurement of aerosol light absorption and atmospheric black carbon, Journal of Aerosol Science, 35, 421–441, 2004
      - Drinovec, L., Močnik, G., Zotter, P., Prévôt, A. S. H., Ruckstuhl, C., Coz, E., Rupakheti, M., Sciare, J., Müller, T., Wiedensohler, A., and Hansen, A. D. A.: The "dual-spot" Aethalometer: an improved measurement of aerosol black carbon with real-time loading compensation, Atmospheric Measurement Techniques, 8, 1965–1979, https://doi.org/10.5194/amt-8-1965-2015, 2015
      - Bernardoni, V., Ferrero, L., Bolzacchini, E., Forello, A. C., Gregorič, A., Massabò, D., Močnik, G., Prati, P., Rigler, M., Santagostini, L., Soldan, F., Valentini, S., Valli, G., and Vecchi, R.: Determination of Aethalometer multiple-scattering enhancement parameters and impact on source apportionment during the winter 2017/18 EMEP/ACTRIS/COLOSSAL campaign in Milan, Atmos. Meas. Tech., 14, 2919–2940, https://doi.org/10.5194/amt-14-2919-2021, 2021
      - Yus-Díez, J., Bernardoni, V., Močnik, G., Alastuey, A., Ciniglia, D., Ivančič, M., Querol, X., Perez, N., Reche, C., Rigler, M., Vecchi, R., Valentini, S., and Pandolfi, M.: Determination of the multiple-scattering correction factor and its cross-sensitivity to scattering and wavelength dependence for different AE33 Aethalometer filter tapes: a multi-instrumental approach, Atmos. Meas. Tech., https://doi.org/10.5194/amt-2021-46, 2021
      - Drinovec, L., et al.: A dual-wavelength photothermal aerosol absorption monitor: design, calibration and performance, Atmospheric Measurement Techniques, 15, 3805–3825, 2022

      Speaker: JESUS YUS DIEZ (Univerza v Novi Gorici)
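      A hedged sketch of the proof-of-concept described above, with entirely synthetic data standing in for the campaign measurements: a gradient boosting regressor is trained on an 80/20 split to map filter-photometer absorption plus scattering-related covariates to a toy "reference" absorption. The variable names and the toy relation are assumptions, not the authors' dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
n = 2000
fp_abs = rng.uniform(1.0, 50.0, n)     # filter-photometer absorption (toy units)
scat = rng.uniform(10.0, 300.0, n)     # scattering coefficient
ssa = scat / (scat + fp_abs)           # single scattering albedo
# toy "truth": the FP overestimates absorption more at high SSA
ref_abs = fp_abs * (1.0 - 0.4 * ssa) + rng.normal(0.0, 0.5, n)

X = np.column_stack([fp_abs, scat, ssa])
X_tr, X_te, y_tr, y_te = train_test_split(X, ref_abs, test_size=0.2,
                                          random_state=0)  # 80/20 split

model = GradientBoostingRegressor(n_estimators=300, max_depth=3, random_state=0)
model.fit(X_tr, y_tr)
r2 = r2_score(y_te, model.predict(X_te))  # agreement with the "reference"
```

In the real study the targets are PTAAM-2λ absorption coefficients and the features come from the co-located instruments; the point of the sketch is only the train/predict workflow.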
    • 54
      Towards an Automatic Source Detection Pipeline in the Galactic Plane Survey by CTA using Deep Learning

      The Galactic Plane Survey (GPS) proposed by CTAO is one of its key science projects; it will cover an energy range from ~30 GeV to ~100 TeV with unprecedented sensitivity, leading to an increase in the known gamma-ray source population by a factor of five.
      Here we tested our deep-learning-based automatic source detection techniques and compared them with traditional likelihood detection methods. For the simulation, we considered the inner Galactic region $|b| \leq 6$, different energy bins with the relevant IRFs as implemented in Gammapy (open-source software developed by CTAO), and all sources as point-like with the same extension. Our automatic source detection and localization pipeline (ASID), based on a U-shaped network (U-Net) and the Laplacian of Gaussian (LoG), has also been tested on Fermi-LAT data and optical data (from the MeerLICHT telescope).
      We show that with our pipeline and log-scaled counts maps we could achieve twice the sensitivity expected from the GPS using likelihood methods, especially for identifying fainter sources. We also show the potential to detect DM subhalos (as point sources) using our method, achieving a lower limit on the self-annihilation thermal cross-section of $\langle \sigma v \rangle = 2.4 \times 10^{-23}$ cm$^3$ s$^{-1}$.
      We are also currently exploring diffusion-based methods to remove the background from the generated counts maps to further increase the detection and localization efficiency.

      Speaker: Zoja Rokavec
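      A toy illustration of the Laplacian of Gaussian (LoG) step used for blob detection in pipelines of this kind, on a synthetic Poisson counts map with one injected point source. This is not the ASID code, and the map parameters are invented.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

rng = np.random.default_rng(4)
counts = rng.poisson(2.0, size=(64, 64)).astype(float)  # flat background
counts[40, 25] += 200.0                                 # injected point source
counts = gaussian_filter(counts, sigma=1.0)             # toy instrument PSF

# Laplacian of Gaussian: strong negative response at blob centres
log_map = gaussian_laplace(counts, sigma=2.0)
y, x = np.unravel_index(np.argmin(log_map), log_map.shape)
```

In the full pipeline the LoG operates on U-Net output maps rather than raw counts, and detected blobs are then characterized; here the LoG alone already localizes the bright source.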
    • 55
      Multi-Messenger and Multi-Wavelength emission from Galaxy Clusters hosting AGNs using Machine Learning Techniques

      This work is dedicated to developing the most detailed and comprehensive numerical framework to date, combining magnetohydrodynamics (MHD) and Monte Carlo simulations to derive the multi-wavelength (MWL) and multi-messenger spectra from the magnetized environment of galaxy clusters. Special attention will be given to Perseus-like clusters hosting active galactic nuclei (AGNs). We will study the propagation of cosmic rays (CRs) in the intracluster medium (ICM) through multi-dimensional Monte Carlo simulations, considering all relevant interaction processes. Machine learning and data analysis techniques are an essential part of this research, ensuring the robustness and timely production of research products. This framework will make it possible to appropriately interpret the observed emission from clusters and to provide source-distribution constraints for observational facilities such as CTA, HAWC, LHAASO, IceCube-Gen2, TA, and the Pierre Auger Observatory.

      Speaker: Saqib Hussain (University of Nova Gorica)
    • 56
      Inverse systems approach to design Secure Random Communications Systems

      This contribution presents an enhanced approach to secure communications that uses multiple inverse systems to design alpha-stable-noise-based random communication systems (RCSs). The method incorporates multiple inverse systems to transform the encoded alpha-stable noise signals on the transmitter side, with corresponding inverse systems on the receiver side decoding the signals received from the AWGN channel. This modification improves the RCS's efficiency with respect to Bit Error Rate (BER) and covertness. By distributing the encoding and decoding processes across multiple systems, together with the other inherent benefits of alpha-stable noise, the proposed RCS increases the complexity for potential eavesdroppers, making it extremely difficult to intercept. Simulation results demonstrate that the use of multiple inverse systems provides superior BER performance compared to previous models, while the covertness analysis indicates a marked improvement in the security of the system. These findings suggest that the multi-inverse-system design holds promise for enhancing the physical-layer security of next-generation communication systems.

      Speaker: Dr Areeb Ahmed (University of Ljubljana)
    • 10:45
      Coffee break
    • 57
      Social media for scientists (soft-skill lecture)
      Speaker: Laura Busato (SISSA Medialab)
    • 58
      Q&A with the SMASH PI for the Fellows
    • 13:15
      Lunch