November 19, 2025 • Golden, Colorado, USA

Eighth International BuildSys Workshop on DataFM

Data Acquisition & Analysis with Foundational Models

Photo courtesy of the Colorado Tourism Office

Program

*All times are local (Golden, Colorado, USA; UTC-7)

8:30 – 9:15

Opening Remarks & Talk

(Jointly with FMSust 2025)

Opportunities and Challenges with Foundation Models for Smart Environments and Energy Sustainability
Pandarasamy Arjunan (Indian Institute of Science)

9:30 – 10:30

Technical Session 1

LLM-Powered Data Annotation for Bridging the Semantic Gap in Air Quality Monitoring
Ragini Gupta, Abbas Ali Mirza (University of Illinois at Urbana-Champaign); Claudiu Danilov, Josh Eckhardt, Keyshla Bernard (The Boeing Company, USA); Klara Nahrstedt (University of Illinois Urbana-Champaign)
No One-Model-Fits-All: Uncovering Spatio-Temporal Forecasting Trade-offs with Graph Neural Networks and Foundation Models
Ragini Gupta, Naman Raina, Bo Chen (University of Illinois at Urbana-Champaign); Li Chen (University of Louisiana at Lafayette); Claudiu Danilov, Josh Eckhardt, Keyshla Bernard (The Boeing Company, USA); Klara Nahrstedt (University of Illinois Urbana-Champaign)
A Multi-Temporal LiDAR-Derived Geospatial Dataset for Coastal Hazard Assessment
Jiyue Zhao (University of Georgia); Zi Wang (Augusta University)

10:45 – 11:30

Technical Session 2 (Datasets)

Dataset: Device Activity Report with Complete Knowledge (DARCK) for NILM
Justus Breyer, Kai Gützlaff, Leonardo Pompe, Klaus Wehrle (RWTH Aachen University)
Dataset: Long-term LoRaWAN Communication Metadata from an Urban Deployment
Fateme Nikseresht, Victor Ariel Leal Sobral, Jonathan L. Goodall, Bradford Campbell (University of Virginia)
Dataset: A Novel Aliasing vs Non Aliasing Audio Dataset for Always-On IoT Microphone Experimentation
Jack Adiletta, Khan Mohammad Nur Hossain (Worcester Polytechnic Institute); Matthew Reynolds (Columbia University); Shiwei Fang (Augusta University); Bashima Islam (Worcester Polytechnic Institute)

11:30 – 11:40

Closing Remarks

About the Workshop

As enthusiasm for and success in the Internet of Things (IoT), Cyber-Physical Systems (CPS), and Smart Buildings continue to grow, so too do the volume and variety of data generated by these systems. This raises important questions: How can we ensure high-quality data collection? And how can we maximize the utility of this data so that multiple projects can benefit from the time, cost, and effort invested in deployments?

With the rise of Foundational Models—particularly Large Language Models (LLMs)—we now have new tools that can potentially transform how we work with cyber-physical data. Yet, real-world data presents notable challenges, including diverse modalities, limited dataset sizes, and unstructured formats. Recent advances in large AI models, especially those based on transformer architectures, offer promise for improving how data is acquired, analyzed, manipulated, and consumed.

The DataFM: Data Acquisition & Analysis with Foundational Models workshop looks broadly at interesting data from interesting sensing systems and at how such data can be adapted to Foundational Models. The workshop considers problems, solutions, and results from across the real-world data pipeline. We solicit submissions on unexpected challenges and solutions in the collection of datasets; on novel datasets of interest to the community; on experiences and results, explicitly including negative results, in using prior datasets to develop new insights; and on discussions of impact and newfound opportunities with large AI foundational models.

Foundational Models could enhance data quality through sophisticated data cleaning, preprocessing, and augmentation techniques. They can also facilitate the analysis of data streams while identifying anomalies, inconsistencies, and potential biases. Generative AI can create synthetic datasets that maintain the essential characteristics of real-world data while expanding the available training samples. This may be valuable when real data is challenging to obtain due to privacy concerns or logistical constraints. Transformer models can integrate multi-modal data, such as blending textual inputs from sensor logs with quantitative data from measurements. This new flavor of AI-driven analysis can factor in more contextual information, opening new areas of research in enhancing the predictive and diagnostic capabilities of data-driven AI systems deployed in smart environments.

Furthermore, new areas of future work may emerge from exploring the ethical implications of deploying Foundational Models within these domains, ensuring that the benefits of AI are equitably distributed while safeguarding user privacy. The workshop's focus on privacy challenges and solutions becomes increasingly relevant in the era of AI, where the capacity to analyze vast amounts of sensitive data poses significant risks.

The workshop aims to bring together application researchers and algorithm researchers in the sensing systems and building domains, promoting breakthroughs by connecting the generators and users of datasets. The workshop will foster cross-domain understanding of both application needs and data collection limitations.

Call for Papers

The workshop seeks contributions across two major thrusts, but is open to a broad view of interesting questions around the collection, dissemination, and use of data as well as interesting datasets:

The collection, evaluation, analysis, and use of data

•Role of cyber-physical or similar data and metadata for informing training and inference of foundational models and applications, such as LLMs
•Insights on generative AI to synthesize data
•Usage of multi-modal data within a single AI model
•Pitfalls of AI models with cyber-physical or similar data and metadata
•Potential applications of large AI models within cyber-physical space
•Challenges and solutions in privacy protection with large AI models
•Cyber-physical data embedding techniques for existing foundational models
•Challenges and solutions in data collection, especially around security and privacy
•Challenges and solutions in hardware/system design of data collection devices
•Expectations and norms for data collection from sensor networks, especially those that involve human factors
•Novel insights from existing datasets
•Metadata management for complex datasets
•Synthetic data, including its generation, application, and utility
•Success stories, key properties of useful datasets and how to generalize these
•Preprocessing, cleaning, and fusing datasets
•Preliminary analysis and visualization of the data
•Shortcomings of prior datasets, and how to address these in the future
•Position papers on policies and norms from experimental design through data management and use are explicitly welcomed

New and interesting datasets, including but not limited to:

•Smart building, occupancy, motion data, energy, human comfort, vibration, BIM
•Indoor localization, especially unprocessed/unfiltered physical layer measurements
•Shopping-related sensing data
•Animal-related data or sensed data
•Anonymized or synthetic health-related data
•Anonymized human-centric interaction and physiological data from applications such as Extended Reality
•Vehicular, GPS, cellular, or Wi-Fi traces and remote sensing
•Reproductions of prior work that validate, refute, or enhance results
•Anonymized contact tracing, interaction, and exposure notification data

To enable the longevity of submitted datasets, we plan to provide a central repository where the data and information about the data can be archived for at least 5 years.

Submission

Important: Each accepted submission is required to have at least one author attend the workshop and present to the workshop attendees.

Full Papers

Submissions may range from 2 to 5 pages in PDF format, excluding references, using the standard ACM conference template. Submissions are strongly encouraged to use only as much space as needed to clearly convey the ideas, contributions, and significance of the work.

Dataset Papers

Dataset submissions should prefix paper titles with "Dataset:" and must include a description of the dataset as well as a reasonable accompanying data sample. Once accepted, a fully described dataset must be shared in a public repository by the camera-ready deadline.

Issues on licenses will be resolved following procedures similar to CRAWDAD.

Dataset Evaluation Requirements

Datasets will be reviewed by an artifact evaluation committee. To support this, dataset submissions must include:

•Dataset Link: Full dataset access (not just samples)
•Example Analysis: Demonstrate potential insights from the data
•Usage Steps: Code samples, videos, or demonstrations

Important Dates

Paper Submission Due:

September 30, 2025, 11:59 PM AoE (extended from September 22, 2025)

(Firm, no further extensions will be granted)

Notification:

October 10, 2025

Camera Ready:

October 17, 2025

Workshop Day: November 19, 2025

Submission Link

Submit your papers through the conference submission system.

Organization

Co-Chairs & TPC Chairs

Shiwei Fang

Augusta University

Yasra Chandio

UMass Amherst

Gabe Fierro

Colorado School of Mines

Web Chair

Salil Verma