As enthusiasm for and the success of the Internet of Things (IoT), Cyber-Physical Systems (CPS), and Smart Buildings grow, so too do the volume and variety of data collected by these systems. How do we ensure that this data is of high quality, and how do we maximize the utility of collected data such that many projects can benefit from the time, cost, and effort of deployments? With the development of large AI models such as Large Language Models (LLMs), how can we integrate cyber-physical data with these powerful tools? Large AI models, including recent varieties based on the transformer architecture, may assist in the acquisition, analysis, manipulation, and consumption of data.

The DATA: Data Acquisition To Analysis in the Era of AI workshop aims to look broadly at interesting data from interesting sensing systems and/or how such data can be adapted to large models. The workshop considers problems, solutions, and results from all across the real-world data pipeline. We solicit submissions on unexpected challenges and solutions in the collection of datasets, on novel datasets of interest to the community, on experiences and results (explicitly including negative results) in using prior datasets to develop new insights, and on discussions of impact and newfound opportunities with large AI models.

LLMs could enhance data quality through sophisticated data cleaning, preprocessing, and augmentation techniques. LLMs can facilitate analysis of data streams while identifying anomalies, inconsistencies, and potential biases. Generative AI can also create synthetic datasets that maintain the essential characteristics of real-world data while expanding the available training samples. This may be valuable when real data is challenging to collect or share due to privacy concerns or logistical constraints. Transformer models can integrate multi-modal data, such as blending textual inputs from sensor logs with quantitative data from measurements. This new flavor of AI-driven analysis can factor in more contextual information, opening new areas of research in enhancing the predictive and diagnostic capabilities of data-driven AI systems deployed in smart environments.

Furthermore, new areas of future work may emerge from exploring the ethical implications of deploying LLMs within these domains—ensuring that the benefits of AI are equitably distributed while safeguarding user privacy. The workshop's focus on privacy challenges and solutions becomes increasingly relevant in the era of AI, where the capacity to analyze vast amounts of sensitive data poses significant risks.

The workshop aims to bring together a community of application researchers and algorithm researchers in the sensing systems and building domains to promote breakthroughs by connecting the generators and users of datasets. The workshop will foster cross-domain understanding of both application needs and data collection limitations.

CALL FOR PAPERS

The workshop seeks contributions across two major thrusts, but is open to a broad view of interesting questions around the collection, dissemination, and use of data as well as interesting datasets:

The collection, evaluation, analysis, and use of data

- Challenges and solutions in data collection, especially around security and privacy
- Challenges and solutions in hardware/system design of data collection devices
- Expectations and norms for data collection from sensor networks, especially those that involve human factors
- Novel insights from existing datasets
- Metadata management for complex datasets
- Synthetic data, including its generation, application, and utility
- Success stories - key properties of useful datasets and how to generalize them
- Preprocessing, cleaning, and fusing datasets
- Preliminary analysis and visualization of the data
- Shortcomings of prior datasets - and how to address them in the future
- Position papers on policies and norms from experimental design through data management and use are explicitly welcomed
- Role of cyber-physical or similar data and metadata for informing training and inference of large AI models and applications, such as large language models (LLMs)
- Insights on generative AI to synthesize data
- Usage of multi-modal data within a single AI model
- Pitfalls of using AI models with cyber-physical or similar data and metadata
- Potential applications of large AI models within the cyber-physical space
- Challenges and solutions in privacy protection with large AI models

New and interesting datasets, including but not limited to:

- Shopping-related sensing data
- Animal-related data or sensed data
- Anonymized or synthetic health-related data
- Indoor localization, especially unprocessed/unfiltered physical layer measurements
- Smart building, occupancy, motion data, energy, human comfort, vibration, BIM
- Vehicular, GPS, cellular, or Wi-Fi traces and remote sensing
- Reproductions of prior work that validate, refute, or enhance results
- Anonymized contact tracing, interaction and exposure notification data

To enable the longevity and continued utility of submitted datasets, all datasets must be uploaded to a permanent data repository such as Zenodo or CRAWDAD as part of the camera-ready preparation. Submissions may refer to datasets hosted on personal or temporary hosting, but this hosting must be made permanent by the time of publication.

Submission Format

Submissions may range from 1 to 5 pages in PDF format, excluding references, using the standard ACM conference template. DATA 2024 follows a single-blind review policy: the names and affiliations of all authors must be present in the submitted manuscript. Submissions are strongly encouraged to use only as much space as needed to clearly convey the significance of the work. We fully expect many submissions, especially datasets, to use only 1-2 pages, but wish to allow those interested in fully elucidating positions on data collection and use, or insights from reproducibility efforts, ample space to do so.

Submission Site

HotCRP link

Important Dates (UTC-12)

Workshop Paper Due: ~~September 15, 2024, AoE~~ Extended: September 20, 2024, AoE

Workshop Paper Notification: September 27, 2024, AoE

Workshop Paper Camera Ready: October 4, 2024, AoE

Workshop Day: November 6, 2024

ORGANIZATION

Co-Chairs & TPC Chairs

Gabe Fierro, Colorado School of Mines

Shiwei Fang, Augusta University

Steering Committee

Jie Gao, Stony Brook University

Pei Zhang, University of Michigan, Ann Arbor

Flora Salim, University of New South Wales

Mikkel Baun Kjærgaard, University of Southern Denmark

Shijia Pan, University of California, Merced

Pat Pannuto, University of California, San Diego

Prabal Dutta, University of California, Berkeley

Jie Liu, Harbin Institute of Technology

Chien-Chun Ni, Yahoo! Research

Haeyoung Noh, Stanford University

Web Chair

Mete Saka, Colorado School of Mines

Technical Program Committee

Wan Du, University of California, Merced

Andreas Reinhardt, Technical University of Clausthal

Zhengxiong Li, University of Colorado, Denver

Luca Davoli, University of Parma

Fatima M. Anwar, University of Massachusetts Amherst

Keyang Yu, Marquette University

Zi Wang, Augusta University

Yasra Chando, University of Massachusetts Amherst

Artifact Evaluation Committee

Hui Wei, University of Massachusetts Amherst

Mohammad Rastikerdar, University of Massachusetts Amherst

Zihao Mo, Augusta University

Su Wang, Colorado School of Mines

Xiaoguang Guo, Colorado School of Mines

THE VENUE

The Dragon Hotel, Hangzhou, China

The 7th DATA workshop is co-located with SenSys 2024.

For venue details, visa information, and more, please visit the SenSys venue page.