The Archive

The Murchison Widefield Array (MWA) Data Archive consists of dataflow and storage sub-systems distributed across three tiers. At its core is the open source Next-Generation Archive System (NGAS), which was initially developed by Andreas Wicenec and his colleagues at the European Southern Observatory (ESO). To meet the MWA data challenge, the MWA Archive team has tailored and optimised NGAS to achieve high-throughput data ingestion, efficient dataflow management, cost-effective data storage and access, and processing-aware data migration.

Image: The MWA Dataflow consists of three tiers: Tier 0 is where data are produced, Tier 1 is where data are archived, and Tier 2 is where data are distributed.

Tier 0, co-located with the telescope at the Murchison Radio-telescope Observatory (MRO), consists of Online Processing and the Online Archive. Radio frequency voltage samples collected from each tile are transmitted to the receiver, which streams digitised signals at an aggregate rate of 320 Gbps to Online Processing. Online Processing includes an FPGA-enabled Polyphase Filter Bank (PFB) and a GPU-enabled software Correlator. The Correlator outputs in-memory “visibilities” that are immediately ingested by the DataCapture sub-system. DataCapture produces memory-resident data files and uses an NGAS client to push them to the Online Archive, which is managed by the NGAS server. The Online Archive provides several days' worth of storage buffer, which allows the telescope to continue observing even if there are unexpected delays or issues in transporting the data down to Tier 1.
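To give a feel for what "several days' worth" of buffer means in storage terms, the following minimal Python sketch converts a sustained data rate into a buffer size. It uses the Tier 0 output rates quoted elsewhere on this page (3.2 to 8 Gbps). The three-day buffer length and the function name `buffer_terabytes` are assumptions made for illustration only, not the actual Online Archive specification.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def buffer_terabytes(rate_gbps: float, days: float) -> float:
    """Storage in decimal terabytes needed to hold `days` of data
    arriving continuously at `rate_gbps` gigabits per second."""
    bytes_per_second = rate_gbps * 1e9 / 8  # gigabits/s -> bytes/s
    return bytes_per_second * days * SECONDS_PER_DAY / 1e12


if __name__ == "__main__":
    # Assumed buffer length: 3 days (illustrative, not a published figure).
    for rate in (3.2, 8.0):
        size = buffer_terabytes(rate, days=3)
        print(f"{rate} Gbps sustained for 3 days: {size:.1f} TB")
```

Under these assumptions, three days of continuous output corresponds to roughly 104 TB at 3.2 Gbps and 259 TB at 8 Gbps. The real requirement also depends on how much of the time the telescope is actually observing.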

At Tier 1 (Perth, a city 700 km south of the MRO), the Long-term Archive (LTA) periodically ingests the visibility data stream from the Online Archive (OA) via a 10 Gbps fibre optic link (the shaded arrow from OA to LTA), which is part of the Australian high-speed National Broadband Network (NBN). The data rates from Tier 0 are between 3.2 and 8 Gbps, depending on the telescope configuration and observing mode. The dotted arrow between OA and LTA represents the transfer of metadata on instruments and observations, together with monitor and control information. The current LTA storage facility, the Pawsey Hierarchical Storage Management (HSM) system, is a combination of magnetic disks and tape libraries provided by the Pawsey Supercomputing Centre, whose mission is to foster scientific and technological innovation through the provision of supercomputing and eResearch services. The MWA LTA has been accumulating raw science data since mid-2013, growing by between 3 PB and 8 PB per year. As at the end of 2018, the LTA held over 28 PB of data.
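The quoted data rates and the quoted archive growth can be related with some simple arithmetic. The Python sketch below computes the volume a sustained rate would produce in a year and, from that, the fraction of the year the telescope would need to be producing data to match a given growth figure. Pairing the low rate with the low growth figure (and high with high) is an assumption made here, and the calculation ignores compression, deletion and any averaging or reprocessing, so the result is only a rough consistency check. The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def petabytes_per_year(rate_gbps: float, duty_fraction: float = 1.0) -> float:
    """Decimal petabytes produced in one year at `rate_gbps`, if data
    flow for `duty_fraction` of the time (1.0 means continuously)."""
    bytes_per_second = rate_gbps * 1e9 / 8  # gigabits/s -> bytes/s
    return bytes_per_second * SECONDS_PER_YEAR * duty_fraction / 1e15


def implied_duty_fraction(growth_pb_per_year: float, rate_gbps: float) -> float:
    """Fraction of the year at `rate_gbps` needed to reach the given growth."""
    return growth_pb_per_year / petabytes_per_year(rate_gbps)


if __name__ == "__main__":
    # Assumed pairing: (rate in Gbps, archive growth in PB/year).
    for rate, growth in ((3.2, 3.0), (8.0, 8.0)):
        full = petabytes_per_year(rate)
        duty = implied_duty_fraction(growth, rate)
        print(f"{rate} Gbps continuous: {full:.1f} PB/yr; "
              f"{growth} PB/yr implies about {duty:.0%} of the year")
```

A continuous 3.2 Gbps stream would amount to about 12.6 PB per year and a continuous 8 Gbps stream to about 31.5 PB per year, so growth of 3 to 8 PB per year corresponds to data flowing for roughly a quarter of the time under these assumptions.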

Researchers from around the world access the MWA archive via the MWA All-Sky Virtual Observatory (MWA ASVO) data portal. The MWA ASVO allows users to download raw visibility data in native MWA correlator format, or calibrated visibility data in standard radio astronomy formats, either through the web portal or via a Python API and command-line client. The MWA ASVO provides the data that researchers use in offline processing such as calibration and imaging. Science teams process the data on HPC facilities around the world, including Galaxy, Magnus and Zeus at the Pawsey Supercomputing Centre.

When large volumes of data are required by member institutions, data (and metadata) can be selectively mirrored by establishing Mirrored Archives (MAs) at Tier 2. Tier 2 comprises mirrored archive facilities that subscribe to specific data products within Tier 1 and regularly ingest updated data streams of the relevant data types. Previously, Tier 2 data archive facilities operated at the Massachusetts Institute of Technology (MIT), USA; the Victoria University of Wellington (VUW), New Zealand; the Raman Research Institute (RRI), India; and the University of Melbourne, Australia. Data in transit from Tier 1 to Tier 2 was carried by the Australian Academic and Research Network (AARNet) across the Pacific Ocean. These MAs can host a subset of the data products originally available in the LTA and provide processing capabilities for local scientists to reduce and analyse data relevant to their research projects. While the LTA in Tier 1 can periodically push data to MAs in Tier 2 in an automated fashion, users can also schedule ad hoc data transfers from the LTA to a Tier 2 machine via User Interfaces (UIs). Web interfaces and Python APIs are also available for MWA scientists to either synchronously retrieve or asynchronously receive raw visibility data.
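The subscription idea described above, in which each mirrored archive receives only the data products it has asked for and only those it has not already received, can be illustrated with a small Python sketch. This is not the NGAS API and does not reflect how the MWA archive is actually implemented. Every class, field and label below (`DataProduct`, `MirrorSubscription`, `push_new_products`, the data type strings and project codes) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    file_id: str
    data_type: str  # hypothetical label, e.g. "visibility" or "metadata"
    project: str    # hypothetical project code


@dataclass
class MirrorSubscription:
    mirror_name: str
    data_types: frozenset
    projects: frozenset = frozenset()  # empty means "any project"
    delivered: set = field(default_factory=set)  # file_ids already pushed

    def wants(self, product: DataProduct) -> bool:
        """True if the product matches the subscription and is still undelivered."""
        if product.file_id in self.delivered:
            return False
        if product.data_type not in self.data_types:
            return False
        return not self.projects or product.project in self.projects


def push_new_products(archive, subscriptions):
    """Run one push cycle and return {mirror_name: [file_id, ...]}."""
    plan = {}
    for sub in subscriptions:
        selected = [p.file_id for p in archive if sub.wants(p)]
        sub.delivered.update(selected)  # remember what has been sent
        plan[sub.mirror_name] = selected
    return plan


if __name__ == "__main__":
    archive = [
        DataProduct("f1", "visibility", "P001"),
        DataProduct("f2", "metadata", "P001"),
        DataProduct("f3", "visibility", "P002"),
    ]
    mirrors = [
        MirrorSubscription("mirror_a", frozenset({"visibility"})),
        MirrorSubscription("mirror_b", frozenset({"visibility", "metadata"}),
                           frozenset({"P001"})),
    ]
    print(push_new_products(archive, mirrors))
    print(push_new_products(archive, mirrors))  # nothing new on a second cycle
```

Tracking delivered file identifiers per subscription keeps repeated push cycles from resending the same data, which is the property that makes a periodic, automated push practical in this toy model.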