MWA SOFTWARE

Monitor and control of a remote telescope

The Monitor and Control (M&C) systems are in place to control the primary functions of the MWA remotely. These tasks are divided into broad areas:

  • Telescope configuration: The physical connectivity of the entire system, and a user interface for managing the contents as hardware is changed in the field. For each of the 128 physical tiles on the ground, we must store data about that tile, such as its coordinates, beamformer ID and gain, cable type and length to the receiver, and which input on which receiver it is connected to. We must also record which fibres on each receiver are connected to which boards and inputs on the fine PFB, and how the outputs of the fine PFB are mapped to the inputs on the Voltage Capture System. The configuration is stored in tables in a PostgreSQL database. We store the configuration of the telescope at all times in the past, as well as the current configuration.
  • Telescope state: The state of the telescope is defined as the set of values for all software-selectable parameters that can be changed by the user for scientific or engineering test observations. These are stored in a set of database tables that define the desired state of the telescope at all times (past and future) – the telescope schedule. The status monitoring system records the actual (as opposed to desired) observing state of the telescope. If there was an error setting the state of some specific element, the difference between the actual and desired states can be dealt with when analysing the recorded data.
  • Telescope telemetry: Sensor data, to be used for immediate response to faults (e.g., overheating) as well as being archived for later use as required when performing in-depth data analysis (e.g., the effect of beamformer temperature on system gain). This includes data from almost 800 temperature, voltage, current, and humidity sensors distributed through the equipment.

In the above descriptions, and throughout the rest of this page, the term ‘configuration’ refers to physical layout and connectivity – what is plugged into where. The configuration database tables are updated manually using a web interface, to reflect changes on site (which happen very rarely). The term ‘state’ refers to the settings that can vary from one observation to the next (frequency, attenuation, pointing, data capture modes). Future state descriptions (‘observations’) are added to the schedule tables using a command line tool. A few seconds before the start of each observation, the new desired state is sent to all the hardware components by the ‘schedule controllers’ (programs which run continuously, monitoring the schedule tables).
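
As an illustration of how a desired state might be looked up for any instant, here is a minimal sketch. The table layout, column names and values are hypothetical, and SQLite stands in for the PostgreSQL database only so that the example is self-contained.

```python
import sqlite3

# Hypothetical schedule table: each row is one observation (a desired state),
# valid from start_time up to, but not including, stop_time (in seconds).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schedule (start_time INTEGER, stop_time INTEGER, "
    "freq_channel INTEGER, attenuation_db INTEGER, pointing TEXT)"
)
conn.executemany(
    "INSERT INTO schedule VALUES (?, ?, ?, ?, ?)",
    [
        (1000, 1120, 121, 10, "zenith"),
        (1120, 1240, 145, 12, "sweet-spot-3"),
    ],
)


def desired_state(conn, when):
    """Return the scheduled state covering time `when`, or None if idle."""
    row = conn.execute(
        "SELECT freq_channel, attenuation_db, pointing FROM schedule "
        "WHERE start_time <= ? AND ? < stop_time",
        (when, when),
    ).fetchone()
    if row is None:
        return None
    return {"freq_channel": row[0], "attenuation_db": row[1], "pointing": row[2]}


print(desired_state(conn, 1130))  # a time inside the second observation
print(desired_state(conn, 5000))  # a time with nothing scheduled
```

Because rows are never overwritten in this sketch, the same query answers questions about the past and the future alike.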

Video: Andrew Williams, instrumentation engineer who developed the monitor and control system for the MWA, speaks at PyCon 2017 about the telescope's software.

M&C Design Philosophy

From the very beginning it was assumed that as a ‘large N small D’ instrument, the MWA would need to be able to function normally with some of its parts (dipoles, tiles, receivers) not working, and that ‘not working’ could mean anything from physically disconnected through to ‘not satisfying some observation-specific criteria’. It was also assumed that the MWA would operate as a fully automated telescope, 24 hours a day, without any human ‘operator’ to act on error messages and make real-time observing decisions. These underlying assumptions drove the M&C design, and led to these design goals:

  • That it should be possible to determine the configuration, desired state, and actual state of the telescope both at the current instant, and at any time in the past, from the M&C database.
  • That the M&C system should be able to reach every hardware and software setting in the array, so that the same scheduling system can be used for both science observations and engineering tests.
  • That the M&C system should be able to change state rapidly – ending the current observation and starting a new one with no more than 8 seconds of latency.
  • That the schedule controller would do its best to carry out observations even if some elements (receivers, tiles) were not communicating, or had been flagged as non-functional, rather than failing with an error. The default action should always be to record data if possible, but record any errors (e.g., differences between desired and actual state) so that the usefulness of the data can be determined later.
  • That the specific choice of which tiles (and dipoles) to include in a given scheduled observation should be instantiated as the observation is actually started, not days or weeks in advance when it’s added to the schedule database. For example, one observation might specify that it should only use tiles that are in some ‘perfectly healthy’ set, and another observation might request all of the tiles. Allocating which tiles were in which ‘tile set’ would be done by the operations team, or by some automated analysis.
  • That the M&C system should be able to mitigate the effects of hardware failures. For example, the schedule can specify which individual dipoles in each tile should be switched out of the final sum for a tile, and which tiles should be physically powered down during an observation to prevent emitted RFI in some hardware failure modes.
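
Two of these goals, choosing tiles only when the observation starts and recording errors instead of failing, can be sketched as follows. The function names, tile numbers and state fields are hypothetical and are not taken from the MWA code.

```python
def resolve_tiles(requested_set, tile_sets, flagged_bad):
    """Expand a named tile set at observation start, skipping flagged tiles."""
    return sorted(t for t in tile_sets[requested_set] if t not in flagged_bad)


def state_errors(desired, actual):
    """Return {tile: (desired, actual)} for every tile whose state differs.

    A tile that never reported back appears with an actual state of None.
    """
    return {
        tile: (want, actual.get(tile))
        for tile, want in desired.items()
        if actual.get(tile) != want
    }


# Tile sets as they stand at the moment the observation starts (hypothetical).
tile_sets = {"all": {11, 12, 13, 14}, "healthy": {11, 12, 14}}
flagged_bad = {12}

tiles = resolve_tiles("healthy", tile_sets, flagged_bad)
desired = {tile: {"attenuation_db": 10} for tile in tiles}
actual = {11: {"attenuation_db": 10}}  # tile 14 did not report its state

# The observation goes ahead; the mismatch is recorded for later analysis.
errors = state_errors(desired, actual)
print(tiles, errors)
```

The point of the design is that `errors` is stored alongside the data, so the decision about whether the data is usable can be made later.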

Another main consideration was that resources were limited, and the M&C software was being written by a handful of scientists, none able to work on it full-time. This led to the decision to aim towards the use of existing open-source tools where possible, and to make the M&C system a collection of small, inter-operating tools rather than a monolithic system.

M&C Implementation

Database

All of the configuration and state data for the MWA is stored in a PostgreSQL database hosted on-site, and replicated in real time to a backup server at the Pawsey Supercomputing Centre in Perth.

Schedule Controllers

There are two daemons (programs running continuously) that do all the work required to send new observation state information to all the hardware and software components of the MWA. Both of these schedule controllers operate the same way – a few seconds before the start of each new observation, the data describing the new telescope state is sent to each of the clients attached to that controller. Each client then waits until the instant the observation is due to start, then changes its hardware state to match the desired new state.

  • ObsController: Written in Java, this controller communicates with the receiver enclosures on site, each of which digitises the signals from 8 tiles. ObsController only checks that each client has received the new state data; it doesn’t wait to see whether the hardware changes succeeded. Instead, status information from each of the receiver enclosures (hardware errors, etc) is sent over the network to a separate ‘Status Monitor’ daemon, which records the actual state in the M&C database. ObsController also communicates with the software correlator and data capture system, to change correlator modes as set by the schedule, and to start and stop data capture for observations.
  • PyController: Written in Python, this accepts ‘registrations’ over the network from software clients that wish to be informed about state changes for one or more tiles. It was written to communicate with the MWA Phase II long-baseline tiles connected using fibre instead of coaxial cable, but it was designed to be flexible enough to accommodate a wide range of client types, such as external instruments like the Engineering Development Array. Status information (hardware failures, etc) is returned directly from each client, and recorded in the M&C database.
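
The pattern shared by both controllers (send the state a few seconds early, apply it at the start instant) might look like the sketch below. The class names, the four-second lead time and the in-process method calls are assumptions made for illustration; the real controllers talk to their clients over the network, and a simulated clock is used here instead of real time.

```python
class Client:
    """Hypothetical client that holds new state until the start instant."""

    def __init__(self, name):
        self.name = name
        self.pending = {}   # tile id -> (start_time, state)
        self.current = {}   # tile id -> state applied to the hardware

    def receive(self, tile, start_time, state):
        self.pending[tile] = (start_time, state)
        return True  # acknowledges receipt only, not a hardware change

    def tick(self, now):
        """Apply any pending state whose start time has arrived."""
        for tile, (start_time, state) in list(self.pending.items()):
            if now >= start_time:
                self.current[tile] = state
                del self.pending[tile]


class Controller:
    """Hypothetical schedule controller with per-tile registrations."""

    def __init__(self, lead_seconds=4):
        self.lead_seconds = lead_seconds
        self.registrations = {}  # tile id -> list of registered clients

    def register(self, client, tiles):
        for tile in tiles:
            self.registrations.setdefault(tile, []).append(client)

    def dispatch(self, now, observation):
        """Send the new state once `now` is within the lead time of the start."""
        if now < observation["start"] - self.lead_seconds:
            return []
        acked = []
        for tile, state in observation["tile_states"].items():
            for client in self.registrations.get(tile, []):
                if client.receive(tile, observation["start"], state):
                    acked.append((client.name, tile))
        return acked


controller = Controller()
fibre_client = Client("long_baseline")
controller.register(fibre_client, tiles=[201, 202])
observation = {
    "start": 1000,
    "tile_states": {201: {"attenuation_db": 10}, 202: {"attenuation_db": 12}},
}
acked = controller.dispatch(now=997, observation=observation)
fibre_client.tick(now=999)    # too early: nothing is applied yet
fibre_client.tick(now=1000)   # start instant: pending state becomes current
print(acked, fibre_client.current)
```

This sketch resends the state on every call to `dispatch` inside the lead window; a real controller would also need to remember which observations it has already sent.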

Receiver code on the MWA receiver control computers

The MWA receiver hardware is described in detail elsewhere (Tingay et al. 2013, Prabu et al. 2014). The receiver control code runs on a single-board computer (SBC) that is now old, unsupported by the manufacturer, and not quite x86 compatible. This means that it is limited to the existing operating system (Debian 6.0/squeeze, Linux 2.6.32 kernel, Python 2.6.6) and has limited processing resources. The hardware controlled by the SBC is attached using a mix of USB, I2C and pin-level digital IO connections. It is managed by eight small daemons, each carrying out a single task. Most are written in Python, but three (usb_control, gpio_control, and receiverClient) are written in C. For historical reasons, they communicate with each other through tables in a PostgreSQL server instance running locally on the SBC.

  • ‘magicstart’, handling remote control of powerup, initialisation (programming firmware, etc), and shutdown.
  • ‘usb_control’, handling all communication with the ‘digital crate’ that does all data processing. Once the receiver has been powered up and initialised, this is limited to changing the frequency selection (which 24 coarse channels are sent out over the fibre), and reading a full 256 channel integrated power spectrum from one of the 8 tiles every 8 seconds. These power spectra are written directly to the telemetry database.
  • ‘gpio_control’, which handles all digital IO (turning pins on and off) using the dedicated GPIO card. This includes sending new delay settings to all the tiles, turning power supplies on and off and sensing their state, and air conditioner control (compressor and fan state).
  • ‘i2c_control’, which handles all serial communications over the several I2C buses inside the receiver. This includes reading values from a few tens of temperature, current, and humidity sensors every few seconds, as well as allowing control of the programmable attenuators on each of the sixteen input channels (8 tiles in X and Y).
  • ‘thermal_loop’ handles temperature control inside the receiver. It uses internal temperatures (read by i2c_control) to manage the air conditioner compressor and fan state as required (using gpio_control). It also shuts the digital crate power down entirely if any of the temperature sensors exceed predefined maximum limits.
  • ‘watchdog’ is a process that periodically resets the hardware watchdog timer on the SBC board if all functions in the receiver are working normally (temperatures being read, thermal_loop managing the air conditioner, etc). If the watchdog process stops, or fails to reset the hardware timer for more than a few tens of seconds, the onboard computer will automatically reboot.
  • ‘receiverClient’ communicates with ObsController and gets the new state data for each observation. It uses i2c_control, gpio_control and usb_control to send new attenuation, pointing and frequency data to the hardware as required, at the exact time each new observation starts.
  • ‘sendStatus’ is the process that sends the internal receiver state – frequency, attenuation, pointing, and temperature – back to the StatusMonitor process on the main M&C server, where it is logged in the M&C database.
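
The decisions made by ‘thermal_loop’ and ‘watchdog’ can be illustrated with a short sketch. All thresholds, sensor names and function names below are hypothetical, the hysteresis band is an assumption about how such a loop is commonly written, and the real daemons exchange their data through the local database tables instead of function arguments.

```python
def thermal_decision(temps_c, compressor_on,
                     setpoint_c=25.0, band_c=2.0, max_c=50.0):
    """Return (compressor_on, fan_on, crate_power_on) for one control cycle."""
    hottest = max(temps_c.values())
    if hottest > max_c:
        # Over-temperature on any sensor: cool hard and cut crate power.
        return True, True, False
    if hottest > setpoint_c + band_c:
        compressor_on = True
    elif hottest < setpoint_c - band_c:
        compressor_on = False
    # Inside the band the compressor keeps its previous state (hysteresis).
    fan_on = compressor_on or hottest > setpoint_c
    return compressor_on, fan_on, True


def should_reset_watchdog(last_update, now, max_age_s=10):
    """True only if every monitored function has reported recently."""
    return all(now - stamp <= max_age_s for stamp in last_update.values())


print(thermal_decision({"asc": 28.0, "crate": 26.5}, compressor_on=False))
print(thermal_decision({"asc": 55.0, "crate": 30.0}, compressor_on=True))
print(should_reset_watchdog({"i2c_control": 100, "thermal_loop": 80}, now=105))
```

If `should_reset_watchdog` keeps returning False, the hardware timer is never reset and the SBC reboots itself, which is the behaviour described for the watchdog above.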

Figure 1

Telemetry

The M&C system monitors quantities such as temperature, voltage, current, and power at almost 8000 points. Some of this data is relevant to science and made available to be used during data reduction (beamformer temperatures at the start of each observation, full-spectrum power from each tile every few seconds), but most of it is only useful for fault detection and diagnostics (server fan speeds, free disk space, etc).

Most of the telemetry data is collected, stored, and visualised using an open-source package called ‘Graphite’. It is kept in a ‘round robin’ style format, where measurements are saved at frequent intervals for some days or weeks, then automatically averaged to longer intervals and kept for a longer period, and finally discarded. For example, server telemetry (motherboard temperatures, fan speeds) is saved at 5-minute intervals for 90 days, then averaged to 1-hour intervals and saved for 10 years, then discarded. The Graphite package allows us to graph these quantities in various combinations and formats, and build up ‘dashboards’ of frequently used graph layouts. This is used for monitoring the health of the telescope.
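
The storage cost of the server telemetry policy quoted above can be worked out with a few lines of arithmetic. The sketch below is not Graphite's own code; it only mimics the averaging step, and the `downsample` helper and the sample values are hypothetical. In Graphite, a policy of this kind is normally written as a retentions entry such as `5m:90d,1h:10y`, although the actual MWA configuration is not given here.

```python
def downsample(samples, factor):
    """Average consecutive groups of `factor` samples.

    A trailing group with fewer than `factor` samples is dropped.
    """
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, len(samples) - factor + 1, factor)
    ]


DAY_S = 86400
FINE_STEP_S, FINE_KEEP_S = 5 * 60, 90 * DAY_S          # 5 minutes for 90 days
COARSE_STEP_S, COARSE_KEEP_S = 3600, 10 * 365 * DAY_S  # 1 hour for 10 years

fine_points = FINE_KEEP_S // FINE_STEP_S        # slots per metric, fine archive
coarse_points = COARSE_KEEP_S // COARSE_STEP_S  # slots per metric, coarse archive
factor = COARSE_STEP_S // FINE_STEP_S           # fine samples per coarse sample

# Two hours of 5-minute fan temperature readings (hypothetical values).
two_hours = [40.0] * 12 + [46.0] * 12
print(fine_points, coarse_points, factor)
print(downsample(two_hours, factor))
```

Each metric therefore occupies a fixed number of slots however long the telescope runs, which is what makes the round-robin format attractive for thousands of monitoring points.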

Telemetry data that is relevant to the actual data analysis (or may at some future time be relevant) is stored indefinitely in PostgreSQL tables on a separate server (not in the main M&C database).

Real-time telescope health monitoring

We use the open-source system Icinga2 (similar to Nagios) to monitor the telescope health in real time, including both traditional server farm diagnostics (disk space, rack temperatures, which processes are running, query response times, etc, on dozens of servers) as well as custom diagnostics (autocorrelation power for each tile, software correlator status, etc). Icinga2 has flexible web dashboard panels summarising the health (OK, WARNING, CRITICAL, or UNREACHABLE) of all the telescope components in a hierarchical structure. It generates emailed alerts to the operations team when there are problems, and takes action itself for time-critical faults (for example, powering down entire server racks if the temperature is too high).
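
A custom diagnostic for a system of this kind is usually a small program that follows the Nagios-style plugin convention, which Icinga2 also accepts: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below assumes a current Python 3 on the monitoring server; the check name, thresholds and wording are hypothetical and are not the MWA's actual checks.

```python
import sys

# Exit codes defined by the Nagios-style plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_temperature(temp_c, warn_c=35.0, crit_c=45.0):
    """Return (exit_code, status_line) for one rack temperature reading."""
    if temp_c is None:
        return UNKNOWN, "UNKNOWN - no temperature reading available"
    if temp_c >= crit_c:
        code = CRITICAL
    elif temp_c >= warn_c:
        code = WARNING
    else:
        code = OK
    # Text after the '|' is performance data that can be graphed.
    line = "{0} - rack temperature {1:.1f} C | temp={1:.1f}".format(
        LABELS[code], temp_c
    )
    return code, line


def main(argv):
    """Hypothetical entry point: the reading is passed as the first argument."""
    reading = float(argv[1]) if len(argv) > 1 else None
    code, line = check_temperature(reading)
    print(line)
    return code


# When installed as a plugin, the script would end with: sys.exit(main(sys.argv))
```

The UNREACHABLE state shown on the dashboards is assigned by the monitoring system itself when a host's parents are down, so a check of this kind never returns it.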

Web pages and web services

There are a number of human-readable web pages used for telescope diagnostics, as well as machine-readable web services used by remote clients to collect metadata about observations. Most of these are Django services (available from the M&C server on site, and also from another Django server at the Pawsey Supercomputing Centre, using the local database mirror), but there are a few traditional CGI scripts as well as static HTML and image files updated automatically at regular intervals.