Starting and Monitoring IVS sessions

Always try to start the schedule at least 5 mins in advance of the first scan.

If the scheduled start time is in the future, start the schedule with

schedule=r4447hb,#1

You can schedule commands in the field system through the operator input with a command like

!2014.293.16:00:00

or

!+5h

(for 5 hours from now). Any commands entered after this will not be carried out until the specified time. NB - this can make it look like the FS has locked up! You can remove any queued commands with the

flush

command, entered in the operator input.
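
The absolute form shown above is year, day of year, then time of day. As a rough sketch (Python, standard library only), the string to type can be generated from a calendar date so nobody has to count day numbers by hand. The function name fs_wait_until is made up for this page and is not part of the field system. It also assumes FS time is UTC and treats a datetime without a time zone as UTC.

```python
from datetime import datetime, timezone


def fs_wait_until(when: datetime) -> str:
    """Format a datetime as an FS wait command such as !2014.293.16:00:00.

    Hypothetical helper: assumes the FS expects UTC and that naive
    datetimes are already in UTC.
    """
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    when = when.astimezone(timezone.utc)
    # %j is the zero-padded day of year (001-366), matching the example above.
    return when.strftime("!%Y.%j.%H:%M:%S")


if __name__ == "__main__":
    # 20 October 2014 is day 293 of the year.
    print(fs_wait_until(datetime(2014, 10, 20, 16, 0, 0, tzinfo=timezone.utc)))
```

Paste the printed string into the operator input, followed by the commands you want held until that time.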

If you are starting late (or re-starting for some reason), start the schedule with

schedule=r4447hb

If you start only slightly late (i.e. within five minutes of the scheduled start), you may get errors like: m5 -900 no scans; m5 -900 not while recording or playing; m5 -900 can't get device info. The Mark5 is looking for the previous scheduled scan to check, which doesn't exist, and so it can't start recording the next scheduled scan. In this case, start the schedule with a start line number:

schedule=r4447hb,#24

Otherwise: do NOT specify a start line number if starting late.

Then send a start message to IVS.

Every two hours during the experiment eRemoteControl will bring up a checklist. Please run through the checklist when it appears. A description of the checklist parameters and what to look out for is available here:

Please also edit and update the Handover Notes page with any information on current issues, recent problems etc that need to be passed on to the next observer.

The system monitor provides a useful summary of the drives & other Monica parameters. You can run it on ops2, ops4 or ops5 with the command monitor_system.pl or from the drop-down menu on the desktop “Applications → AuScope → Monitor System”. Run it once for each site.

You can launch clock displays too from the “Applications → AuScope” menu.

It's best to only have the Katherine and Yarragadee VNC sessions running when running through the checklist and to rely on eRemoteControl, the system monitor and the log monitor the rest of the time. You should use the VNC sessions to check that the autocorrelation spectra are OK though.

This web page provides a one-page summary of the webcams and weather radar.

The PC in the “lounge” (ops6) runs the same PCFS log monitor that ops4 does. To start it, double-click on the “Log monitor” icon on the desktop and then open the same log file that econtrol is writing to. Run it once for each station.

Other things to watch out for

Antenna control and monitoring

  1. Sometimes the connection between the FS and the antenna controller can be lost. If this happens the alarm will sound, and you'll see messages like this:
    2012.297.03:58:54.01?ERROR st -998 reading SystemClock1
    2012.297.03:58:54.01?ERROR st -999 TCP/IP connection was closed by remote peer
    2012.297.03:58:54.01?ERROR st   -5 Error return from antenna, see Mbus error.

    and you'll need to re-establish the connection by typing:

    antenna=open (this re-opens the connection)
    antenna=status (if the connection has re-established, you will see several lines of status messages)

    If it still doesn't work, run this from a terminal:

    ping syshb

    If syshb is not “pingable”, it might be a communication issue. Contact the on-call person and tell them.

  2. We occasionally get an alarm from the mark5 like this:
    2012.312.18:08:21.42?ERROR m5 -900 Probably no such disk 
    2012.312.18:08:21.42?ERROR m5 -904 MARK5 return code 4: error encountered (during attempt to execute)

    It can be ignored. It seems to be linked to disk modules that contain less than the maximum number of disk drives (8).

  3. We sometimes get this error (or similar):
    2012.312.01:50:32.28#trakl# Computer time window is 270 milliseconds
    2012.312.01:50:32.28?ERROR st  -24 Computer time window exceeded 0.25 seconds, see value above.
    2012.312.01:50:32.28#trakl# ACU Time window is 252 milliseconds
    2012.312.01:50:32.28?ERROR st  -26 ACU time window exceeded 0.25 seconds, see value above.

    This is probably a network issue and only seems to happen when we’re running eTransfers. It can be safely ignored.

  4. The following error message is sometimes seen and is usually related to the antenna slewing or settling on source:
    2012.314.18:02:02.16#antcn#Antenna outside nominal tracking tolerance of 0.0045 degrees, current tolerance 0.0100.

    If this persists then there may be a tracking problem but if it only occurs before/after slews or in strong winds (see next note), it can probably be safely ignored.

  5. On windy days we have seen error messages reporting the pointing is outside the tolerance level set in the FS. This seems to be the antenna controller reacting to wind gusts and seems to affect azimuth only. Not much can be done about this. The ptol command can be used to adjust the tolerance to a higher level but in general it's probably better to just put up with it.
  6. Wind stows. The antenna should automatically recover from wind stows without intervention.
  7. Starting up. If the antenna has been used under one of the following conditions, it will not begin to move until the command antenna=operate has been given:
    1. Used under HMI (e.g. a SpaceX track, maintenance)
    2. Used in local mode (from the hand-box)
    3. Driven manually out of a hard limit
    4. Drive not enabled (e.g. if an Emergency stop has been triggered and released)
    5. Drives have been in a Run Not Permitted state (drive or brake problems)
    6. Power switch is on but drives not energised
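
The error lines quoted in the notes above all share one layout: a time stamp, ?ERROR, a two-letter source, a numeric code and some text. As a minimal sketch (Python, standard library only), they can be pulled apart like this when you are searching a log by hand. The names FsError, parse_error and is_time_window_warning are invented for this page, and the pattern is based only on the sample lines shown here, not on any FS specification.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample log lines on this page (an assumption).
ERROR_LINE = re.compile(
    r"^(?P<stamp>\d{4}\.\d{3}\.\d{2}:\d{2}:\d{2}\.\d{2})"
    r"\?ERROR\s+(?P<source>\S+)\s+(?P<code>-?\d+)\s*(?P<text>.*)$"
)


class FsError(NamedTuple):
    stamp: str   # e.g. 2012.297.03:58:54.01
    source: str  # e.g. st, m5
    code: int    # e.g. -999
    text: str    # remainder of the message


def parse_error(line: str) -> Optional[FsError]:
    """Return the parsed error, or None if the line is not an ?ERROR line."""
    match = ERROR_LINE.match(line.strip())
    if match is None:
        return None
    return FsError(match["stamp"], match["source"],
                   int(match["code"]), match["text"].strip())


def is_time_window_warning(err: FsError) -> bool:
    """True for the st -24 / st -26 time window errors described in note 3."""
    return err.source == "st" and err.code in (-24, -26)


if __name__ == "__main__":
    sample = "2012.312.01:50:32.28?ERROR st  -24 Computer time window exceeded 0.25 seconds, see value above."
    print(parse_error(sample))
```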

Recovering from power failures

To reinitialise the connection to the antenna:

antenna=open

To get the antenna moving again:

antenna=operate

Antenna stuck

When the antenna is stuck, launch the VNC session to the timepc and go to the antenna control window. Look for red buttons there. Set the schedule to halt in the FS input. You can turn the antenna on/off manually with operate/standby and switch the Drives on/off. If necessary, do the RESETS at the bottom.

After turning the drives on, it is recommended to wait a little while before putting the antenna in 'operate'. This step is crucial if the previous steps haven't resolved the issue.

DBBC Crash: ERROR db -21....

Please see the instructions on http://auscope.phys.utas.edu.au/opswiki/doku.php?id=operations:documentation:dbbc_restart for restarting the dbbc if it dies.

Re-starting observations, or starting late

Starting a schedule file with no additional arguments will start the observations according to the schedule, with the first observation beginning no earlier than 5 minutes from now. This is usually the best option. If you want to start at a particular part of the file, you can do so as follows (taken from the manual):

            Syntax:     schedule=name,start,#lines

            Response:   schedule/name,line


Settable parameters:

          name      Name of schedule file to be started. If no directory
                    path is specified, /usr2/sched assumed. If no
                    extension is specified, .snp is assumed. Any
                    currently-executing schedule file is closed, and the
                    new schedule file is opened. If the new file cannot be
                    opened, there will be no schedule active. When a valid
                    schedule is started, a cont command may be necessary.

          start     Place in the schedule to begin executing. May be one
                    of the following:

                null   to start with the observation beginning no
                       earlier than 5 minutes from now.

                #line  for a line number in the file, should be a
                       source command.

                time   to start with the observation beginning no
                       earlier than this time. time is in standard SNAP
                       format.

          #lines    Number of lines to execute before automatically
                    halting. Default is the remainder of the schedule.

Monitor-only parameters:

          line      The line number to be executed next.

Comments: 
If the schedule is started successfully, a log file having the
same name as the schedule is automatically started, and the
procedure file having the same name as the schedule is
automatically established as the schedule procedure library.
Any previously time-scheduled procedures from this library are
cancelled. If a # of lines is specified, an automatic halt
will be issued after execution of these lines. The schedule
may then be continued using the cont command.
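
As a small illustration of the syntax above, here is a sketch (Python) that builds the two forms used on this page: schedule=name for a late start and schedule=name,#line for a given line number. The helper name schedule_command is hypothetical, and the time and #lines arguments are deliberately left out because this page gives no worked example of them.

```python
from typing import Optional


def schedule_command(name: str, start_line: Optional[int] = None) -> str:
    """Build a schedule= command string.

    Hypothetical helper covering only the forms shown on this page:
    no start argument (FS picks the next observation no earlier than
    5 minutes from now) or an explicit #line start.
    """
    if not name:
        raise ValueError("schedule name must not be empty")
    if start_line is None:
        return f"schedule={name}"
    if start_line < 1:
        raise ValueError("line numbers start at 1")
    return f"schedule={name},#{start_line}"


if __name__ == "__main__":
    print(schedule_command("r4447hb"))      # late start or restart
    print(schedule_command("r4447hb", 1))   # start time still in the future
```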

"rfpic" or rfpcn problem

If you receive a persistent “rfpcn: error opening, rfpic probably not running, see above for error” report, or notice that the recording is notably behind the summary file and the backlog grows, you might want to restart Rxmon.

Log in to pcfs as root and run the following commands:

pcfsyg:~#  su
pcfsyg:~#  /etc/init.d/Monica.Rxmon stop
pcfsyg:~#  ps -ef | grep Rxmon
pcfsyg:~#  /etc/init.d/Monica.Rxmon start

If the command worked, you will see the parameters listing.

The “/etc/init.d/Monica.Rxmon start” command may fail with “ERROR on binding: Address already in use”. If so, wait a minute and try again. If it still doesn't work, restart Monica:

pcfsyg:~#  /etc/init.d/Monica.monica stop
pcfsyg:~#  /etc/init.d/Monica.monica start

The system monitor will close and need to be reopened.

scan_check reports a data format problem, i.e. the "E" problem

Below is an edited description of the problem and how to fix it. The log monitor software should ring an alarm if it occurs but periodic checking of the scan_check output is also advised. The problem seems to occur either when there is a problem with a disk in a module (e.g. poor write speed) or when a Mark5 configuration command (e.g. fmset or mk5b_mode) is sent while recording.

To: Stations with Mark 5B/5B+ Recorders

From: Ed Himwich, Dan Smythe, and Rich Strand

Date: 8 May 2012

Re: Mark 5B/B+ recorder ",E" errors from "scan_check"

INTRODUCTION

It has been recently noticed with Mark 5B/5B+ recorders that sometimes 
a ",E" occurs at the end of the "scan_check" response. This indicates 
a problem with the data format on the disk. It is not entirely understood 
what causes this problem, but when it occurs, the scan with the error is 
unusable. If it occurs all the time, corrective action is needed. We list 
below steps to take to deal with this problem if it occurs at your station.

Please be aware that once this error occurs persistently on a module, it 
is not safe to record any more data on that module. Set the module aside, 
appropriately labeled, and in its place use an empty module that has been 
erased/conditioned in your recorder.

After changing modules, you should test with the new module using the 
"recscan" procedure.

The next section gives a complete procedure for recovery.

RECOVERY PROCEDURE

If the ",E" error is occurring persistently, please take the following actions:

(1) Halt the schedule:

     disk_record=off
     halt

   When there is a problem with any Mark 5 Recorder, it is rarely
   helpful to terminate the FS. It's not good to halt the schedule
   or terminate the FS while recording, you will fill the disk. The FS
   will try to prevent you from terminating while recording, but it
   won't try to stop you from halting the schedule while recording,
   so please try to avoid that.

(2) It is now necessary to swap to a fresh, blank module. 

    If the other Mark5 bank contains a blank module:
    (2.1) Select it using the command "mk5=bank_set=inc"
    (2.2) Ask a local to remove the module that is having the error and label it
          appropriately.
   
    If the other bank is empty or contains a module with data that should be kept:
    (2.3) Ask a local to remove the module that is having the error and label it
          appropriately.
    (2.4) Ask a local to insert an empty module in this recorder, preferably already
          erased/conditioned.
    (2.5) Please make sure the new module is the one selected for recording.
          If not, use "mk5=bank_set=inc" to change which is selected.

(3) Once you have verified that the new module has been selected, erase
   it with:

     mk5=protect=off
     mk5=reset=erase

   (standard procedure for any fresh Mark 5 module)

(4) Test the new module with:

     recscan

(5) If the scan_check from (4) does not show the ",E" error, the problem
   is probably resolved.

(6) After verifying, again, that the new module that you used in step
   (4) is selected, erase this new module:

     mk5=protect=off
     mk5=reset=erase

(7) If the problem was resolved in step (5), you can rejoin the
   schedule at the next opportunity using the "schedule=..." command.
   Do not use the "CONT" command since this will attempt to observe
   the scans you missed since you entered "halt". You can (should) use
   the new module that you used for step (4) and erased in step (6) to
   continue the schedule.

   If the problem was not resolved, please contact the on-call person. 
   The next step would be to try a complete restart of the Mark 5 with 
   a power cycle to see if that helps. You can also try another new module.

(8) If there are some scans on the "bad" module without the error, it 
    should be sent for correlation.
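
The check described in the memo, a ",E" at the end of the scan_check response, comes down to a simple string test. The sketch below (Python) shows that test plus a crude "is it persistent?" check over the most recent responses. The function names are invented for this page, and the window of three scans is an arbitrary assumption, since the memo does not say how many repeats count as persistent.

```python
import re
from typing import Sequence

# A trailing ",E" (allowing stray whitespace) at the very end of the response.
_E_FLAG = re.compile(r",\s*E\s*$")


def has_format_error(response: str) -> bool:
    """True if a scan_check response ends with ",E"."""
    return _E_FLAG.search(response) is not None


def is_persistent(responses: Sequence[str], window: int = 3) -> bool:
    """True if the last `window` responses all end with ",E".

    The default window of 3 is an assumption, not a value from the memo.
    """
    recent = list(responses)[-window:]
    return len(recent) == window and all(has_format_error(r) for r in recent)
```

A single ",E" means that scan is unusable. The recovery procedure above applies when the error keeps coming back.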

Drive PC Failure

If the DrivePC fails, you will probably get an error like this:

00:40:34#antcn#Error: Cannot get monitor info from antenna (8020002)
00:40:34#antcn#Network I/O Timeout occurred on read/accept

This means you will need to restart the DrivePC. To do this:

1. Open a VNC session to Newsmerd.

2. Open a terminal on Newsmerd.

3. Enter the command

rem_reboot -r rakbus sys26m

4. Ping the DrivePC to wait until it's running. When that happens, enter (into the field system)

source=disable

This creates a new socket connection from the field system to the Drive PC.

5. Then, re-enter the schedule to start it going again to make sure it goes to the right source. Be sure to check that the drives are coming on. You can check if it's moving by typing

onsource

Common problems

Formatter to FS time offset

You might get this error:

ERROR sc  -13 setcl: formatter to FS time difference 0.5 seconds or greater 

To fix this, enter:

sy=run setcl offset

Note this error is likely to reappear regularly.

Note also that the error message

?ERROR sc  -18 setcl: program is already running, try "run setcl" instead.

has been seen recently when the command is issued from a terminal window. The problem has not been seen when the command is entered into the oprin window. If you do get this error when entering the command into the oprin window, please tell Jim.

FS time is out by minutes, hours, months or years!

The time or date in the field system log output is very wrong, but all the time settings appear OK: the FS time in fmset is correct, the date command on pcfs[hb|ke|yg] reports the correct time, etc. The cause is that the field system actually uses the hardware BIOS time from the pcfs[hb|ke|yg] computer, not the operating system time. If all the times appear OK but the field system is still incorrect, you will need to fix the hardware BIOS time setting. To read the hardware time (and its difference from the system time), as root user:

hwclock -r

The system time comes from a local GPS receiver which runs an NTP server. Check that the pcfs[hb|ke|yg] system time is indeed correct:

ntpq -np

The offset from the first server in the list should be less than 10 ms. Then write the current system time to the hardware clock, as root user:

hwclock -w

That the hardware clock has gone wrong probably indicates a fault, such as a bad BIOS battery on the motherboard that needs replacing.

FS time is out by several seconds

The origin of this problem is presently unknown, but the FS time can get seriously out of step. To fix this, while not recording, start the fmset program from an oper@pcfshb terminal and issue the “+” and “-” commands, then quit from fmset (ESC). Restart fmset and the FS time should now be correct. You may need to resync the Mark5B PPS after this procedure.

clkoff reading is drifting or far from the maser-GPS offset

The clkoff command measures the difference between the 1 PPS (pulse per second) signal coming from the GPS and the 1 PPS from the Mark5. The Mark5 1 PPS has travelled through both the DBBC and Mark5 and is a good diagnostic of a timing problem in our hardware.

There are occasionally timing glitches (clock jumps) that cause the clkoff value to change. There are several possible causes:

  1. Spurious signals on the 1 PPS signal. For example, at Yarragadee we sometimes see a clock jump when the antenna drives are powered on. We also sometimes see it as a result of poor earthing or a bad connection in the cable between the DBBC and Mark5.
  2. DBBC problem. Sometimes the DBBC (which uses the 1 PPS from the maser and passes its timing on to the Mark5) can become unstable and the 1 PPS signal will start to drift.

The easiest way to check for clock stability is to compare the clkoff and maserdelay values. The difference between these two should remain stable at around 0.3 us. The Log Monitor software calculates the difference and logs it as the “Delay difference”. If the absolute value of this difference exceeds 0.5 us, an alarm is sounded (by default).
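
For reference, the comparison described above can be written out as a few lines of Python. This is only a sketch of the arithmetic, not the Log Monitor's actual code. The function names are invented, and taking the difference as clkoff minus maserdelay is an assumption about the sign convention.

```python
ALARM_THRESHOLD_US = 0.5  # default Log Monitor alarm level quoted above


def delay_difference(clkoff_us: float, maserdelay_us: float) -> float:
    """Return clkoff minus maserdelay in microseconds (sign is an assumption)."""
    return clkoff_us - maserdelay_us


def should_alarm(clkoff_us: float, maserdelay_us: float,
                 threshold_us: float = ALARM_THRESHOLD_US) -> bool:
    """True if the magnitude of the delay difference exceeds the threshold."""
    return abs(delay_difference(clkoff_us, maserdelay_us)) > threshold_us
```

Both inputs must be in the same units (microseconds here) before they are compared.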

So what do I do if there's a clock jump?

The first thing to do is not panic. If the new delay remains constant and its absolute value is less than 20 us, the correlator can handle it. Re-setting the delay introduces another clock jump, which makes the correlation more difficult. So the first thing to do is in the Log Monitor:

  1. Press “Acknowledge alarm”
  2. Under the “Configure” menu, select either:
    1. “Delay monitoring → Audible warning” which will make the monitor software beep every time it sees an offset larger than 0.5 us in absolute value, rather than sound the alarm, or…
    2. “Delay monitoring → Silent warning” which will log that the offset is large but not beep or ring alarms. This should be used with caution!
  3. Now monitor the Delay difference and see if it has stabilised. You can do this in several ways:
    1. Watch the Delay difference values in the log monitor window. You can get more frequent updates by issuing regular clkoff and maserdelay commands from e-RemoteCtrl.
    2. Get Log Monitor to extract a history of the delay and delay difference values by pressing the “Export Data” button. When you do this, several ASCII files will be written to /vlbobs/ivs/logs. The file that will be of most interest is (e.g. for Yarragadee) /vlbobs/ivs/logs/yg_ddif.txt. You can open this file and read its contents, or you can use a plotting program like gnuplot to plot the values. This is especially useful if you want to see if the new offset is stable or not:
      1. from a terminal window:
        cd /vlbobs/ivs/logs
        gnuplot
        plot 'yg_ddif.txt' with linespoints

        This will plot the delay difference against day number. You can use the right mouse button in the plot window to zoom in. Sometimes a spurious data point will make the graph painfully small. This example gnuplot command

        set yrange [-0.3:-0.275]
        replot

        will put the y-axis in the ball park for you. Change the numbers to suit the current offset. The command

        set xrange [*:*]
        replot

        will put the x-axis back to the full range of the datafile if you've zoomed in with the mouse.

Every time you press “Export data” the output files are refreshed and you can replot the values in gnuplot either by typing 'replot' or by pressing the “Replot” button in the plot window. Other potentially useful files to plot are yg_maser2gps.txt, the difference between the maser and GPS 1PPS, and yg_fmout.txt, the difference between GPS and Mark5 output 1PPS.
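
If you would rather have a number than judge the plot by eye, a straight-line fit to the exported values gives a drift rate. The sketch below (Python, standard library only) assumes the export file has two whitespace-separated columns, day number then delay difference in microseconds. That layout is a guess based on the gnuplot command above, so check the file first. The function names are invented for this page.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (day number, delay difference in us)


def parse_ddif(lines: Iterable[str]) -> List[Point]:
    """Read (day, delay difference) pairs, skipping comments and bad rows.

    Assumes two whitespace-separated numeric columns (an assumption).
    """
    points: List[Point] = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # ignore rows that are not numeric
    return points


def drift_rate(points: List[Point]) -> float:
    """Least-squares slope of delay difference versus day, in us per day."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all points have the same day number")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


if __name__ == "__main__":
    with open("/vlbobs/ivs/logs/yg_ddif.txt") as handle:
        print(drift_rate(parse_ddif(handle)), "us/day")
```

A slope close to zero over the period since the jump suggests the new offset is stable. Restrict the input to points after the jump, otherwise the jump itself will dominate the fit.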

So when do I need to reconfigure the DBBC, run fmset etc?

If the delay difference is stable you don't need to do anything.

If the delay difference is more than 20 us, or gets so large that the clkoff or maserdelay values lose precision, run fmset to get the delays back to something manageable. Make sure you are not recording while running fmset! Issuing a halt command from e-RemoteCtrl followed by disk_record=off is usually a safe method.

If the delay difference is drifting (usually linearly), the DBBC probably needs reconfiguring. This can be done from e-RemoteCtrl as follows (again, best to halt the schedule and make sure you're not recording):

dbbc=reconf

Monitor how things are going in the DBBC VNC session. A reconfig takes about 2 minutes. When it's completed, synchronise the dbbc:

dbbc=pps_sync

Then in a terminal window on pcfs[hb|ke|yg], run fmset to get the clocks lined up.

Now resume observations with a cont or schedule= command.

If a reconf does not stop the clock drift, try rebooting the DBBC (using the Windows Start menu) and restarting the DBBC Server.
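
The decision steps in this section can be summarised as a small lookup. The sketch below (Python) only restates the rules above. The function name is invented, the returned strings are reminders rather than commands to paste in as they stand, and checking for drift before checking the 20 us limit is an ordering assumption.

```python
CORRELATOR_LIMIT_US = 20.0  # largest stable offset this page says is manageable


def recommended_action(delay_diff_us: float, drifting: bool) -> str:
    """Map the current delay difference and drift state to the advice above."""
    if drifting:
        # Reconfigure and resync the DBBC, then line the clocks up with fmset.
        return "halt; disk_record=off; dbbc=reconf; dbbc=pps_sync; fmset"
    if abs(delay_diff_us) > CORRELATOR_LIMIT_US:
        # Stable but too large: bring the delays back with fmset only.
        return "halt; disk_record=off; fmset"
    return "no action"
```

Whether the offset is drifting is a judgement made from the exported history or the plot, so it is passed in rather than computed here.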

PCFS log window reports problem with ReadPower.sh

This occurs when communication with the power sensor (a USB device) in the IF rack is lost. The power sensor is required for System Temperature (Tsys) measurements. The solution is to firstly disable Tsys measurements, then cycle power to the sensor using the Internet Power Switch, then check that it's working and lastly re-enable Tsys measurements.

You can disable the Tsys measurements as follows (Hobart is used as an example here):

On pcfshb:

pfmed
pfmed: pf,station
pfmed: ed,systemp12

An editor will start. Comment out the command by putting a double-quote at the start of the line. It should then look like this:

"sy=/usr2/oper/systemp12rcp.sh &

Now exit the editor, and

pfmed: exit

Lastly, please make a note in the log that Tsys measurements have been disabled.

Next kill any remaining systemp12rcp.sh or ReadPower.sh processes running on pcfs[hb|ke|yg] (use ps -ef | grep ReadPower to identify the process IDs). Become root with su and issue the command

/etc/init.d/AgilentU2000 restart

It will run a series of procedures to toggle the power and then try to re-establish communications. It may take two tries to get it fully working - when it is ok, you should get a blithely cheery message to this effect, and be wished good luck. When you receive this message, wait for a break in the recording and test the power sensor by running /home/oper/systemp12rcp.sh. All being well, there should be no timeouts although the measured power is likely to be nonsensical (there will be bogus values written into the data from the previous timeouts). If it fails with timeouts, persevere with the /etc/init.d/AgilentU2000 restart procedure. Once you have it working, repeat the pfmed process and remove the comment from the systemp12 procedure.

If econtrol gets closed during an observation

Recording continues as econtrol is a front-end viewer for the field system, so don't panic :)

When you restart econtrol from the menu it may be unable to load the telescope information (the drop-down menu boxes), and the terminal from which econtrol runs produces “Can't open interface” type errors. If this happens, in the econtrol window (the green one, not the terminal) press Ctrl+Shift+e, and then try to open one of the drop-down boxes again. This time the icon in the bottom right corner should go from red through 'connecting' to green, the information will load, and observing can continue as normal.

When econtrol is back, check that there is a green bar above the red dot, second icon to the right of the text entry field. This indicates that a log file is being recorded. If there is not a green bar on this button, press the button and specify an appropriate filename in /vlbobs/ivs/logs. Then in the Log Monitor, choose File > Open Log File and select the new file. Make a note that there are two log files.

If econtrol can't connect

Recording continues as econtrol is a front-end viewer for the field system, so don't panic :)

If the econtrol program can't connect even after repeated Ctrl+Shift+e commands, you should check to make sure that the econtrol daemon is running on the pcfs machine. Log in to the pcfs and run ps -ef | grep econtrold. There should be two entries in the list. If it's not running then start it with /usr2/econtrol/bin/econtrold. If it is running at first but you still can't connect, try killing the econtrold processes and restarting it.

If this still doesn't work, try killing the ercd process as root on the pcfs and then press Ctrl+Shift+e in econtrol.

If All-Sky Cam goes offline

If the All-Sky Cam goes offline at Katherine, open the timeke vncviewer and open the folder C:\thumbs. In it, there is a script called ftp_script. Double click on the shortcut to restart the script and the All-Sky Cam should run again.

Network connection lost to remote sites

If you're having trouble connecting to all computers at a remote site (i.e. Yg or Ke), the VPN connection between the site and UTAS may have been disrupted. Often this happens when VPN hardware is reset at UTAS. While the routers at the sites are set up to re-establish the connection back to the UTAS network, sometimes this can take quite a while or not work at all.

ITS have given us an account on the routers, but it can only be used when the VPN connection is lost. Fortunately, there are some computers at the remote sites which connect to the outside world without going through the UTAS VPN. We can use these machines to log in to the confused router and reset the VPN connection, assuming the physical connection to the site is still available.

To connect to the accessible computers at the site, open a VNC session by running, on ops2, the command: vncviewer $PC, where $PC is either ke-via-cdu or yg-via-internode. The password is the usual. (Rather than publish the IP addresses of these computers to the www here, I've instead written them in /etc/hosts on ops2. The above name will therefore only work on that machine.)

Now you will need to log in to the UTAS router at the site via this computer. Find the PuTTY program (its icon is two connected computers). When you open it, you should see a list of “Saved Sessions”. Select 131.217.61.1 for Ke or 192.168.1.61 for Yg, then press Load, then Open. You should now be presented with a black login window for the router. The username is physics and the password is connect.

If all goes well, you should now have a shell into the router. To reset the VPN connection, type the command clear crypto ipsec client ezvpn. You can now exit this shell and close the VNC connection.

The connection should be restored fairly quickly. While you wait, you could try pinging the pcfs at the site.

/home/www/auscope/opswiki/data/pages/operations/starting_monitoring_9.10.5.txt · Last modified: 2017/06/23 03:24 by Arwin Kahlon