Article #828: ECN services outage - ITaP Research Computing systems impacted
Engineering Computing Network (ECN) will be performing staged patching and reboots of all of ECN's Red Hat Linux workstations and servers to protect ag...
The Depot filesystem checks have all completed cleanly and the Depot has been fully returned to normal operations. All queues on all clusters are sch...
As of 9:15 PM, the Snyder and Rice clusters have been brought back into service after cooling was brought back online. Front-ends are operational and...
An issue with the cluster's gateway switches prevented IP over InfiniBand traffic from working. This also caused an instabil...
Most of the impact fell on the Depot storage system, which has now been restored to normal operations. All the other affected sys...
The cause turned out to be a power loss to Carter's scratch filesystem and portions of the Data Depot; power has now been restored. Carter no...
The underlying issues affecting Carter are resolved and job scheduling has been resumed. Many individual nodes remain offline for corrective action,...
The Hammer scratch filesystem has now returned to normal operations. Original Message: During the maintenance of the Rice and Snyder clusters this wee...
As of 12:46, December 2, the home filesystem serving Conte, Hammer, Hansen, Hathi, Peregrine1, Radon, Rice, and Snyder was restored to normal operatio...
The scratch filesystem serving Hammer, Rice, and Snyder has been restored to normal operations, and all queues have been re-enabled. Original Message:...
The scratch filesystem has been restored to full service and all queues have been restarted. Original Message: The scratch filesystem serving Conte i...
Update - 9:20pm Conte has been returned to full production as of 9:15pm. During the failure earlier today, the internal tracking of jobs within the sc...
Service was restored around 7:30pm today. Engineers changed the way Samba authenticates users to avoid this problem going forward. -- Service was rest...
October 30, 2015 11:00am ITaP Engineers have made additional timeout changes to the scratch filesystem, which have increased stability. Additional work...
**Update: August 25, 2015 9:00 pm** On Monday, August 24, a disk tray in the Rossmann scratch storage system suffered multiple failures and despite g...
UPDATE As of 8pm on August 15, 2015 the scratch filesystem serving Rossmann is back in full production. Original message: The scratch filesystem servi...
Due to power work in the MSEE building, most ECN services will be unavailable between 6:30am – 9:00pm EDT on Saturday, August 15, 2015. For Research C...
ITaP engineers have identified issues causing intermittent failures on Carter. Engineers are currently tuning parameters on the Depot system that have bee...
Update: The scheduling server has been rebooted and job submissions appear to be working normally again. Please let us know at rcac-help@purdue.edu if...
Due to power work in the MSEE building, most ECN services will be unavailable between 5:30 pm Thursday, 11 June 2015 and 8:00 am Friday, 12 June 2015....