Message boards : Theory Application : Many errors - output file missing

Peter Hucker

Joined: 12 Aug 06
Posts: 247
Credit: 1,639,321
RAC: 0
Message 43324 - Posted: 9 Sep 2020, 11:50:00 UTC
Last modified: 9 Sep 2020, 11:55:18 UTC

I've just put three machines onto Theory and have had 133 errors against 422 valid tasks. The errors show as "Output file Theory_2390-1138776-40_1_r1119008954_result for task Theory_2390-1138776-40_1 absent".

Most of these are from one machine, which has a slower and smaller hard disk - could it be running out of space or taking too long to write to disk?

I shall upgrade it anyway.
ID: 43324
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1818
Credit: 122,903,014
RAC: 76,336
Message 43325 - Posted: 9 Sep 2020, 12:11:22 UTC - in response to Message 43324.  

If it is this computer:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10651295


All tasks I checked stumble over this error:
2020-09-09 09:15:50 (16932): VM Heartbeat file specified, but missing.
2020-09-09 09:15:50 (16932): VM Heartbeat file specified, but missing file system status. (errno = '2')

This tells you that your disk I/O system is heavily overloaded and takes too long to update the heartbeat file.
Hence, the corresponding watchdog shuts down the VM.

You may try to stagger the start of fresh tasks to give your I/O system enough time to catch up with the demand.
In addition, you may check whether the computer has enough free RAM for the disk cache, which is allocated dynamically.
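
As a rough illustration of spotting this pattern, here is a minimal Python sketch that counts heartbeat failures in a task's stderr text. The line format is assumed from the two log lines quoted above; the function name `count_heartbeat_errors` and the regular expression are made up for this example and are not part of BOINC or the Theory application.

```python
import re
from collections import Counter

# Pattern assumed from the two log lines quoted above:
# "<date> <time> (<pid>): VM Heartbeat file specified, but missing..."
HEARTBEAT_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"VM Heartbeat file specified, but missing"
)


def count_heartbeat_errors(stderr_text):
    """Return a Counter mapping process ID to the number of heartbeat-missing lines."""
    hits = Counter()
    for line in stderr_text.splitlines():
        match = HEARTBEAT_RE.match(line.strip())
        if match:
            hits[match.group("pid")] += 1
    return hits


# The sample is copied from the log excerpt in this post.
sample = """\
2020-09-09 09:15:50 (16932): VM Heartbeat file specified, but missing.
2020-09-09 09:15:50 (16932): VM Heartbeat file specified, but missing file system status. (errno = '2')
"""
print(count_heartbeat_errors(sample))
```

If many tasks on the same host show these lines, that would be consistent with an overloaded disk rather than a one-off glitch.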
ID: 43325
Peter Hucker

Message 43326 - Posted: 9 Sep 2020, 12:38:23 UTC - in response to Message 43325.  

Yes, that's the computer in question.

It has 36GB of RAM for 24 cores. Always plenty free.

It's the disk I'm blaming; it's only 80GB! It filled up when I switched to Theory, so I cleared it out and turned on compression. This freed loads of space, which was promptly used up by more Theory tasks.

The identical computer with a 1TB disk but a very slightly earlier motherboard runs fine. I'm going to put a 1TB disk on the dodgy one as well.
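
For anyone wanting to rule out a full disk before blaming Theory, a minimal Python sketch along these lines reports free space. It uses only the standard library's `shutil.disk_usage`; the helper name `free_space_report` and the 10% warning threshold are arbitrary choices for this example, not LHC@home recommendations.

```python
import shutil


def free_space_report(path=".", warn_below=0.10):
    """Return (free_gb, free_fraction, is_low) for the disk holding `path`."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    free_gb = usage.free / 1e9  # decimal gigabytes
    return free_gb, free_fraction, free_fraction < warn_below


# Check the disk that holds the current directory; point `path` at the
# BOINC data directory instead if that lives on another drive.
free_gb, fraction, is_low = free_space_report(".")
print(f"{free_gb:.1f} GB free ({fraction:.0%}){' - running low' if is_low else ''}")
```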

By the way, the machine with the older motherboard kept crashing when running Theory, with a BSOD in Windows 10. A BIOS update sorted it. I guess Theory really taxes something somewhere.
ID: 43326
Peter Hucker

Message 43327 - Posted: 9 Sep 2020, 14:06:50 UTC - in response to Message 43325.  
Last modified: 9 Sep 2020, 14:07:29 UTC

Just checked the SMART data for the drive (which I'm about to upgrade). Health and performance (as per SpeedFan diagnostics) are at ALMOST full, whereas every other drive I own is at full. But it's showing read errors etc., where none of the others do. Maybe this is pissing off Theory. I'm not even going to put the drive on Freecycle. It's best for the future of mankind if it gets retired to raw material salvage, so it's going in the recycling bin, unless anyone wants to perform last rites first?
ID: 43327
Peter Hucker

Message 43345 - Posted: 12 Sep 2020, 23:17:26 UTC - in response to Message 43325.  

80GB disk replaced with a 2TB disk, fresh Windows install, no I/O problems, yet it will not complete a Theory task! They fail within seconds with the missing output file. What on earth is going on? This machine is identical to another one that manages Theory just fine. Both machines have the latest VirtualBox and extensions installed. Help!
ID: 43345
Peter Hucker

Message 43346 - Posted: 12 Sep 2020, 23:27:41 UTC - in response to Message 43345.  

The stderr output mentions that VT-x is turned off. Weird. I'll check it tomorrow; I'm off to bed. No idea how the BIOS forgot to use VT-x. All that changed in there was that I switched UEFI on and then off again, as it wasn't working, so I had to stick to an MBR installation.
ID: 43346
Erich56

Joined: 18 Dec 15
Posts: 1451
Credit: 35,180,711
RAC: 40,229
Message 43347 - Posted: 13 Sep 2020, 5:18:57 UTC - in response to Message 43346.  

Regarding "They fail within seconds" and "The stderror mentions vt-x is turned off": a task failing within seconds usually indicates that VT-x is off.
ID: 43347
Peter Hucker

Message 43352 - Posted: 13 Sep 2020, 12:51:23 UTC - in response to Message 43347.  
Last modified: 13 Sep 2020, 12:52:51 UTC

I've checked the BIOS and, as I thought, VT-x is still enabled; this machine always used to run Theory fine. There's no reason that merely changing between UEFI and BIOS boot would have disabled it. I'm currently trying switching VT-x off, booting completely, then switching it back on again. If that fails, I'll have to see if there's something in Windows 10 that needs enabling. Do you know of anything? It's a fresh install of Windows 10. Of course, maybe the motherboard is failing. It already refuses to use UEFI: it randomly can't find the boot sector.

By the way, shouldn't LHC have a better warning for an unknowing user who has this problem? Just failing the tasks seems daft. We need to be told "turn VT-x on please".
ID: 43352
Peter Hucker

Message 43353 - Posted: 13 Sep 2020, 13:12:08 UTC - in response to Message 43352.  

That worked. [shrugs shoulders] I turned VT-x off in the BIOS, then turned it back on, and now it functions.

I'll update the BIOS anyway (which was going to be my next attempt at fixing it) - it's several versions out of date.
ID: 43353

©2021 CERN