Message boards : Theory Application : Theory Task doing nothing

CloverField

Joined: 17 Oct 06
Posts: 62
Credit: 23,324,750
RAC: 21,032
Message 42531 - Posted: 17 May 2020, 23:57:35 UTC

I've gotten about four Theory tasks today that seem to be doing nothing. Opening the VM console shows this.

Top shows that nothing is running.

CloverField
Message 42535 - Posted: 18 May 2020, 15:01:55 UTC

I now have two more in my currently running tasks doing the exact same thing.

CloverField
Message 42611 - Posted: 24 May 2020, 17:02:39 UTC

Woke up to 4 more doing that this morning, along with some ATLAS tasks doing nothing.
Are there network problems at CERN?

maeax
Joined: 2 May 07
Posts: 1301
Credit: 39,583,040
RAC: 11,356
Message 42613 - Posted: 25 May 2020, 4:43:14 UTC - in response to Message 42611.  
Last modified: 25 May 2020, 4:57:21 UTC

There must be something wrong with your computer:
You have a sixtrack task for x86 (32-bit) that did not finish:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=139948215
1. Check your OS.
2. Let only sixtrack run (via preferences).

Edit: Sorry, there is an x86 version of sixtrack:
Microsoft Windows (98 or later) running on an Intel x86-compatible CPU

CloverField
Message 42615 - Posted: 25 May 2020, 11:54:14 UTC - in response to Message 42613.  

1. I'm running 64-bit Windows.
2. I will switch to sixtrack only in a moment.

I don't think that task is blocking network connections; off the top of my head, sixtrack doesn't talk to the internet.
I was also able to find the task running away happily.

CloverField
Message 42616 - Posted: 26 May 2020, 15:10:08 UTC

Ran only sixtrack for the day and everything was fine. Now switching back to all projects; I will report if this continues to be an issue with Theory.

maeax
Message 42617 - Posted: 26 May 2020, 16:19:33 UTC - in response to Message 42616.  

700 sixtrack tasks are shown, 5 of them with errors. This is OK.
You have 64 GB of RAM, but you need to keep control of your PC when you mix ATLAS and Theory.
Theory is not as demanding on RAM as ATLAS. You have 8 CPUs for ATLAS.
It is useful to control ATLAS with an app_config.xml, using fewer than 8 CPUs or fewer concurrent ATLAS tasks,
because ATLAS needs good control of the RAM.
There is a lot of help on how to do this in the ATLAS folder of LHC@home.
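
As a rough illustration of the app_config.xml approach described above, the sketch below caps ATLAS at two concurrent tasks and four cores per task. The app name ATLAS is the one used elsewhere in this thread. The plan class vbox64_mt_mcore_atlas, the --nthreads option and the specific numbers are assumptions, so check the plan class against client_state.xml on your own host before using it.

```xml
<app_config>
  <app>
    <name>ATLAS</name>
    <!-- assumed value: at most two ATLAS tasks running at once -->
    <max_concurrent>2</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <!-- assumed plan class; verify against client_state.xml -->
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <!-- assumed value: four cores per ATLAS task instead of eight -->
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>
```

The file goes in the LHC@home project folder and should be picked up after a client restart or via Options > Read config files in BOINC Manager. Fewer cores per task and fewer concurrent tasks both reduce the RAM that ATLAS claims at any one time.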

CloverField
Message 42618 - Posted: 26 May 2020, 16:39:51 UTC - in response to Message 42617.  
Last modified: 26 May 2020, 16:42:25 UTC

Got another one.

I'm pretty sure it's not my RAM either, as I have more than enough.

Even when running all the ATLAS tasks I usually still have around 20 GB free.

Managed to get the task ID for this one. I will abort it and then check the error output.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=275253084

CloverField
Message 42619 - Posted: 26 May 2020, 17:56:04 UTC

2020-05-26 08:08:11 (19788): Error in stop VM for VM: -108
Command:
VBoxManage -q controlvm "boinc_83115c7c7bfa4ba2" savestate
Output:
VBoxManage.exe: error: Machine 'boinc_83115c7c7bfa4ba2' is not currently running

I'm betting this is the problem: it looks like it got interrupted by a bunch of new ATLAS tasks starting up.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1818
Credit: 122,900,938
RAC: 76,423
Message 42620 - Posted: 26 May 2020, 18:11:12 UTC - in response to Message 42619.  

Indeed.
Starting, pausing and restarting too many vbox tasks concurrently can result in an overloaded disk IO.
You may try to limit at least the number of concurrent ATLAS starts as each of them copies a few GB.

CloverField
Message 42621 - Posted: 26 May 2020, 18:22:45 UTC - in response to Message 42620.  

This should work, right?

<app_config>
 <app>
  <name>Theory</name>
  <max_concurrent>28</max_concurrent>
 </app>
 <app>
  <name>ATLAS</name>
  <max_concurrent>2</max_concurrent>
 </app>
</app_config>

computezrmle
Message 42622 - Posted: 26 May 2020, 18:44:56 UTC - in response to Message 42621.  

It's usually a minor problem to run many tasks concurrently but it can become a problem if they change their status.
This happens if you start/restart your BOINC client or even at shutdown when lots of data has to be saved to disk.
Modern computers with lots of cores are more affected as they run more tasks concurrently.

Nobody can really tell what's the best combination on your computer. You'll have to try it out.

CloverField
Message 42623 - Posted: 26 May 2020, 19:15:14 UTC - in response to Message 42622.  

This computer is my main server box; all it does is run LHC@home and, every so often, stream a movie to my TV.
As such, it's configured to run BOINC at 100% and does not suspend when the computer is in use.
All the starts and stops in that last Theory task are actually from when BOINC fetches work.
What usually happens is that it gets a bunch of ATLAS tasks back, and since those have an earlier due date, it stops whatever is currently running and switches to ATLAS.
This happens multiple times a day and ends up killing my tasks.
I think I might be able to fix this by changing the "keep an additional X days of work" setting from 0.25 to 1. Hopefully this keeps enough of a buffer to prevent it from starting and stopping tasks all the time.
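
The buffer change described above can also be made locally in global_prefs_override.xml in the BOINC data directory. A minimal sketch, assuming the "additional days of work" setting corresponds to the work_buf_additional_days element; only the change from 0.25 to 1 comes from the post itself.

```xml
<global_preferences>
  <!-- additional days of work to store: raised from 0.25 to 1 as described above -->
  <work_buf_additional_days>1.0</work_buf_additional_days>
</global_preferences>
```

The client should pick this up after a restart or via Options > Read local prefs file in BOINC Manager. Values in this file take precedence over the web preferences on that host only.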

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1046
Credit: 6,601,227
RAC: 256
Message 42624 - Posted: 26 May 2020, 19:42:36 UTC - in response to Message 42623.  

That hits the nail on the head.
Since the last Theory update, the fictitious estimated runtime went from 100 hours to 10 days.
The best solution would be for Laurence to fix this, but for the time being you can change it yourself by editing Theory_2019_10_01.xml in the LHC project folder.
Change the job_duration value from 864000 to 360000.
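
To make the numbers concrete: 864000 seconds is 10 x 86400, or 10 days, and 360000 seconds is 100 x 3600, or 100 hours, which matches the runtime estimates mentioned above. A sketch of the edited entry, assuming the file follows the usual vboxwrapper job layout with a vbox_job root element; every other element in the file stays as it is.

```xml
<vbox_job>
  <!-- all other elements of Theory_2019_10_01.xml stay unchanged (layout assumed) -->
  <!-- was 864000 (10 days); 360000 s = 100 hours -->
  <job_duration>360000</job_duration>
</vbox_job>
```

The project server may replace a locally edited file, for example after a project reset, so the change may need to be reapplied.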

CloverField
Message 42625 - Posted: 26 May 2020, 23:53:12 UTC - in response to Message 42624.  

I already made that change on your advice over in Number crunching.
At the time I thought the issue was limited to CMS tasks.

However, it seems that getting 1 day of work and then reducing the buffer to 0.25 days has fixed the issue, as it effectively stops BOINC from getting new ATLAS tasks.
I could probably get the same result with the "no new work" button.

maeax
Message 42627 - Posted: 27 May 2020, 9:28:30 UTC

You have had successful tasks for ATLAS, CMS and Theory in the last few days.

What happens when you let only sixtrack and ONE VM task (ATLAS, CMS or Theory) run, with all other VM tasks suspended?
Does this task run normally and finish correctly?

There are plenty of sixtrack tasks for the other 31 CPUs at the moment.

CloverField
Message 42641 - Posted: 28 May 2020, 1:34:31 UTC - in response to Message 42627.  

Yeah, this would also work. I've just been more focused on trying to do as much work as fast as possible, lol.

Allowing each VM-based task type to run one instance and then filling the rest with sixtrack would probably be the best way going forward.
That, or I build a new computer and dedicate it to ATLAS only, as it seems to be the problem child with its short deadlines.
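
The "one instance per VM-based task type" idea could be written as an app_config.xml along these lines. This is only a sketch: the app names ATLAS and Theory are the ones used earlier in this thread, while CMS is an assumed name that should be checked against client_state.xml. sixtrack gets no entry, so it stays unrestricted and fills the remaining cores.

```xml
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <!-- assumed app name; verify against client_state.xml -->
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>Theory</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
```

With only one VM of each type running, far fewer VMs are started, paused and saved at the same time, which is the disk IO load discussed earlier in the thread.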

CloverField
Message 42775 - Posted: 2 Jun 2020, 16:07:30 UTC
Last modified: 2 Jun 2020, 16:08:38 UTC

OK, got another one that was just stuck there with the same message.
This time it was not due to task switching.

Could it be due to the squid cache that I set up earlier?

Hopefully this will update to something more helpful than "aborted by user".
https://lhcathome.cern.ch/lhcathome/result.php?resultid=275990643

CloverField
Message 42781 - Posted: 2 Jun 2020, 18:37:59 UTC - in response to Message 42775.  

Just restarted squid for ATLAS; I'll see if this fixes the Theory issues as well.

computezrmle
Message 42782 - Posted: 2 Jun 2020, 18:53:41 UTC

If it happens again you may consider a project reset to ensure you get a fresh theory vdi.


©2021 CERN