Thread 'Theory Task doing nothing'

Author	Message
CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42531 - Posted: 17 May 2020, 23:57:35 UTC I've gotten about four Theory tasks today that seem to be doing nothing; showing the VM console reveals this. Top shows that nothing is running. ID: 42531 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42535 - Posted: 18 May 2020, 15:01:55 UTC I now have two more in my currently running tasks doing the exact same thing. ID: 42535 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42611 - Posted: 24 May 2020, 17:02:39 UTC Woke up to 4 more doing that this morning, along with some ATLAS tasks doing nothing. Are there network problems at CERN? ID: 42611 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2286 Credit: 178,851,089 RAC: 2,010	Message 42613 - Posted: 25 May 2020, 4:43:14 UTC - in response to Message 42611. Last modified: 25 May 2020, 4:57:21 UTC There must be something wrong with your computer: you have a sixtrack task with an x86 (32-bit) application, and it did not finish: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=139948215 1. Check your OS. 2. Let only sixtrack run (prefs). Edit: Sorry, there is an x86 version running in sixtrack: Microsoft Windows (98 or later) running on an Intel x86-compatible CPU. ID: 42613 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42615 - Posted: 25 May 2020, 11:54:14 UTC - in response to Message 42613. 1. Running 64-bit Windows. 2. I will switch to sixtrack only in a moment. I don't think that task is blocking network connections; off the top of my head, sixtrack doesn't talk to the internet. I was also able to find the task running away happily. ID: 42615 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42616 - Posted: 26 May 2020, 15:10:08 UTC Ran only sixtrack for the day and everything was fine. Now switching back to all projects; I will report if this continues to be an issue with Theory. ID: 42616 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2286 Credit: 178,851,089 RAC: 2,010	Message 42617 - Posted: 26 May 2020, 16:19:33 UTC - in response to Message 42616. 700 sixtrack tasks and 5 with errors are shown. This is OK. You have 64 GB of RAM and need to manage your PC's resources when you mix ATLAS and Theory. Theory is not as demanding on RAM as ATLAS. You have 8 CPUs for ATLAS. It is useful to control ATLAS with an app_config.xml, using fewer than 8 CPUs or running fewer ATLAS tasks at once, because ATLAS needs careful control of the RAM. The ATLAS folder of LHCathome has a lot of help on how to do this. ID: 42617 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42618 - Posted: 26 May 2020, 16:39:51 UTC - in response to Message 42617. Last modified: 26 May 2020, 16:42:25 UTC Got another one. I'm pretty sure it's not my RAM either, as I have more than enough. Even when running all the ATLAS tasks I usually still have around 20 GB free. Managed to get the task ID for this one. Will abort it and then check the error output. https://lhcathome.cern.ch/lhcathome/result.php?resultid=275253084 ID: 42618 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42619 - Posted: 26 May 2020, 17:56:04 UTC
2020-05-26 08:08:11 (19788): Error in stop VM for VM: -108
Command:
VBoxManage -q controlvm "boinc_83115c7c7bfa4ba2" savestate
Output:
VBoxManage.exe: error: Machine 'boinc_83115c7c7bfa4ba2' is not currently running
I'm betting this is the problem. It looks like it got interrupted by a bunch of new ATLAS tasks starting up. ID: 42619 · Reply Quote
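
To check whether other stuck tasks show the same symptom, the stderr output of each result can be scanned for this pattern. The following is only a minimal Python sketch, assuming the log text looks like the excerpt above. The function name `summarize_vm_errors` and the `SAMPLE` string are made up for illustration and are not part of BOINC or vboxwrapper.

```python
import re

# Matches lines such as "Error in stop VM for VM: -108".
ERROR_RE = re.compile(r"Error in (?P<action>.+?) for VM: (?P<code>-?\d+)")
# Matches the VBoxManage complaint about a VM that is not running.
NOT_RUNNING_RE = re.compile(r"Machine '(?P<vm>[^']+)' is not currently running")


def summarize_vm_errors(log_text):
    """Return the failed VM actions and the VMs reported as not running."""
    errors = [
        (m.group("action"), int(m.group("code")))
        for m in ERROR_RE.finditer(log_text)
    ]
    not_running = sorted({m.group("vm") for m in NOT_RUNNING_RE.finditer(log_text)})
    return {"errors": errors, "not_running": not_running}


# Hypothetical sample, copied from the excerpt quoted in this thread.
SAMPLE = (
    "2020-05-26 08:08:11 (19788): Error in stop VM for VM: -108\n"
    "Command:\n"
    'VBoxManage -q controlvm "boinc_83115c7c7bfa4ba2" savestate\n'
    "Output:\n"
    "VBoxManage.exe: error: Machine 'boinc_83115c7c7bfa4ba2' is not currently running\n"
)

if __name__ == "__main__":
    print(summarize_vm_errors(SAMPLE))
```

Pasting the stderr text of a result page into `summarize_vm_errors` would show whether the same "stop VM" failure appears in tasks that were not interrupted by task switching.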

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 300,130,996 RAC: 45,310	Message 42620 - Posted: 26 May 2020, 18:11:12 UTC - in response to Message 42619. Indeed. Starting, pausing and restarting too many vbox tasks concurrently can result in an overloaded disk IO. You may try to limit at least the number of concurrent ATLAS starts as each of them copies a few GB. ID: 42620 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42621 - Posted: 26 May 2020, 18:22:45 UTC - in response to Message 42620. This should work, right?
<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>28</max_concurrent>
  </app>
  <app>
    <name>ATLAS</name>
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>
ID: 42621 · Reply Quote
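
One quick sanity check for a file like this is to confirm it is well-formed XML and to read back the limits it sets. Below is a minimal Python sketch using the standard library parser. It assumes the file content is exactly the config posted above, and the helper name `read_max_concurrent` is invented for this example. It only checks the XML structure; it cannot tell whether the BOINC client accepts the app names.

```python
import xml.etree.ElementTree as ET

# The config from the post above, embedded as a string for the example.
APP_CONFIG = """<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>28</max_concurrent>
  </app>
  <app>
    <name>ATLAS</name>
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>
"""


def read_max_concurrent(xml_text):
    """Parse an app_config document and return {app name: max_concurrent}."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "app_config":
        raise ValueError(f"unexpected root element: {root.tag}")
    limits = {}
    for app in root.findall("app"):
        name = app.findtext("name")
        limit = app.findtext("max_concurrent")
        if name is None or limit is None:
            continue  # skip <app> entries that set no concurrency limit
        limits[name.strip()] = int(limit)
    return limits


if __name__ == "__main__":
    print(read_max_concurrent(APP_CONFIG))
```

With the values shown, up to 28 Theory tasks and 2 ATLAS tasks would be allowed to run at once, which matches the advice to limit concurrent ATLAS starts.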

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 300,130,996 RAC: 45,310	Message 42622 - Posted: 26 May 2020, 18:44:56 UTC - in response to Message 42621. It's usually a minor problem to run many tasks concurrently but it can become a problem if they change their status. This happens if you start/restart your BOINC client or even at shutdown when lots of data has to be saved to disk. Modern computers with lots of cores are more affected as they run more tasks concurrently. Nobody can really tell what's the best combination on your computer. You'll have to try it out. ID: 42622 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42623 - Posted: 26 May 2020, 19:15:14 UTC - in response to Message 42622. So this computer is my main server box; all it does is LHC@home and, every so often, stream a movie to my TV. As such it's configured to run BOINC 100% and does not suspend when the computer is in use. All the starts and stops in that last Theory task are actually from when BOINC goes to fetch work. What usually happens is that it gets a bunch of ATLAS tasks back, and since those have an earlier due date it stops whatever is currently running and switches to ATLAS. This happens multiple times a day and ends up killing my tasks. I think I might be able to fix this by changing the "keep an additional x days of work" setting from .25 to 1; hopefully this keeps enough of a buffer to prevent it from starting and stopping tasks all the time. ID: 42623 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1534 Credit: 10,042,485 RAC: 973	Message 42624 - Posted: 26 May 2020, 19:42:36 UTC - in response to Message 42623. That hits the nail on the head. Since the last Theory update the nominal estimated runtime went from 100 hours to 10 days. The best solution would be for Laurence to fix this, but for the time being you can change it yourself by editing Theory_2019_10_01.xml in LHC's project folder: change the job_duration value from 864000 to 360000. ID: 42624 · Reply Quote
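
The two numbers are durations in seconds: 864000 s is 240 hours (10 days) and 360000 s is 100 hours. Here is a minimal Python sketch of the conversion and of the edit itself. It assumes the value sits in a `<job_duration>` element; the surrounding `<vbox_job>` wrapper in `SAMPLE` and both function names are hypothetical and only serve the example.

```python
import re

# Assumed form of the setting: <job_duration>SECONDS</job_duration>
JOB_DURATION_RE = re.compile(r"(<job_duration>)\s*\d+\s*(</job_duration>)")


def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600


def set_job_duration(xml_text, new_seconds):
    """Return xml_text with the single job_duration value replaced."""
    new_text, count = JOB_DURATION_RE.subn(
        lambda m: f"{m.group(1)}{new_seconds}{m.group(2)}", xml_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one job_duration element, found {count}")
    return new_text


# Hypothetical file content; the real file contains more settings.
SAMPLE = "<vbox_job>\n  <job_duration>864000</job_duration>\n</vbox_job>\n"

if __name__ == "__main__":
    print(seconds_to_hours(864000) / 24)  # old value, in days
    print(seconds_to_hours(360000))       # new value, in hours
    print(set_job_duration(SAMPLE, 360000))
```

It is probably sensible to keep a backup copy of the original file before changing it by hand or with a script.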

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42625 - Posted: 26 May 2020, 23:53:12 UTC - in response to Message 42624. I already made that change on your advice over in Number Crunching. At the time I thought the issue was limited to CMS tasks. However, it seems that getting 1 day of work and then reducing the buffer to .25 days has fixed the issue, as it effectively stops BOINC from getting new ATLAS tasks. I could probably get the same result with the no new work button. ID: 42625 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2286 Credit: 178,851,089 RAC: 2,010	Message 42627 - Posted: 27 May 2020, 9:28:30 UTC You have had successful tasks for ATLAS, CMS and Theory in the last few days. What happens when you let only sixtrack and ONE VM task (ATLAS, CMS or Theory) run, with all other VM tasks suspended? Does this task run normally and finish correctly? There are many sixtrack tasks for the other 31 CPUs at the moment. ID: 42627 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42641 - Posted: 28 May 2020, 1:34:31 UTC - in response to Message 42627. Yeah, this would also work. I've just been more focused on trying to do as much work as fast as possible, lol. Allowing each VM-based task type to run one instance and then filling the rest with sixtrack would probably be the best way going forward. That, or I build a new computer and dedicate it to ATLAS only, as it seems to be the problem child with its quick deadlines. ID: 42641 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42775 - Posted: 2 Jun 2020, 16:07:30 UTC Last modified: 2 Jun 2020, 16:08:38 UTC OK, got another one that was just stuck there with the same message. This time it was not due to task switching. Could it be due to the squid cache that I set up earlier? Hopefully this will update to something more helpful than "aborted by user". https://lhcathome.cern.ch/lhcathome/result.php?resultid=275990643 ID: 42775 · Reply Quote

CloverField Send message Joined: 17 Oct 06 Posts: 99 Credit: 64,977,798 RAC: 19,376	Message 42781 - Posted: 2 Jun 2020, 18:37:59 UTC - in response to Message 42775. Just restarted squid for ATLAS; I'll see if this fixes the Theory issues as well. ID: 42781 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 300,130,996 RAC: 45,310	Message 42782 - Posted: 2 Jun 2020, 18:53:41 UTC If it happens again you may consider a project reset to ensure you get a fresh theory vdi. ID: 42782 · Reply Quote