1) Message boards : ATLAS application : Running (0.517 CPUs) (Message 43524)
Posted 23 Oct 2020 by captainjack
Post:
Just got another one of these. The task says it is "Running (0.849 CPUs)".

More tasks get started than the system can support, and the ATLAS task gets suspended. I had to add an app_config.xml to restrict some of the other work before the ATLAS task would restart.

Does anybody besides me think this is a problem?
2) Message boards : ATLAS application : Running (0.517 CPUs) (Message 43390)
Posted 22 Sep 2020 by captainjack
Post:
The machine I used for this test has 12 threads and 15.9 GB of memory.

I let LHC download 6 single-core ATLAS tasks and started them one at a time. After each task had time to download additional data and go through all the initialization steps, I checked memory usage and started the next task.

With one task running, Windows plus the task was using 6.6 GB of memory.
With two tasks running, Windows plus 2 tasks were using 10.8 GB of memory.
With three tasks running, Windows plus 3 tasks were using 14.8 GB of memory.
With four tasks running, memory usage got up to 15.8 GB; the system started banging away on the swap file, locked up, and rebooted itself.

None of the tasks started with a partial CPU allocation.

Once the system came back up, I limited LHC to 3 concurrent tasks and will let them run to completion with no other tasks running.
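The readings above can be turned into a rough capacity estimate. This is a minimal sketch that assumes memory use grows roughly linearly with each added task (an assumption based only on these three data points; real ATLAS jobs vary) and treats the extrapolated zero-task figure as the Windows baseline. The names `per_task`, `baseline` and `predicted_usage` are illustrative.

```python
import math

# Total memory in use (GB) with 1, 2 and 3 single-core ATLAS tasks running,
# as measured in the post above.
readings = {1: 6.6, 2: 10.8, 3: 14.8}
total_ram_gb = 15.9

# Average memory added per extra task (assumes roughly linear growth).
per_task = (readings[3] - readings[1]) / (3 - 1)

# Extrapolated memory use with zero tasks, i.e. Windows on its own.
baseline = readings[1] - per_task


def predicted_usage(n_tasks: int) -> float:
    """Estimated total memory in GB with n_tasks ATLAS tasks running."""
    return baseline + per_task * n_tasks


# Largest task count whose predicted usage stays within physical RAM.
max_tasks = math.floor((total_ram_gb - baseline) / per_task)

print(f"~{per_task:.1f} GB per task, baseline ~{baseline:.1f} GB")
print(f"4 tasks -> ~{predicted_usage(4):.1f} GB of {total_ram_gb} GB")
print(f"max tasks that fit: {max_tasks}")
```

On these numbers the estimate comes out at three tasks, consistent with the limit chosen after the reboot, and a fourth task would need well over the 15.9 GB installed.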
3) Message boards : ATLAS application : Running (0.517 CPUs) (Message 43389)
Posted 21 Sep 2020 by captainjack
Post:
djoser suggested:

You could try to set those on hold (or even remove them from BOINC) and see how ATLAS behaves with no other projects interfering.

Good question. I will let the queue drain down to empty on one of my machines, then let it download as many single-CPU ATLAS tasks as it wants and see what happens. I already know it will be constrained by memory, but it will be interesting to see the result.
4) Message boards : ATLAS application : Running (0.517 CPUs) (Message 43387)
Posted 21 Sep 2020 by captainjack
Post:
If anybody wants to look into this further, the task number is 283389454 and the work unit number is 145065835.

I just aborted the task after it ran for 4 days 15 hours 56 min 18 sec without a successful completion.

I would be glad to help test a possible solution; just let me know when and where.
5) Message boards : ATLAS application : Running (0.517 CPUs) (Message 43385)
Posted 21 Sep 2020 by captainjack
Post:
djoser,

Thanks for the reply.

Yes, I am running the VirtualBox version of ATLAS, and yes, I have been through Yeti's checklist.

I have a screen capture of BOINC Manager showing one of the ATLAS tasks in question using 0.517 CPUs. I would be glad to send it to a project admin if someone will tell me where to send it.
6) Message boards : ATLAS application : Running (0.517 CPUs) (Message 43383)
Posted 21 Sep 2020 by captainjack
Post:
Additional information: it appears to me that since BOINC Manager on the desktop thinks the ATLAS task only needs a fraction of a CPU, BOINC Manager will start more tasks than there are CPUs available. When all the started tasks get going and ask for the resources they need, tasks get suspended and restarted frequently. When an ATLAS task has been running longer than any of the other tasks, it gets suspended and restarted more often than any other task. When this happens, ATLAS tasks take much longer than they would if left uninterrupted.
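To illustrate the reasoning above, here is a toy model, not BOINC's actual scheduler. It assumes the client starts tasks in order while the sum of their *reserved* CPU fractions fits within the thread count, and it assumes (hypothetically) that each ATLAS task shown as 0.517 CPUs really keeps one full thread busy. The function `admit_tasks` and the task mix are made up for illustration.

```python
from typing import List, Tuple

# (name, reserved_cpus as shown in BOINC Manager, cpus actually used)
Task = Tuple[str, float, float]


def admit_tasks(queue: List[Task], ncpus: int) -> Tuple[List[str], float, float]:
    """Start tasks in order while the reserved CPU total stays within ncpus.

    Returns the names started, the reserved CPU total and the actual CPU demand.
    """
    started: List[str] = []
    reserved_total = 0.0
    actual_total = 0.0
    for name, reserved, actual in queue:
        if reserved_total + reserved > ncpus:
            continue  # would exceed the thread count on paper; leave it waiting
        started.append(name)
        reserved_total += reserved
        actual_total += actual
    return started, reserved_total, actual_total


# Hypothetical mix on a 12-thread machine: 3 ATLAS tasks reported at
# 0.517 CPUs that really use 1.0 each, plus 12 single-thread tasks.
queue = [(f"atlas_{i}", 0.517, 1.0) for i in range(3)]
queue += [(f"other_{i}", 1.0, 1.0) for i in range(12)]

started, reserved, actual = admit_tasks(queue, ncpus=12)
print(len(started), round(reserved, 3), actual)  # 13 tasks started on 12 threads
```

In this model the reserved total stays under 12, yet 13 tasks compete for 12 threads, which is the kind of oversubscription that would lead to the suspend/restart cycling described above.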
7) Message boards : ATLAS application : Running (0.517 CPUs) (Message 43373)
Posted 19 Sep 2020 by captainjack
Post:
Got an ATLAS task that shows it is running using 0.517 CPUs. Is this normal?

If you need more info, please let me know.
8) Questions and Answers : Preferences : Use web preferences PROBLEM (Message 43318)
Posted 7 Sep 2020 by captainjack
Post:
Enzo Ricci,

The BOINC Manager on your desktop will use the web preferences of whichever project (WCG or LHC) saved them last. Since you added WCG, those are the preferences that were saved last. My suggestion would be to open your preferences for LHC and save them (giving them the current time stamp), then issue an update in the BOINC Manager on your desktop.

Let us know if that works.
9) Message boards : Number crunching : Local control of which subprojects run`2 (Message 37276)
Posted 8 Nov 2018 by captainjack
Post:
pls,

The app_config.xml file does not control which tasks get downloaded. All the app_config.xml file does is control how the downloaded tasks run (number of concurrent tasks, memory size, etc.). The only way to control which tasks get downloaded is through the project preferences.
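As an illustration of the concurrency limits mentioned above, a minimal sketch that writes app_config.xml text using the `<max_concurrent>` (per app) and `<project_max_concurrent>` (per project) elements from the BOINC client configuration docs. The limits are placeholders, the helper name `build_app_config` is hypothetical, and the output only affects how downloaded tasks run, not which tasks the server sends.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_app_config(app_limits: Dict[str, int],
                     project_max: Optional[int] = None) -> str:
    """Return app_config.xml text that caps concurrent tasks.

    app_limits maps an app name (as used in client_state.xml) to its
    <max_concurrent> value; project_max, if given, becomes
    <project_max_concurrent> for the whole project.
    """
    root = ET.Element("app_config")
    for name, limit in app_limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(limit)
    if project_max is not None:
        ET.SubElement(root, "project_max_concurrent").text = str(project_max)
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Placeholder limits: at most 2 ATLAS tasks, at most 3 tasks from this project.
print(build_app_config({"ATLAS": 2}, project_max=3))
```

The resulting file would go in the project's folder under the BOINC data directory.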

A complete description of options for the app_config.xml file can be found at the bottom of https://boinc.berkeley.edu/wiki/client_configuration

Hope that helps.
10) Questions and Answers : Preferences : Stoping delivery of new tasks don't work in LHC@home and Seti@home (Message 36701)
Posted 13 Sep 2018 by captainjack
Post:
Carlos,

In the BOINC Manager, on the "Tasks" tab, there is a command in the left panel for "Show Active Tasks" or "Show All Tasks". If it is set to show only active tasks, you may have tasks that are downloaded but do not show up until they become active. Please check whether it is set to "Show All Tasks".

Hope that helps.
11) Message boards : Theory Application : Theory and app_config ? (Message 36670)
Posted 8 Sep 2018 by captainjack
Post:
You are welcome, glad you got it working.

One other clue that you might find useful in the future: if there is any doubt about which names to use in app_config.xml, you can find many of the names for the current version in the client_state_prev.xml file in the BOINC data folder.
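Looking up those names can also be scripted. A minimal sketch, assuming the file parses as well-formed XML and that each `<app_version>` carries `<app_name>` and (optionally) `<plan_class>` children, as in the excerpts quoted elsewhere in this thread. The function name and the trimmed sample are hypothetical; to use it on a real file, pass in the text of client_state_prev.xml from your BOINC data folder.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def list_app_versions(xml_text: str) -> List[Tuple[str, str]]:
    """Return sorted (app_name, plan_class) pairs found in client state XML."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for av in root.iter("app_version"):
        app_name = (av.findtext("app_name") or "").strip()
        plan_class = (av.findtext("plan_class") or "").strip()  # may be absent
        pairs.add((app_name, plan_class))
    return sorted(pairs)


# Hand-written, trimmed sample for illustration only; a real
# client_state_prev.xml contains many more elements.
sample = """<client_state>
  <app><name>Theory</name></app>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_mt_mcore</plan_class>
    <avg_ncpus>1.000000</avg_ncpus>
  </app_version>
</client_state>"""

for app_name, plan_class in list_app_versions(sample):
    print(app_name, plan_class)
```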

Good luck with your crunching.
12) Message boards : Theory Application : Theory and app_config ? (Message 36667)
Posted 8 Sep 2018 by captainjack
Post:
Hi Yeti,

Try the following app_config.xml:

<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>Theory</app_name>
    <avg_ncpus>1.0</avg_ncpus>
    <plan_class>vbox64_mt_mcore</plan_class>
    <cmdline>--nthreads 1</cmdline>
  </app_version>
</app_config>


I just put this together, and it appears to be working, running Theory on one thread.

The two main differences are the plan_class and the addition of a cmdline for --nthreads.

Also, in your app_config, there are several parameters that I do not find in the BOINC documentation for client configuration http://boinc.berkeley.edu/wiki/client_configuration.

I do not know how the undocumented parameters in your app_config affect the client configuration. My suggestion is to use the minimum set of parameters until you find something that works, then add parameters.
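Checking an app_config.xml for parameters that aren't in the documentation can be done roughly like this. A sketch only: the `ALLOWED` table is my own transcription of the app_config.xml section of the BOINC client configuration page and may be incomplete or out of date, so verify it against the wiki before relying on the result. The function name `unknown_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set

# Element names per parent, transcribed (as an assumption) from the BOINC
# client configuration docs; check the wiki for additions or changes.
ALLOWED: Dict[str, Set[str]] = {
    "app_config": {"app", "app_version", "project_max_concurrent",
                   "report_results_immediately"},
    "app": {"name", "max_concurrent", "report_results_immediately",
            "fraction_done_exact", "gpu_versions"},
    "gpu_versions": {"gpu_usage", "cpu_usage"},
    "app_version": {"app_name", "plan_class", "avg_ncpus", "ngpus", "cmdline"},
}


def unknown_tags(xml_text: str) -> List[str]:
    """Return 'parent/child' paths for elements not listed in ALLOWED."""
    root = ET.fromstring(xml_text)
    if root.tag != "app_config":
        return [root.tag]
    problems: List[str] = []

    def walk(parent: ET.Element) -> None:
        allowed = ALLOWED.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                problems.append(f"{parent.tag}/{child.tag}")
            elif child.tag in ALLOWED:  # only container elements are walked
                walk(child)

    walk(root)
    return problems


# The Theory app_config suggested earlier in this post.
theory_cfg = """<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>Theory</app_name>
    <avg_ncpus>1.0</avg_ncpus>
    <plan_class>vbox64_mt_mcore</plan_class>
    <cmdline>--nthreads 1</cmdline>
  </app_version>
</app_config>"""

print(unknown_tags(theory_cfg))
```

An empty list means every element in the file appears in the table; anything reported is a candidate to remove while narrowing down a working configuration.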

Also, just in case you didn't know, when you add parameters to an app_config, you can activate the new parameters by clicking on "Options", "Read Config Files" in the BOINC Manager. If you need to remove parameters, it is best to shut down BOINC and start it back up with the new parameters in place.

Hope that helps, let us know how it turns out.
13) Message boards : ATLAS application : VirtualBox 5.2 (Message 36363)
Posted 9 Aug 2018 by captainjack
Post:
Machine is running Windows 10 - 1803, VB 5.2.16 and BOINC 7.12.1.

Tried running a 3-core ATLAS task, a 2-core Theory task, and a 1-core LHCb task at the same time.

The Stderr.txt for the ATLAS task had the following error message:
2018-08-09 12:57:11 (1988): Error creating VirtualBox instance! rc = 0x80004002

While the tasks were processing, the machine stopped running. When I restarted it, the Event Log had this message:

8/9/2018 3:41:13 PM | LHC@home | [error] no project URL in task state file

The ATLAS task finished and validated; the Theory and LHCb tasks are still running.

Let me know if you need more information.
14) Questions and Answers : Getting started : Issues changing email address (Message 35244)
Posted 12 May 2018 by captainjack
Post:
I can't change my email address here or at the test site either. Maybe when they fix it here, it will work there too.
15) Message boards : ATLAS application : Download failures (Message 32824)
Posted 13 Oct 2017 by captainjack
Post:
The task fetch seems to ignore the "Max # CPUs" preference. For computer 10476963, Max # CPUs was changed to 2, but the server keeps sending 4-core tasks. The client_state.xml says:

<app_version>
<app_name>ATLAS</app_name>
<version_num>101</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>4.000000</avg_ncpus>
<max_ncpus>2.000000</max_ncpus>


It seems odd that max_ncpus is 2 but avg_ncpus is 4.
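To spot the same mismatch in other `<app_version>` entries, a quick sketch that compares `avg_ncpus` with `max_ncpus` wherever both are present. It assumes the client state text parses as XML; the function name `ncpus_mismatches` is hypothetical, and the sample is the excerpt quoted above with a closing tag added so it parses on its own.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def ncpus_mismatches(xml_text: str) -> List[Tuple[str, str, float, float]]:
    """List (app_name, version_num, avg_ncpus, max_ncpus) where avg exceeds max."""
    root = ET.fromstring(xml_text)
    out = []
    for av in root.iter("app_version"):  # includes root itself if it matches
        avg = av.findtext("avg_ncpus")
        mx = av.findtext("max_ncpus")
        if avg is None or mx is None:
            continue  # only compare when both values are present
        if float(avg) > float(mx):
            out.append((av.findtext("app_name", ""),
                        av.findtext("version_num", ""),
                        float(avg), float(mx)))
    return out


# The excerpt quoted above, closed off so it parses on its own.
excerpt = """<app_version>
<app_name>ATLAS</app_name>
<version_num>101</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>4.000000</avg_ncpus>
<max_ncpus>2.000000</max_ncpus>
</app_version>"""

print(ncpus_mismatches(excerpt))
```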

Computer is using the default preferences.

Please let me know if I can provide more information.
16) Message boards : Number crunching : Less boinc credits than on other projects? (Message 30502)
Posted 26 May 2017 by captainjack
Post:
RaimundD,

Just to make sure you know, WCG takes the BOINC points and multiplies them by 7 to get WCG points. If you want to know how many BOINC points you get at WCG, you can check one of the stats aggregator websites, like BOINCstats.
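The conversion is simple arithmetic; a tiny sketch using the factor of 7 stated above (the function names are illustrative):

```python
WCG_MULTIPLIER = 7  # WCG points = BOINC credit x 7, as stated above


def boinc_from_wcg(wcg_points: float) -> float:
    """Convert WCG points back to BOINC credit."""
    return wcg_points / WCG_MULTIPLIER


def wcg_from_boinc(boinc_credit: float) -> float:
    """Convert BOINC credit to WCG points."""
    return boinc_credit * WCG_MULTIPLIER


print(boinc_from_wcg(7000))  # 1000.0
```

So a WCG total should be divided by 7 before comparing it with credit from LHC@home or other BOINC projects.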
17) Message boards : ATLAS application : New app version 1.01 (Message 29178)
Posted 10 Mar 2017 by captainjack
Post:
Just tried one on Linux. Task ran for about 20 minutes then got this:

2017-03-10 12:49:18 (8776): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_5904_1489171051/PandaJob_3273309522_1489171055/athena_stdout.txt -
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-10 12:38:27,950 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-10 12:38:27,952 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-03-10 12:38:27,952 INFO Valgrind not engaged
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-10 12:38:27,952 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.trfExe.execute 2017-03-10 12:38:27,952 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.trfExe.execute 2017-03-10 12:46:25,442 INFO EVNTtoHITS executor returns 65
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.trfExe.validate 2017-03-10 12:46:26,351 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.trfExe.validate 2017-03-10 12:46:26,365 INFO Scanning logfile log.EVNTtoHITS for errors
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.transform.execute 2017-03-10 12:46:26,588 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-03-10 12:49:18 (8776): Guest Log: PyJobTransforms.transform.execute 2017-03-10 12:46:29,792 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")

Task number 124796186

Let me know if you need more info.
18) Message boards : LHCb Application : Low CPU usage (Message 28087)
Posted 8 Dec 2016 by captainjack
Post:
Getting nothing but these error messages.

2016-12-08 15:01:50 (22444): Guest Log: [INFO] Job finished in slot1 with unknown exit code.


And no CPU usage.

Turning these off until I hear that they are working again.
19) Message boards : LHCb Application : Condor exited after 608s without running a job (Message 27989)
Posted 28 Nov 2016 by captainjack
Post:
Looks like it is working for me now. The task has made it past the 608 second mark and is using a full CPU thread.

Thanks for getting the image updated.

Will post again if anything changes.
20) Message boards : Number crunching : "New" project, old problem (LHCb) (Message 27971)
Posted 27 Nov 2016 by captainjack
Post:
jjv,

Yes, it is a known problem and has been reported on the "LHCb Application" message board. The virtual machine can't communicate with the HTCondor server, so it waits 600+ seconds and then aborts. My recommendation would be to turn it off and monitor this post https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4014&postid=27898 to see when the project admins get it fixed.




©2021 CERN