21) Message boards : Number crunching : Issues Controlling Number of Threads Used (Message 42146)
Posted 12 Apr 2020 by Gunde
Post:
Not sure why I saw one of the 1950X systems running more threads than the system has. Well, actually the sum of the MT allocation for the Atlas tasks plus the other tasks was greater than the number of threads, but maybe the actual usage was lower. When in this state, it would sporadically stop tasks and indicate it was waiting for memory. Seems ok now. Perhaps an app_config.xml is needed here.


I have experienced this too with VirtualBox tasks that are MT. It mostly corrects itself after a few minutes, once the VM has started up and is running jobs. In this stage it can break, but only because of fast switching on or off.
ONLY use an app_config.xml if you are sure you need to change how BOINC handles an application. It used to be common to limit RAM or threads, but that is not needed today; the defaults are mostly the best and effective enough. I use 4 threads for ATLAS, but the same setting in the project preferences does the same thing. The only thing I still need it for is setting the maximum number of tasks running concurrently when I mix my contribution with other projects; a minimal sketch is below.
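Treat the sketch purely as an illustration: the data-directory path, the app name "ATLAS", the plan class "vbox64_mt_mcore_atlas" and the numbers are assumptions of mine, so check client_state.xml on your own host for the exact names. max_concurrent caps how many ATLAS tasks run at once, and avg_ncpus gives each VM 4 threads.

# path used by the Ubuntu boinc-client package; adjust to your setup
sudo tee /var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome/app_config.xml > /dev/null <<'EOF'
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>
EOF
# then Options -> Read config files in the BOINC Manager, or restart the client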

I did have a VM become corrupt on the Epyc system. Not sure why, but perhaps an upgrade to VirtualBox is needed. What is the best version people are using on Ubuntu?


Your Epyc system looks great, and VirtualBox (Version: 5.2.34) is solid enough; it could hardly be doing better than it is now (Valid (804) · Invalid (0) · Error (1)). If you update, you may just trade it for other issues.
My experience on big systems is that later versions do not necessarily handle many VMs running concurrently any better, so play it safe and don't update while it works.
Both Linux and Windows have suffered from this.

The BOINC client handles things well in normal operation as long as it doesn't get a bulk of tasks starting and stopping at once. That can happen in your case when it corrects itself in allocating threads, or when BOINC reaches the high-priority stage to hold a deadline.
The more you mix applications/projects with different deadlines and resource shares, the more likely issues become at some point, but it is not always a problem. When I used an older version of VirtualBox on Ubuntu it could handle around 30 VMs starting/stopping concurrently, but above that the system crashed, the boinc-client panicked and all tasks got corrupted.

One more question, if I set MT to 4 CPUs, will it select cores based on NUMA nodes?


I have little knowledge of this. I only tested on Windows Server 2016 to try to learn about NUMA and its effect on BOINC, and failed to get any real understanding. I have not seen any way to select cores by NUMA node or to divide the load across NUMA or UMA nodes, other than experimenting with settings. As I understand it, the BOINC client only gets system info as the total number of cores or HT/SMT threads available and the total amount of RAM.
This is probably more specific to the system: how the kernel places the working set and how the microcode handles it under a normal workload.
22) Message boards : Number crunching : Issues Controlling Number of Threads Used (Message 42141)
Posted 12 Apr 2020 by Gunde
Post:
I don't have any experience of this, TBar, but I would suggest using the recommended client from https://boinc.berkeley.edu/download.php even if yours works right now. Changes made in the BOINC Manager create an override setting that is used specifically for that host and ignores other changes. Check the amount of disk and RAM needed, as the tasks at LHC run their work in a VM environment using VirtualBox, or use CVMFS. It may not be a problem with a low number of tasks running, but when you scale up to run LHC only, a host with low disk space or low RAM will hit a limit.


Each task can use from 1 to 8 cores according to what is set for "Max # CPUs" in the LHC@Home project preferences. By default this is 1, so if you want to run multi-core tasks you should change this setting. You can also change the number of cores to use and other settings by using an app_config.xml file (recommended for experienced volunteers only).

The events to process will be split among the available cores, so normally each core processes (no. of events in task) / (no. of cores). The processes share memory, which means multicore tasks use less total memory than running the same number of single-core tasks. The memory allocated to the virtual machine is calculated based on the number of cores following the formula: 3GB + 0.9GB * ncores.
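For example, with this formula a 4-core task reserves 3GB + 0.9GB * 4 = 6.6GB of memory for the VM, and an 8-core task reserves 3GB + 0.9GB * 8 = 10.2GB.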

It is recommended to have more than 4GB memory available to run ATLAS tasks. Even single-core tasks are not practical to run with 4GB or less.

Console 3 shows the processes currently running. A healthy WU (after the initialisation phase) should have N athena.py processes using close to 100% CPU, where N is the number of cores. You will sometimes see an extra athena.py process, but this is the "master" process which controls the "child" processes doing the actual simulation.

Read here for Atlas https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4178#29560
Theory is more lightweight and requires only 630MB, but ATLAS and CMS are more demanding.

So looking at your hosts: the EPYC would be perfect and the 1950X (64GB) would do great. But the 2990WX (32GB) would suffer on RAM; that host could not run enough tasks to reach full CPU usage.

The issue you experience with Theory is normal. These tasks can run from a few minutes up to several days and can end with an error, as the VM aborts jobs that can't be finished in a reasonable time. To allow longer runtimes and a higher success rate, the native application with CVMFS is suggested.
CMS is complex and very heavy on the system; its issues are mostly network-related, but there can be other issues too. Follow the forum daily to get info from the project and minimize failures.
Follow the error code in the stderr log and check the forum for whether other users have a solution. In some cases you get more info about an error if you add the VirtualBox Extension Pack and open a console session while the task is running.
You open the session from the BOINC Manager and a terminal pops up. Use Alt+F2 to reach the tty for the session that does the job; you reach top with Alt+F3.

You should be able to install CVMFS on the systems that run 18.04 and lower the need for disk space and RAM on them.

Downloads https://cernvm.cern.ch/portal/filesystem/downloads
Setup https://cernvm.cern.ch/portal/filesystem/quickstart
Setup config https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4758 (Recommended. ATLAS works by default with the config from cern.ch, but works better with the info from this thread.)

Edit: the system also needs squashfs-tools and python. Singularity (the container for ATLAS) and runc (the container for Theory) do not need to be installed separately; Singularity is optional now and installing it on the system is not recommended. A rough install sketch for Ubuntu 18.04 is below.
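The sketch assumes the package URL from the Downloads page above is still current; the repository list and DIRECT proxy written into default.local are only placeholder examples, so use the recommended settings from the config thread linked above instead.

wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb
sudo dpkg -i cvmfs-release-latest_all.deb
sudo apt-get update
sudo apt-get install cvmfs cvmfs-config-default squashfs-tools python
sudo cvmfs_config setup
# placeholder /etc/cvmfs/default.local -- replace with the settings from the thread above
printf 'CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch\nCVMFS_HTTP_PROXY=DIRECT\n' | sudo tee /etc/cvmfs/default.local
sudo cvmfs_config probe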
23) Message boards : News : Theory application reaches 5 TRILLION events !! (Message 42122)
Posted 10 Apr 2020 by Gunde
Post:
Congrats team!

Looks like I have done 130 billion of these events, or 2.6%.
24) Message boards : Number crunching : Notices: Needs More Disk Space (Message 42114)
Posted 9 Apr 2020 by Gunde
Post:
The BOINC client can't fetch more than 100GB of data; it is a hardcoded limit. Above this size it blames low disk space. Keep the setting below 100GB and let the client run down the work that is waiting; it will clear out by itself.

If you have another project that needs a lot of disk space, you could set up a secondary data folder and run another client in parallel (a rough sketch is below). But remember to give each client a 50/50 share of CPU and RAM if you do.
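A rough sketch of how a second client could be started, assuming a Linux host with the standard boinc and boinccmd binaries; the folder and RPC port are just example values:

mkdir -p ~/boinc2                              # separate data directory for the second instance (example path)
boinc --dir ~/boinc2 --allow_multiple_clients --gui_rpc_port 31418 --daemon
boinccmd --host localhost:31418 --get_state    # talk to the second instance
# in each instance, lower "use at most X% of CPUs/RAM" so together they share roughly 50/50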
25) Message boards : Theory Application : 100% computation errors (Message 42068)
Posted 6 Apr 2020 by Gunde
Post:
Ryzen 9

VBoxManage: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)


The Ryzen 7 only got tasks for the native application, and it is not installed on that machine.
26) Message boards : Number crunching : Self-Aborting TAsks (Message 42049)
Posted 5 Apr 2020 by Gunde
Post:
You might need to check that you are using the same account credentials as you use here to post in the forum. Check the email, user ID and CPID for the project.

Your account has no host added yet; the host would be there if it had the same email as you used before.
27) Message boards : Number crunching : Count Computer : The LHC CERN Game (Message 42048)
Posted 5 Apr 2020 by Gunde
Post:
HPC sites and data centers that put up such systems use sources directly from the upstream builds or from what the cern.ch site puts out. They have a strict policy to follow, to ensure nothing unknown can get in or open up their systems or the data they use.

There is a weakness in making mirrors of the site or handing out scripts to users, at the level that has been posted frequently in several threads on this forum. It puts users at risk, and also whoever set it up. There may be nothing malicious in it, but it could be abused in a man-in-the-middle attack against either party. These are untrusted sources from a single owner, not part of the project itself, and the posted script is handed out with packages that are neither needed nor specified in the guide on the cern.ch homepage; HPC has its own section for this.

is.gs redirects to a Gdrive for the .sh. No data center would be allowed to fetch an install .sh from an external script like this, and this script pulls in several packages such as pollinate and haveged and links to an untrusted source, https://entropy.n-helix.com.com. How secure are this SSL certificate and domain, and the security of the Google cloud storage? Is it maintained and backed up well enough?
None of this is included in the setup for SL and CentOS posted from your end, and it is never suggested in the CERN documents: https://cvmfs.readthedocs.io/en/stable/index.html

Helix and RS could be great blogs for posting good info for HPC, but mirroring the project site or handing out scripts and commands can give a false sense of safety to users who put their trust in external sources.
If you would like to support CERN directly, I suggest making direct contact, sharing resources, and following the guidelines and network policy they provide for cache sites or mirrored files.

It is in the interest of all users, the project and external supporters to get correct info and code. CERN has a great internal network, backed up with many proxy servers behind Cloudflare and frontier.cern.ch.

For BOINC there is the main site and GitHub for users to get packages, or they can use trusted package providers.
28) Message boards : Sixtrack Application : 255 (0x000000FF) Unknown error code (Message 42042)
Posted 4 Apr 2020 by Gunde
Post:
Got several SixTrack tasks with this error and message:

<core_client_version>7.16.1</core_client_version>
<![CDATA[
<message>
process exited with code 255 (0xff, -1)</message>
<stderr_txt>
17:15:10 (11125): called boinc_finish(-1)
SIXTRACR> ERROR PROBLEMS RE-READING singletrackfile.dat for ia=30 IOSTAT=-1
          mybinrecs=25020 Expected crbinrecs=836
SIXTRACR> CRCHECK failure positioning binary files

</stderr_txt>
]]>

task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=270617486

A CRC check issue? What could we do about this?
29) Message boards : Number crunching : Count Computer : The LHC CERN Game (Message 42041)
Posted 4 Apr 2020 by Gunde
Post:
What is the point of the increase in hosts? Probably the number of new hosts went up due to a new batch of SixTrack work. What type of game do you refer to?

If you like to share install packages, I suggest linking to the main site or GitHub, and avoiding mirrors or copies of the project site for stats, install packages and scripts.
These scripts contain unneeded packages, and I would prefer the direct source from cernvm.cern.ch or LHC themselves.

I get that you would like to market and promote your site, but it is starting to become spam and is more annoying than helpful to users.

Please stop posting links related to Helix, RS, Gdrive or any script for HPC or unwanted packages.
30) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 41967)
Posted 20 Mar 2020 by Gunde
Post:
And 4 days would be a waste if it did not succeed in that time.

It is a game of users' patience: when do we reach the threshold for keeping them running? It could be extended endlessly, but users would not accept it.
When running the native application, I normally set 7 days no matter what stage it reports. That is an arbitrary number so I don't have to deal with whatever kind of jobs it chooses to run.

Running a script to abort jobs known to be doomed to fail is one way to deal with it; adding a blacklist is another (a rough sketch of the script approach follows below).
Most users would probably abort once a specific runtime is reached, and if it becomes too common they would simply untick the application.
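The sketch assumes boinccmd is installed and that the labels in its --get_tasks output match this pattern on your client version (check on your host first); it aborts Theory tasks whose CPU time has passed a chosen limit:

#!/bin/bash
# abort Theory tasks that have used more than LIMIT seconds of CPU time (example threshold: 4 days)
LIMIT=$((4*24*3600))
URL="https://lhcathome.cern.ch/lhcathome/"
boinccmd --get_tasks | awk -v limit="$LIMIT" '
    /^ *name:/             { name = ($2 ~ /^Theory_/) ? $2 : "" }
    /^ *current CPU time:/ { if (name != "" && $4 + 0 > limit) print name }
' | while read -r task; do
    boinccmd --task "$URL" "$task" abort
done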

My view is that it would be better if these fell under Theory Beta, or were dealt with on a separate project such as LHC dev. They have a purpose for the project, but they give a bad experience of the whole Theory application.
Sherpa is mostly known for this, but a wide range of this type of work has shown up as endless jobs.
Some users would be open to opting in to these jobs, but they should be able to choose when and with which hardware. Most users do not monitor each task or host daily or weekly.
31) Questions and Answers : Unix/Linux : Cvmfs_config probe fails on archlinux (Message 41927)
Posted 16 Mar 2020 by Gunde
Post:
Is autofs added and installed as a package? https://wiki.archlinux.org/index.php/Autofs
Does it need squashfs-tools?

I would like to test on Arch but have no experience with it. Do we have an install guide? I only found the package on the AUR: https://aur.archlinux.org/packages/cvmfs/
32) Message boards : ATLAS application : Low number of queued tasks (Message 41925)
Posted 16 Mar 2020 by Gunde
Post:
I see.
33) Message boards : Number crunching : After windows update BOINC does not see all the CPU's (Message 41590)
Posted 14 Feb 2020 by Gunde
Post:
I would expect the cause to be a change in which applications are running alongside each other, or a change in your settings.
A Windows update to Visual C++ or .NET Framework would not have any effect on, or cause a glitch in, the BOINC client. If there were an issue with the network or with a VM-based setup, it would show up in the stderr log.

It is simplest, and suggested, to run on defaults and only make changes if the host requires it. My bet is that what happened to your BOINC client is a change in settings or in the priority of running tasks, which had the effect of allowing more threads to run than before.

To avoid it happening again, stay with the defaults, or learn what difference each project and application makes when you combine them. The event log provides a lot of info to track it down; otherwise you will soon experience the same issue again. Your issue is similar to several other threads: changes via app_config need knowledge and understanding if you want to use it.
With a combination of several other projects, and a GPU in use as well, the system can be complex to follow. The amount of fetched work and the settings have a big effect, and the client's rules on deadlines and how it prioritizes tasks are not easy to follow.

Good practice is to leave it to the client to handle, but if the host's setup has an abnormal value it will behave accordingly and need some adjustment.
34) Message boards : Number crunching : After windows update BOINC does not see all the CPU's (Message 41568)
Posted 14 Feb 2020 by Gunde
Post:
You could suspend the LHC tasks to see what is holding them up. The details for your host show 16 threads/processors, which should be correct; it is the same as what the BOINC client reports in the event log. A reinstall of the client will not help if it already reports the correct amount.
With the setup as it is now, your host is able to run 2 GPU tasks (if default) and 1 or 2 ATLAS tasks.

2020-02-12 21:57:21 (11156): Setting Memory Size for VM. (14000MB)
2020-02-12 21:57:21 (11156): Setting CPU Count for VM. (4)

So if you have it set to use 100% of the CPUs and 90-100% of the RAM, it should only run 2 GPU tasks and 1 ATLAS. But if you run CMS it could run 2 GPU tasks and 11 CMS tasks.

The good side is that your host got all ATLAS tasks valid except the cancelled ones. What I could see on the host is a pattern caused by the high RAM requirement and the reduced number of tasks running concurrently.

The BOINC client fetches a lot of ATLAS tasks, calculating that they should be able to finish in the time frame before the deadline. But when the host chooses to start an ATLAS task, it is held up by lack of RAM. This could be solved by starting tasks from another application or project, but as the host has plenty of ATLAS tasks it focuses on running these first. So BOINC sits there like a zombie, waiting for RAM, with high priority given to ATLAS.

You can see that the host holds tasks for 3-4 days, and tasks that never started get cancelled when the workunit is validated by another host.

I suggest running app_config with default settings, or deleting app_config completely. If you choose to keep app_config, it would help to reduce the RAM and then use <max_concurrent>N</max_concurrent> to make the BOINC client focus on other applications instead of waiting. Then reduce the maximum amount of fetched work on the host: store a minimum of 0.1 days and at most 0.2 days of additional work. It is better for the host to get a fresh task when it needs one than to hold tasks for 3-4 days without starting them.
Even if the deadline on a task is far away, other hosts are able to finish it, and 1-4 days is probably the typical time frame for these ATLAS tasks. So why does the server send the same task out to others instead of waiting until the deadline is reached?
It is still unknown to me why the server sends out tasks that already have a "wingman" so early, even when the host is actively running the task; this could be brought up in another thread for discussion. As it is now, ATLAS only needs 1 valid task to validate a workunit, and the current setup keeps production up. Some projects require 2 valid tasks, which cuts production in half.
If your task gets cancelled, just ignore it or reduce the amount of work stored on the host. And for planned maintenance on the computer, set the project to "no new tasks" (a quick command-line way is sketched below).
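For the planned-maintenance case this can also be done from the command line; a small sketch, assuming boinccmd is installed:

boinccmd --project https://lhcathome.cern.ch/lhcathome/ nomorework     # stop fetching new tasks
# ...finish running tasks and do the maintenance, then:
boinccmd --project https://lhcathome.cern.ch/lhcathome/ allowmorework  # allow work fetch again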
35) Message boards : Theory Application : Theory Native issues with cgroups [SOLVED] (Message 41494)
Posted 7 Feb 2020 by Gunde
Post:
I have edited the service with
systemctl edit boinc-client.service

And added this line to fstab:
tmpfs  /sys/fs/cgroup  tmpfs  rw,nosuid,nodev,noexec,mode=755  0  0


The cgroup got mounted and boinc-client was added; the freezer is listed, and the log, processes and task IDs are listed in the cgroup files. It looks like that part works (a quick check is sketched below).
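A quick way to check this, assuming the cgroup v1 layout and the /boinc/<slot> naming seen in the logs below (slot 74 is just the number from this run):

mount | grep cgroup                           # the tmpfs and controller hierarchies should be listed
ls /sys/fs/cgroup/freezer/boinc/              # one directory per running native Theory slot
cat /sys/fs/cgroup/freezer/boinc/74/tasks     # PIDs of the processes in that slot's freezer cgroup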

Suspend/Resume
runRivet.log, pythia6 part from around the suspend and resume:
46600 events processed
46700 events processed
46800 events processed
46900 events processed
47000 events processed
dumping histograms...
47100 events processed
47200 events processed
47300 events processed
47400 events processed


Eventlog
19:52:13 (5580): wrapper (7.15.26016): starting
19:52:13 (5580): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.31 ()
19:52:13 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Detected Theory App
19:52:13 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Checking CVMFS.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Checking runc.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Creating the filesystem.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Creating cgroup for slot 74
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Updating config.json.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Running Container 'runc'.
19:52:19 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] ===> [runRivet] Fri Feb  7 18:52:18 UTC 2020 [boinc ppbar jets 1960 140 - pythia6 6.428 393 100000 22]
20:11:12 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Pausing container Theory_2363-878402-22_0.
20:13:26 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Resuming container Theory_2363-878402-22_0.


The application and the wrapper resumed properly; within a short time the task resumed at its last state.

But at shutdown of the boinc service there is a problem: the process is uninterruptible. Starting the boinc service without systemctl starts a second process tree under init that is uninterruptible and never gets going.
Control Group /boinc/74 (blkio,cpu,cpuacct,cpuset,device,freezer,hugetlb,memory,net_cls,net_prio,perf_event.pids),/system.slice/boinc-client.service (,systemd)

At reboot it wiped runRivet.log and the task also started over from 0%.


My setup is probably not able to handle a shutdown, or I set it up wrong. I did a simple test, and it could probably be worked around to get it to save the process state.
36) Message boards : Theory Application : Theory Native issues with cgroups [SOLVED] (Message 41493)
Posted 7 Feb 2020 by Gunde
Post:
Great work investigating this issue, Nethershaw. I appreciate it.

I will jump in and test it. It should indeed be added to the main guide.
37) Message boards : News : Server outage - uploads failing (Message 41346)
Posted 24 Jan 2020 by Gunde
Post:
Yes, for this error it is this WU that got affected. I was on a break at work and short on time, and didn't dig too much into the history of other WUs or into this one. My conclusion was that the server was in bad shape for sending data, as it put out an HTTP error.
It was an example; if one failed there would be several more, and this WU would not be the only one.

I opted out of ATLAS as soon as I saw it and turned it back on when I got home; my hosts download just fine now.
38) Message boards : News : Server outage - uploads failing (Message 41342)
Posted 24 Jan 2020 by Gunde
Post:
Atlas

WU download error: couldn't get input files:
<file_xfer_error>
<file_name>VlLMDmUZ6EwnsSi4apGgGQJmABFKDmABFKDmvjyKDmABFKDmCNuF5m_EVNT.19652175._000455.pool.root.1</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
39) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41283)
Posted 16 Jan 2020 by Gunde
Post:
Thanks for looking up SIF. It looks like SIF is read-only, but it would tell a lot about how Singularity is used for ATLAS.

My root partition filled up after trying to create a new swapfile, and after updates and a reboot it stalled. There was no rescuing my host, so I re-installed the OS and gave root and swap a larger amount with a custom setup instead of the default.
So I just hope that solves it, and then I will deal with all the other hosts.
40) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41282)
Posted 16 Jan 2020 by Gunde
Post:
Yes, I use all 72 cores, but only half of them are used for ATLAS; the rest run SixTrack and other projects. At the highest count I went up to 70 GB on this system, and only had an issue when Rosetta took 4GB for each task and when running yoyo ECM P2.
Right now, with ATLAS running on this host, it is around 50GB of RAM.

I noticed that /dev/cl/root was full; by default it was set to 50 GiB, so right now I will update and restart and see if that clears up some space. If not, I will try to change its size or reinstall the OS.



