1) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 41967)
Posted 8 days ago by Gunde
Post:
And 4 days would be a waste if it did not succeed in that time.

It is a test of users' patience: at what point do we reach the threshold for keeping them running? The limit could be extended indefinitely, but users would not accept that.
When I run the native application I normally set a limit of 7 days, no matter what stage the task reports. It is an arbitrary number, chosen so I do not have to deal with whichever of these jobs the task happens to pick.

Running a script to abort known jobs that are doomed to fail is one way to deal with it; adding a blacklist is another.
Most users would probably abort once a specific run time is reached, and if it became too common they would uncheck the application.
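
A minimal sketch of the "abort after a fixed run time" idea, in Python. The `Task` record and the function name are made up for illustration, and the 7-day default is just the arbitrary limit mentioned above; how the task list is obtained from the client is left out on purpose.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Task:
    """Hypothetical record for one running task (not a BOINC data type)."""
    name: str               # task name as shown by the client
    elapsed_seconds: float  # wall-clock run time so far


def tasks_to_abort(tasks, max_days=7.0):
    """Return the names of tasks that have run longer than max_days."""
    limit = max_days * SECONDS_PER_DAY
    return [t.name for t in tasks if t.elapsed_seconds > limit]


if __name__ == "__main__":
    sample = [
        Task("Theory_example_a", 2 * SECONDS_PER_DAY),
        Task("Theory_example_b", 8 * SECONDS_PER_DAY),
    ]
    # Only the task past the 7-day limit is selected.
    print(tasks_to_abort(sample))
```

The names returned here could then be passed to whatever tool you use to control the client, for example the abort operation of boinccmd. A blacklist would work the same way, just matching on job type instead of run time.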

My view is that it would be better if these jobs went to Theory Beta, or were handled in a separate project such as LHC dev. They serve a purpose for the project, but they give a bad experience of the whole Theory project.
Sherpa is the one mostly known for this, but a whole range of this type of work has shown up as endless jobs.
Some users would be willing to opt in to these jobs if they could choose when and on which hardware to run them. Most users do not monitor each task or host daily, or even weekly.
2) Questions and Answers : Unix/Linux : Cvmfs_config probe fails on archlinux (Message 41927)
Posted 12 days ago by Gunde
Post:
Is autofs included and installed with the package? https://wiki.archlinux.org/index.php/Autofs
Does it need squashtools?

I would like to test on Arch but have no experience with it. Do we have an installation guide? I only found the package in the AUR: https://aur.archlinux.org/packages/cvmfs/
3) Message boards : ATLAS application : Low number of queued tasks (Message 41925)
Posted 12 days ago by Gunde
Post:
i see
4) Message boards : Number crunching : After windows update BOINC does not see all the CPU's (Message 41590)
Posted 14 Feb 2020 by Gunde
Post:
I would expect the cause to be a change in which applications are running, or a change in your settings.
A Windows update to Windows C++ or .NET Framework would not affect boinc-client or cause a glitch in it. If there were an issue with the network or with the VM setup, it would show up in the stderr log.

It is simple, and recommended, to run with defaults and only make changes if the host requires them. My bet is that what you saw in boinc-client came from a change in settings or in the priority of running tasks, and that this affected how many threads it was allowed to run compared with before.

To avoid it happening again, either stay with the defaults or learn how the different projects and applications behave when you combine them. The event log provides a lot of information; use it to track the cause down, or you will soon experience the same issue again. Similar issues appear in several threads: changes to app_config require knowledge and understanding if you want to use it.
With a combination of several other projects, and a GPU in use as well, it can be a complex system to follow. The amount of fetched work and the settings have a big effect, and the rules coded into boinc-client for deadlines and task prioritisation are not easy to follow.

Good practice is to leave it to the client to handle, but if the host setup has an abnormal value the client will behave accordingly and need some adjustment.
5) Message boards : Number crunching : After windows update BOINC does not see all the CPU's (Message 41568)
Posted 14 Feb 2020 by Gunde
Post:
You could suspend the LHC tasks to see what is holding things up. The host details show 16 threads/processors, which should be correct since it matches what boinc-client reports in the event log. Reinstalling the client would not help if it already reports the correct number.
With the setup as it is now, your host can run 2 GPU tasks (if default) and 1 or 2 ATLAS tasks.

2020-02-12 21:57:21 (11156): Setting Memory Size for VM. (14000MB)
2020-02-12 21:57:21 (11156): Setting CPU Count for VM. (4)

So if you have set it to use 100% of the CPUs and 90-100% of RAM, it should only run 2 GPU tasks and 1 ATLAS task. But if you ran CMS, it could run 2 GPU tasks and 11 CMS tasks.
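
The RAM limit can be worked out with simple arithmetic. Below is a rough sketch in Python; the 14000MB per ATLAS VM comes from the log lines above, while the total RAM figures in the example are hypothetical (the host's actual RAM is not quoted in this post) and the function name is made up.

```python
def max_tasks_by_ram(total_ram_mb, ram_fraction, per_task_mb):
    """How many VM tasks fit in the share of RAM the client may use.

    total_ram_mb -- physical RAM of the host in MB
    ram_fraction -- share of RAM the client is allowed to use (0.9 = 90%)
    per_task_mb  -- memory size set for one VM, e.g. 14000 from the log
    """
    usable_mb = int(total_ram_mb * ram_fraction)
    return usable_mb // per_task_mb


if __name__ == "__main__":
    # Hypothetical hosts with 16 GB and 32 GB of RAM, 90% allowed for the client.
    for total in (16384, 32768):
        print(total, "MB ->", max_tasks_by_ram(total, 0.9, 14000), "ATLAS task(s)")
```

This ignores the RAM taken by GPU tasks and by the system itself, so the real number can be lower.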

The good news is that all ATLAS tasks on your host were valid except the cancelled ones. From what I can see on the host, a pattern shows up, caused by the high RAM requirement and the reduced number of tasks running concurrently.

Boinc-client fetches a lot of ATLAS tasks because it calculates that they can be finished before the deadline. But when the host tries to start an ATLAS task, it is held up by lack of RAM. This could be solved by starting tasks from another application or project, but since the host has plenty of ATLAS tasks it focuses on running those first. So BOINC sits there like a zombie, waiting for RAM while giving high priority to ATLAS.

You can see that the host holds tasks for 3-4 days, and tasks that have not started are cancelled once the workunit has been validated by another host.

I suggest running with default settings in app_config, or deleting app_config completely. If you choose to keep app_config, it would help to reduce the RAM setting and then use <max_concurrent>N</max_concurrent> so that boinc-client focuses on other applications instead of waiting. Then reduce the maximum amount of fetched work on the host: a minimum of 0.1 days and up to 0.2 days of stored work. It is better for the host to get a fresh task when it needs one than to hold tasks for 3-4 days without starting them.
Even if the deadline on a task is far away, other hosts are able to finish it, and 1-4 days is probably the time frame for these ATLAS tasks. So why does the server send the same task out to others and not wait until the deadline is reached?
It is still unknown to me why the server sends out a task that already has a "wingman" so early, even when a host is actively running it. This could be brought up in another thread for discussion. As it is now, ATLAS only needs 1 valid task to validate a workunit, and the current setup keeps production up. Some projects require 2 valid tasks, which cuts production in half.
If your tasks get cancelled, just ignore it or reduce the amount of work stored on the host. Also, before planned maintenance on the computer, set the project to no new tasks.
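
For anyone who keeps app_config, here is a small Python sketch that writes out the <max_concurrent> part mentioned above. The application name "ATLAS" and the value 1 are assumptions for illustration; check the real application name in your event log before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text limiting one application to max_concurrent tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "ATLAS" and 1 are example values, not taken from the host discussed here.
    print(build_app_config("ATLAS", 1))
```

The resulting text would go into app_config.xml in the project's directory, after which the client needs to re-read its config files. No RAM setting is written here, in line with letting the application pick what it needs.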
6) Message boards : Theory Application : Theory Native issues with cgroups [SOLVED] (Message 41494)
Posted 7 Feb 2020 by Gunde
Post:
I edited the service with
systemctl edit boinc-client.service

And added this line to fstab
tmpfs  /sys/fs/cgroup  tmpfs  rw,nosuid,nodev,noexec,mode=755  0  0


The cgroup got mounted and boinc-client was added; freezer is also listed, info appears in the log, and the process and task IDs are listed in the cgroup files. It looks like that part works.
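
A quick way to confirm that the freezer controller is mounted is to look through /proc/mounts. The Python sketch below assumes cgroup v1 style entries (filesystem type "cgroup" with the controller name among the mount options); the function name is made up and it is only a check, not part of the setup.

```python
def freezer_mount_points(mounts_text):
    """Return mount points of cgroup v1 hierarchies that include the freezer controller."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype == "cgroup" and "freezer" in options.split(","):
            points.append(mount_point)
    return points


if __name__ == "__main__":
    # Reads the live mount table on Linux; prints an empty list if freezer is not mounted.
    with open("/proc/mounts") as handle:
        print(freezer_mount_points(handle.read()))
```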

Suspend/Resume
Part of the runRivet log (pythia6) during suspend and resume:
46600 events processed
46700 events processed
46800 events processed
46900 events processed
47000 events processed
dumping histograms...
47100 events processed
47200 events processed
47300 events processed
47400 events processed


Eventlog
19:52:13 (5580): wrapper (7.15.26016): starting
19:52:13 (5580): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.31 ()
19:52:13 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Detected Theory App
19:52:13 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Checking CVMFS.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Checking runc.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Creating the filesystem.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Creating cgroup for slot 74
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Updating config.json.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Running Container 'runc'.
19:52:19 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] ===> [runRivet] Fri Feb  7 18:52:18 UTC 2020 [boinc ppbar jets 1960 140 - pythia6 6.428 393 100000 22]
20:11:12 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Pausing container Theory_2363-878402-22_0.
20:13:26 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Resuming container Theory_2363-878402-22_0.


The application and wrapper resumed properly; within a short time the task resumed from its last state.

But at shutdown of the boinc service there is a problem: the process is uninterruptible. Starting the boinc service without systemctl starts a second process tree with init 2 that is uninterruptible and not started.
Control Group /boinc/74 (blkio,cpu,cpuacct,cpuset,device,freezer,hugetlb,memory,net_cls,net_prio,perf_event.pids),/system.slice/boinc-client.service (,systemd)

At reboot, runrivet.log was wiped and the task started again from 0%.


My setup probably cannot handle a shutdown, or I have set it up wrong. I only did a simple test, and there is probably a workaround to get it to save the process state.
7) Message boards : Theory Application : Theory Native issues with cgroups [SOLVED] (Message 41493)
Posted 7 Feb 2020 by Gunde
Post:
Great work investigating this issue, Nethershaw. I appreciate it.

I will jump in and test it. It should indeed be added to the main guide.
8) Message boards : News : Server outage - uploads failing (Message 41346)
Posted 24 Jan 2020 by Gunde
Post:
Yes, this WU was affected by this error. I was on a break at work and short of time, so I did not dig much into the history of other WUs or of this one. My conclusion was that the server was in bad shape for sending data, since it returned an HTTP error.
It was just an example; if one failed there would be several more, so this WU would not be the only one.

I opted out of ATLAS as soon as I saw it and turned it back on when I got home; my hosts download just fine now.
9) Message boards : News : Server outage - uploads failing (Message 41342)
Posted 24 Jan 2020 by Gunde
Post:
Atlas

WU download error: couldn't get input files:
<file_xfer_error>
<file_name>VlLMDmUZ6EwnsSi4apGgGQJmABFKDmABFKDmvjyKDmABFKDmCNuF5m_EVNT.19652175._000455.pool.root.1</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
10) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41283)
Posted 16 Jan 2020 by Gunde
Post:
Thanks for looking up SIF. It looks like the SIF is read-only, but it would tell a lot about Singularity for ATLAS.

My root partition filled up after I tried to create a new swapfile, and after updates and a reboot the host stalled. There was no rescuing the host, so I reinstalled the OS and gave root and swap more space with a custom setup instead of the default.
So I just hope that solves it, and then I will deal with all the other hosts.
11) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41282)
Posted 16 Jan 2020 by Gunde
Post:
Yes, I use all 72 cores, but only half of them are used for ATLAS; the rest run SixTrack and other projects. At the highest I have measured, this system went up to 70 GB, and I only had an issue when Rosetta took 4 GB per task and when running yoyo ECM P2.
Right now, with ATLAS running on this host, RAM use is around 50 GB.

I noticed that /dev/cl/root was full; by default it was set to 50 GiB. So right now I will update and restart and see if that clears up some space. If not, I will try changing its size or reinstall the OS.
12) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41279)
Posted 16 Jan 2020 by Gunde
Post:
Does anyone have a solution for this?

2020-01-16 14:28:22,074: Checking singularity with cmd:/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname
2020-01-16 14:28:23,158: Singularity isnt working: INFO:    Convert SIF file to sandbox...
FATAL:   while extracting /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img: root filesystem extraction failed: failed to copy content in staging file: write /tmp/archive-846395633: no space left on device


Sometimes a host gets into trouble; this has happened before and resolved itself after a while. One host happens to be affected more than the others, but the setup is done the same way on all hosts.
This host uses default values in the config. Would it help to increase the cache? Or is there any other parameter that would help to adjust?

Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=259466119

Device info storage
Cache	46080 KB
Swap space	4 GB
Total disk space	410.56 GB
Free Disk Space	393.62 GB


Another host same issue:
Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=258968040

I could try resizing swap.
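
Since the log above ends in "no space left on device" while writing to /tmp, a quick check of free space on the relevant paths might help before changing swap. This is a minimal Python sketch; the 2 GiB threshold is a guess for illustration, not a documented requirement of Singularity or ATLAS, and the function names are made up.

```python
import shutil


def has_free_space(path, required_bytes):
    """True if the filesystem holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


def low_space_paths(paths, required_bytes):
    """Return the paths whose filesystem has less than required_bytes free."""
    return [p for p in paths if not has_free_space(p, required_bytes)]


if __name__ == "__main__":
    two_gib = 2 * 1024**3  # hypothetical threshold
    # /tmp is where the sandbox extraction failed; / is checked because root can fill up too.
    print(low_space_paths(["/tmp", "/"], two_gib))
```

The "Free Disk Space" figure shown for the device covers the BOINC data partition, so it can look fine while / or /tmp is full.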
13) Message boards : ATLAS application : Another crappy task (Message 41152)
Posted 4 Jan 2020 by Gunde
Post:
If that task had the old low-RAM setting, it was doomed to fail from the start. Get the settings and app_config right, and after that allow new tasks. Only new tasks will be able to validate.

If you want a stable environment, I suggest running on Linux so you can use the native application instead.
14) Message boards : ATLAS application : Another crappy task (Message 41149)
Posted 3 Jan 2020 by Gunde
Post:
Your task has most likely stalled. Avoid other processes like FAH; the load you see is not a correct measurement when several other processes are running. The CPU can easily suffer if the task does not get the cores/threads reserved for it.
This includes I/O on disk and RAM. Starting up a VM needs more than what is set in BOINC, because the work done on start/stop/save is not counted and shows up as system load.

On this specific task BOINC only gets 2241MB, which is far too low: the old ATLAS application required 2600MB for 1 core, and the new application recommends 3000MB for one core. In this case the task struggles to boot and get the script running. You would not see any ATLAS.py process running, as it cannot get enough memory to even start. Those few seconds are probably the attempt to start ATLAS before it stalled.

2019-12-30 16:36:50 (12132): Setting Memory Size for VM. (2241MB)
2019-12-30 16:36:50 (12132): Setting CPU Count for VM. (4)

Each additional core needs somewhere around 800MB-1000MB.
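
As a rough worked example of those numbers, here is a Python sketch that turns them into a memory range per core count. It uses only the figures quoted in this post (3000MB for one core, 800MB-1000MB per additional core); it is a rule of thumb, not the formula the project server uses, and the function name is made up.

```python
def atlas_vm_memory_range_mb(cores, base_mb=3000, per_core_low_mb=800, per_core_high_mb=1000):
    """Return (low, high) estimated VM memory in MB for a task with the given core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    extra_cores = cores - 1  # the base figure already covers the first core
    low = base_mb + extra_cores * per_core_low_mb
    high = base_mb + extra_cores * per_core_high_mb
    return low, high


if __name__ == "__main__":
    low, high = atlas_vm_memory_range_mb(4)
    print(f"4 cores: roughly {low}-{high} MB, versus 2241 MB in the log above")
```

For 4 cores that gives roughly 5400-6000MB, so 2241MB is less than half of the low end. Treat it as a lower bound, since VirtualBox itself may need more.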

If you would like to use app_config, I suggest not including a RAM setting and letting the application pick what it needs. Changes to RAM only apply to newly downloaded tasks; if you update in BOINC Manager, only the core count changes.

VirtualBox has its own issues and errors, and mixed with LHC it can be hard to pin down what the problem is, but LHC produces good logs and the extension can pull out a lot of good info. The suggested 4-core app_config is good, and I have had a better experience running 4 cores than the default 12 on VirtualBox. I got fewer failures (error 195) using app_config.xml. But running on VirtualBox I had to use 8000MB as a minimum for a 4-core task with the new application to keep it somewhat stable. So running the default 12 cores with 11000MB, or whatever the default requires, is probably better for most users, as RAM usage would be lower, with less load on disk, CPU and RAM.

The task process shown in BOINC Manager is just a wrapper fetching info from the VM; any estimated time is based only on the device flops, i.e. what your CPU should be able to do in a given time. If the flops calculation is off target, the estimate can be days off.
Never trust the estimate on the first batch of tasks your BOINC Manager downloads, or right after you make changes to app_config.xml.

If you want an estimate, use the one in the console of each VM task. It gives a much better, though not perfect, picture of time, errors and load.
15) Message boards : ATLAS application : VBOX error: KBD: unsupported int 16h function 03 (Message 41147)
Posted 3 Jan 2020 by Gunde
Post:
Int 16h function 03 is just an attempt to set the typematic rate (repeat rate) of the keyboard

https://forums.virtualbox.org/viewtopic.php?f=6&t=88627

The issue with task 256981269 is:

upload failure: <file_xfer_error>
<file_name>5VuMDmbye5vnsSi4apGgGQJmABFKDmABFKDmC4tPDmABFKDmvVzvyn_0_r968247860_ATLAS_result</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>


The second task, 256978465, failed on memory. Only 2241MB is dedicated to a 4-core task, which looks too low.

VBoxManage.exe: error: Failed to load unit 'pgm' (VERR_EM_NO_MEMORY)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole
2019-12-30 15:32:24 (4348): VM failed to start.
2019-12-30 15:32:24 (4348): Could not start
2019-12-30 15:32:24 (4348): ERROR: VM failed to start
16) Message boards : Theory Application : 1 (0x00000001) error on Theory Simulation v300.02 (vbox64_theory) (Message 41113)
Posted 29 Dec 2019 by Gunde
Post:
2019-12-29 12:52:36 (8476): Guest Log: 12:52:49 CST +08:00 2019-12-29: cranky: [ERROR] Container 'runc' terminated with status code 1.
and
<message>
Incorrect function.
(0x1) - exit code 1 (0x1)</message>
No further info is provided for this error, and I got the same on a few tasks on Linux native. So there is something in runc that fails sometimes. It could be permission issues or libs that did not work properly.

runc is a container process and it failed with an error code. One would need to check the list of runc error codes to know more, if the VM did not post anything in the VM itself. As maeax says, get the Extension Pack for VirtualBox so you can open the VM while it is running.
Some errors are posted there.

If it continues to fail on all tasks, it could be VirtualBox or a BOINC setting. I notice that you have set BOINC to use 70% of CPU time. That makes tasks throttle, and it is bad for a VM to be interrupted while running.
If you want to reduce CPU usage or heat, lower "use at most X% of CPUs" instead, so it uses fewer cores/threads.
17) Message boards : ATLAS application : lcgft-atlas.gridpp.rl.ac.uk down (Message 41086)
Posted 26 Dec 2019 by Gunde
Post:
I sent him a PM about this issue.

In the middle of a retry I got a message suggesting yum-config-manager --disable cernvm or subscription-manager repos --disable=cernvm, but I could not continue after using them. I will try another day.
18) Message boards : ATLAS application : lcgft-atlas.gridpp.rl.ac.uk down (Message 41084)
Posted 26 Dec 2019 by Gunde
Post:
http://cvmrepo.web.cern.ch/ is down
It times out and tries a mirror, but repeats with the same IPv6 address.

Could anyone report this to the CERN IT team?
19) Message boards : Cafe LHC : Happy Xmas LHC@home! (Message 41036)
Posted 22 Dec 2019 by Gunde
Post:
20) Message boards : Number crunching : Only getting Theory app tasks (Message 41032)
Posted 21 Dec 2019 by Gunde
Post:
OK, you have selected to run only 1, so the job cache is not the problem here.

You are probably happy to run any task, but to test whether the host can run the SixTrack application, could you uncheck Theory and check the event log when it updates?
The server could be out of work at the time it updates, or the host may not be getting the application.

You use Darwin 19.2.0, and maybe that is an issue for the application. I have no experience with Mac, but this could help the project admins see whether the host is prevented from getting tasks or fails to run them.

