41) Message boards : Sixtrack Application : EXIT_DISK_LIMIT_EXCEEDED (Message 43337)
Posted 11 Sep 2020 by Greger
Post:
It looks like they re-issue work units that got too many failures.

It is hard to locate the issue, as some hosts can complete some tasks but fail on others. It appears on every application and system, and it looks sporadic. It is something deep down in the work set that creates an issue at some point, and when it does, the disk usage grows fast.
It looks like the task needs to run all the way to the end to succeed and not fall into this trap.

To avoid it: the beta run may not have been investigated thoroughly enough for this work set on the application, and the rate of failures may not have been detected in time.

It would need to be corrected in the batch, or a temporary workaround found, such as increasing the disk limit <rsc_disk_bound>, which may not be great as it could require an insane amount of disk space to succeed. Somewhere in the computation it goes wrong, and that requires deeper investigation; the watchdog or debug output may not help to solve it.
42) Message boards : ATLAS application : Confused (Message 43307)
Posted 4 Sep 2020 by Greger
Post:
I would like to point out that "Cancelled by server" is a common factor on many projects, and it only affects tasks that have not started before enough valid tasks are done to complete the workunit's requirement of total valid results.
This is a way to reduce wasted computation that is not needed, and your client will be happy to fetch other tasks.
This happens every day to my hosts on other projects.

This happens when some hosts have a short "Average turnaround time" while other hosts have a long one. If you have set network activity to "always" and have decent bandwidth, try to reduce your work buffer. To do that, lower the setting at Computing -> "Store up to an additional X days of work". I would suggest at most 1 day. Your host reports around 3 days, so many tasks will be cancelled as they are not needed, which wastes bandwidth and storage on your host.
If your client handles several projects, you could go even lower, as the other projects will act as backup if a server is down or out of work.
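For those who prefer a file over the manager/website settings, the same buffer can be sketched in a global_prefs_override.xml in the BOINC data directory (these are standard BOINC preference tags; the values here are just the 1-day example from above):

```xml
<global_preferences>
  <!-- keep at least this many days of work queued -->
  <work_buf_min_days>0.1</work_buf_min_days>
  <!-- fetch up to this many additional days of work -->
  <work_buf_additional_days>1.0</work_buf_additional_days>
</global_preferences>
```

After saving it, tell the client to re-read its preferences (or restart it) so the new buffer takes effect.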
43) Message boards : Sixtrack Application : EXIT_DISK_LIMIT_EXCEEDED (Message 43298)
Posted 29 Aug 2020 by Greger
Post:
Another one, with a different application (darwin):

Exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED


<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded</message>
<stderr_txt>

Crashed executable name: sixtrack_darwin_50205_avx.exe
Machine type Intel x86-64h Haswell (64-bit executable)
System version: Macintosh OS 10.15.6 build 19G2021
Thu Aug 27 18:21:37 2020

atos cannot load symbols for the file sixtrack_darwin_50205_avx.exe for architecture x86_64.
0   sixtrack_darwin_50205_avx.exe       0x0000000108adb1fe  

Thread 1 crashed with X86 Thread State (64-bit):
  rax: 0x0100001f  rbx: 0x00000003  rcx: 0x700000b93738  rdx: 0x00000028
  rdi: 0x700000b937a8  rsi: 0x00000003  rbp: 0x700000b93790  rsp: 0x700000b93738
   r8: 0x00001003   r9: 0x00000000  r10: 0x000009c8  r11: 0x00000206
  r12: 0x00000003  r13: 0x000009c8  r14: 0x700000b937a8  r15: 0x00000028
  rip: 0x7fff673f9dfa  rfl: 0x00000206

Binary Images Description:
       0x1087db000 -        0x108bd0fff /Library/Application Support/BOINC Data/slots/1/../../projects/lhcathome.cern.ch_lhcathome/sixtrack_darwin_50205_avx.exe
    0x7fff6429d000 -     0x7fff6429efff /usr/lib/libSystem.B.dylib
    0x7fff64583000 -     0x7fff645d5fff /usr/lib/libc++.1.dylib
    0x7fff645d6000 -     0x7fff645ebfff /usr/lib/libc++abi.dylib
    0x7fff660fd000 -     0x7fff66130fff /usr/lib/libobjc.A.dylib
    0x7fff6709a000 -     0x7fff6709ffff /usr/lib/system/libcache.dylib
    0x7fff670a0000 -     0x7fff670abfff /usr/lib/system/libcommonCrypto.dylib
    0x7fff670ac000 -     0x7fff670b3fff /usr/lib/system/libcompiler_rt.dylib
    0x7fff670b4000 -     0x7fff670bdfff /usr/lib/system/libcopyfile.dylib
    0x7fff670be000 -     0x7fff67150fff /usr/lib/system/libcorecrypto.dylib
    0x7fff6725d000 -     0x7fff6729dfff /usr/lib/system/libdispatch.dylib
    0x7fff6729e000 -     0x7fff672d4fff /usr/lib/system/libdyld.dylib
    0x7fff672d5000 -     0x7fff672d5fff /usr/lib/system/libkeymgr.dylib
    0x7fff672e3000 -     0x7fff672e3fff /usr/lib/system/liblaunch.dylib
    0x7fff672e4000 -     0x7fff672e9fff /usr/lib/system/libmacho.dylib
    0x7fff672ea000 -     0x7fff672ecfff /usr/lib/system/libquarantine.dylib
    0x7fff672ed000 -     0x7fff672eefff /usr/lib/system/libremovefile.dylib
    0x7fff672ef000 -     0x7fff67306fff /usr/lib/system/libsystem_asl.dylib
    0x7fff67307000 -     0x7fff67307fff /usr/lib/system/libsystem_blocks.dylib
    0x7fff67308000 -     0x7fff6738ffff /usr/lib/system/libsystem_c.dylib
    0x7fff67390000 -     0x7fff67393fff /usr/lib/system/libsystem_configuration.dylib
    0x7fff67394000 -     0x7fff67397fff /usr/lib/system/libsystem_coreservices.dylib
    0x7fff67398000 -     0x7fff673a0fff /usr/lib/system/libsystem_darwin.dylib
    0x7fff673a1000 -     0x7fff673a8fff /usr/lib/system/libsystem_dnssd.dylib
    0x7fff673a9000 -     0x7fff673aafff /usr/lib/system/libsystem_featureflags.dylib
    0x7fff673ab000 -     0x7fff673f8fff /usr/lib/system/libsystem_info.dylib
    0x7fff673f9000 -     0x7fff67425fff /usr/lib/system/libsystem_kernel.dylib
    0x7fff67426000 -     0x7fff6746dfff /usr/lib/system/libsystem_m.dylib
    0x7fff6746e000 -     0x7fff67495fff /usr/lib/system/libsystem_malloc.dylib
    0x7fff67496000 -     0x7fff674a3fff /usr/lib/system/libsystem_networkextension.dylib
    0x7fff674a4000 -     0x7fff674adfff /usr/lib/system/libsystem_notify.dylib
    0x7fff674ae000 -     0x7fff674b6fff /usr/lib/system/libsystem_platform.dylib
    0x7fff674b7000 -     0x7fff674c1fff /usr/lib/system/libsystem_pthread.dylib
    0x7fff674c2000 -     0x7fff674c6fff /usr/lib/system/libsystem_sandbox.dylib
    0x7fff674c7000 -     0x7fff674c9fff /usr/lib/system/libsystem_secinit.dylib
    0x7fff674ca000 -     0x7fff674d1fff /usr/lib/system/libsystem_symptoms.dylib
    0x7fff674d2000 -     0x7fff674e8fff /usr/lib/system/libsystem_trace.dylib
    0x7fff674ea000 -     0x7fff674effff /usr/lib/system/libunwind.dylib
    0x7fff674f0000 -     0x7fff67525fff /usr/lib/system/libxpc.dylib


</stderr_txt>
]]>


Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=281834447
44) Message boards : Sixtrack Application : EXIT_DISK_LIMIT_EXCEEDED (Message 43296)
Posted 28 Aug 2020 by Greger
Post:
So if the file could be split or rotated, it could work?

I found a debug log for one of these tasks, if that could help to analyse it:

<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x00007ffcd88c9212

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 7.14.2


Dump Timestamp    : 08/28/20 10:14:30
Install Directory : D:\
Data Directory    : D:\ProgramData
Project Symstore  : 
LoadLibraryA( D:\ProgramData\dbghelp.dll ): GetLastError = 126
Loaded Library    : dbghelp.dll
LoadLibraryA( D:\ProgramData\symsrv.dll ): GetLastError = 126
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( D:\ProgramData\srcsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
LoadLibraryA( D:\ProgramData\version.dll ): GetLastError = 126
Loaded Library    : version.dll
Debugger Engine   : 4.0.5.0
Symbol Search Path: D:\ProgramData\slots\42;D:\ProgramData\projects\lhcathome.cern.ch_lhcathome


ModLoad: 0000000000400000 0000000002b73000 D:\ProgramData\projects\lhcathome.cern.ch_lhcathome\sixtrack_win64_50205_avx.exe (-nosymbols- Symbols Loaded)

ModLoad: 00000000db090000 00000000001f5000 C:\Windows\SYSTEM32\ntdll.dll (10.0.19041.423) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000d9440000 00000000000bd000 C:\Windows\System32\KERNEL32.DLL (10.0.19041.292) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.329 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.329

ModLoad: 00000000d8800000 00000000002c7000 C:\Windows\System32\KERNELBASE.dll (10.0.19041.423) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.388 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.388

ModLoad: 00000000daee0000 00000000000aa000 C:\Windows\System32\ADVAPI32.dll (10.0.19041.1) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000d9990000 000000000009e000 C:\Windows\System32\msvcrt.dll (7.0.19041.1) (-exported- Symbols Loaded)
    File Version          : 7.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 7.0.19041.1

ModLoad: 00000000d9a30000 000000000009b000 C:\Windows\System32\sechost.dll (10.0.19041.388) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000d9ae0000 0000000000123000 C:\Windows\System32\RPCRT4.dll (10.0.19041.1) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000d9e70000 0000000000740000 C:\Windows\System32\SHELL32.dll (10.0.19041.423) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.329 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.329

ModLoad: 00000000d8ed0000 000000000009d000 C:\Windows\System32\msvcp_win.dll (10.0.19041.423) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.423 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.423

ModLoad: 00000000d8ad0000 0000000000100000 C:\Windows\System32\ucrtbase.dll (10.0.19041.423) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.423 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.423

ModLoad: 00000000d90d0000 00000000001a0000 C:\Windows\System32\USER32.dll (10.0.19041.388) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000d8d30000 0000000000022000 C:\Windows\System32\win32u.dll (10.0.19041.450) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.450 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.450

ModLoad: 00000000da5b0000 000000000002a000 C:\Windows\System32\GDI32.dll (10.0.19041.1) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000d8d60000 000000000010a000 C:\Windows\System32\gdi32full.dll (10.0.19041.388) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.388 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.388

ModLoad: 00000000d92d0000 0000000000030000 C:\Windows\System32\IMM32.DLL (10.0.19041.1) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000bfa30000 00000000001e4000 C:\Windows\SYSTEM32\dbghelp.dll (10.0.19041.1) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000cffe0000 000000000000a000 C:\Windows\SYSTEM32\version.dll (10.0.19041.1) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000d8c80000 000000000007f000 C:\Windows\System32\bcryptPrimitives.dll (10.0.19041.264) (-exported- Symbols Loaded)
    File Version          : 10.0.19041.264 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft&#174; Windows&#174; Operating System
    Product Version       : 10.0.19041.264



*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 31162, Write: 113555, Other 1647

- I/O Transfers Counters -
Read: 254908115, Write: 20975491, Other 92440

- Paged Pool Usage -
QuotaPagedPoolUsage: 8560, QuotaPeakPagedPoolUsage: 190480
QuotaNonPagedPoolUsage: 70242304, QuotaPeakNonPagedPoolUsage: 8016

- Virtual Memory Usage -
VirtualSize: 41771, PeakVirtualSize: 191438848

- Pagefile Usage -
PagefileUsage: 71057408, PeakPagefileUsage: 70242304

- Working Set Size -
WorkingSetSize: 190640, PeakWorkingSetSize: 40697856, PageFaultCount: 40701952

*** Dump of thread ID 1964 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 156250.000000, User Time: 0.000000, Wait Time: 5488261.000000

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x00007ffcd88c9212

- Registers -
rax=0000000000000000 rbx=000000000570eea0 rcx=000000000087466b rdx=000000000087466a rsi=00000000d94649f0 rdi=0000000000000001
r8=000000000570eea0 r9=0000000000000001 r10=0000000000874662 r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000 rip=00000000d88c9212 rsp=000000000570e6c8 rbp=0000000000000000
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

- Callstack -
ChildEBP RetAddr  Args to Child
0570e6c0 006e3fa4 0570eea0 0087466a 0570eea0 04e9a83e KERNELBASE!DebugBreak+0x0 
0570fef0 006e415c 00000000 00000000 00000000 00000000 sixtrack_win64_50205_avx!+0x0 
0570ff20 d9456fd4 00000000 00000000 00000000 00000000 sixtrack_win64_50205_avx!+0x0 
0570ff50 db0dcec1 00000000 00000000 00000000 00000000 KERNEL32!BaseThreadInitThunk+0x0 
0570ffd0 00000000 00000000 00000000 00000000 00000000 ntdll!RtlUserThreadStart+0x0 

*** Dump of thread ID 14784 (state: Running): ***

- Information -
Status: Base Priority: Above Normal, Priority: Above Normal, , Kernel Time: 8281250.000000, User Time: 2802500096.000000, Wait Time: 5488261.000000

- Registers -
rax=0000000000000000 rbx=0000000000000032 rcx=0000000000000000 rdx=0000000000000000 rsi=0000000000000000 rdi=000000000000002c
r8=0000000000000000 r9=000000000586e608 r10=000000000586e818 r11=0000000000000000 r12=000000000000002a r13=00000000058704cc
r14=0000000000000033 r15=000000000000002d rip=00000000004132e3 rsp=000000000317ea30 rbp=000000000129b550
cs=0033  ss=002b  ds=0000  es=0000  fs=0000  gs=0000             efl=00000246

- Callstack -
ChildEBP RetAddr  Args to Child
0317f1d0 00448abc 00000065 05ad48c0 0084a208 00000000 sixtrack_win64_50205_avx!+0x0 
0317f540 0040b59c 00848546 00bdc360 01010101 0000000e sixtrack_win64_50205_avx!+0x0 
0317fde0 0083ce5a 00000001 001713b0 001713d0 00000042 sixtrack_win64_50205_avx!+0x0 
0317fe20 004013a5 00000000 00000042 02a9b890 00000000 sixtrack_win64_50205_avx!+0x0 
0317fef0 0040150b 00000000 00000000 00000000 00000000 sixtrack_win64_50205_avx!+0x0 
0317ff20 d9456fd4 00000000 00000000 00000000 00000000 sixtrack_win64_50205_avx!+0x0 
0317ff50 db0dcec1 00000000 00000000 00000000 00000000 KERNEL32!BaseThreadInitThunk+0x0 
0317ffd0 00000000 00000000 00000000 00000000 00000000 ntdll!RtlUserThreadStart+0x0 


*** Debug Message Dump ****


*** Foreground Window Data ***
    Window Name      : 
    Window Class     : 
    Window Process ID: 0
    Window Thread ID : 0

Exiting...

</stderr_txt>
]]>


Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=281959449

Do all applications for the different operating systems and instruction sets have the same set of parameters at init? Just thinking they could get different settings; some hosts have had no failures while others failed on longer runs.
45) Message boards : Sixtrack Application : EXIT_DISK_LIMIT_EXCEEDED (Message 43287)
Posted 27 Aug 2020 by Greger
Post:
Checked the slot folder; it shows a disk bound of 200 MB, and about 1 hour in, the running task's folder total is at around 190 MB.

<rsc_disk_bound>200000000.000000</rsc_disk_bound>


I can not see any .xml file for sixtrack, so I can not increase that value for new tasks.

A few minutes later....

It reached death at the 205.2 MB mark.
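To watch this without waiting for the crash, here is a small sketch that pulls the bound out of the slot's XML (the tag is the one shown above; reading it from init_data.xml in the slot folder is an assumption about where your client exposes it):

```python
import re

def parse_disk_bound(xml_text: str) -> float:
    """Extract the <rsc_disk_bound> value (in bytes) from BOINC slot XML text."""
    m = re.search(r"<rsc_disk_bound>([\d.eE+-]+)</rsc_disk_bound>", xml_text)
    if m is None:
        raise ValueError("no <rsc_disk_bound> tag found")
    return float(m.group(1))

# Example with the value seen in this slot:
sample = "<rsc_disk_bound>200000000.000000</rsc_disk_bound>"
print(parse_disk_bound(sample) / 1e6, "MB")  # -> 200.0 MB
```

Comparing that number against `du -s` of the slot folder every few minutes would show how close a task is to the limit. A workaround sometimes suggested is to stop the client and raise the value in client_state.xml, but new tasks will arrive with the server-side limit again, so it only helps already-downloaded work.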
46) Message boards : Sixtrack Application : EXIT_DISK_LIMIT_EXCEEDED (Message 43286)
Posted 27 Aug 2020 by Greger
Post:
This issue has come up again with the current batch of sixtrack.

Example https://lhcathome.cern.ch/lhcathome/result.php?resultid=281854540

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded</message>
<stderr_txt>

</stderr_txt>
]]>


If I remember correctly, are these the lines in cc_config.xml that allow bigger files?

<max_stderr_file_size></max_stderr_file_size>
<max_stdout_file_size></max_stdout_file_size>


Edit: Added them with a high value, re-read the config files and restarted the client, but the tasks keep erroring out at the same time/size, so I am probably wrong.

Could someone verify the issue, and whether it occurs on more hosts than mine? If this issue remains, could we abort these work units?
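For reference, a sketch of where those tags go, inside the <options> section of cc_config.xml in the BOINC data directory (the values are example byte counts). As far as I know, these options only cap the client's own stdout/stderr log files, not a task's disk usage, which would explain why changing them has no effect on the <rsc_disk_bound> limit:

```xml
<cc_config>
  <options>
    <max_stdout_file_size>10000000</max_stdout_file_size>
    <max_stderr_file_size>10000000</max_stderr_file_size>
  </options>
</cc_config>
```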
47) Message boards : ATLAS application : Processor Time Locks Up Elapsed Time Continues to Climb (Message 43181)
Posted 4 Aug 2020 by Greger
Post:
Do you have the Extension Pack added to VirtualBox?
If you do, click on the task in the BOINC manager, then in the left bar look for "Show VM Console". It could be greyed out; if so, no session is open and the task is stuck.

But if it is clickable, a terminal should open and a login prompt should show up. If any critical error occurred, it is mostly posted in there; if not, hit Alt+F2 to view the job screen. That is the best view to see if it is running any 'events', and issues would appear there if any. You can also get top (a system monitor) using Alt+F3.

If a task failed, the info will show in stderr, but in some cases you can get more info from the console.

In this thread you can see how it looks inside the console: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=29359#29359
(Yeti's checklist)

For issues regarding Theory, you will find the common errors in the Theory section at https://lhcathome.cern.ch/lhcathome/forum_forum.php?id=89

We would need a screenshot from the console, or the stderr log, to find the issue.
48) Message boards : ATLAS application : Processor Time Locks Up Elapsed Time Continues to Climb (Message 43174)
Posted 3 Aug 2020 by Greger
Post:
They can run for days....

Just check that the CPU time stays close to the run time.
49) Message boards : Theory Application : 300.06 Theory Simulation (native_theory) 0.385C (Message 43165)
Posted 1 Aug 2020 by Greger
Post:
BOINC probably gets a wrong value from the config for Theory, and a new wrapper could start if several tasks are started with a value lower than a full core.
I have experienced before that when Theory had low CPU usage, it started a new task. It happened a few times when Theory was a multi-threaded task, but I have not seen that value in the BOINC manager before. It could possibly be able to continue, but it could break in the long run.

A restart of the BOINC service could change the state, but when running native that is not a good idea.

A task starts as a single pid and adds more processes until it fills the core; there could be several processes pending or stalled, or the python script could be busy. You could monitor the running task, check pstree, or possibly find something in runRivet.log.

Could probably be both.
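As a rough alternative to pstree for checking whether the wrapper has actually spawned its worker processes, here is a sketch that scans /proc directly (Linux only; the pid you pass in would be the wrapper's pid from the slot, which you have to look up yourself):

```python
import os

def child_pids(pid: int) -> list[int]:
    """Return the direct children of `pid` by scanning /proc (Linux only)."""
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat: after the ")" closing the comm name come state, ppid, ...
        ppid = int(stat.rsplit(")", 1)[1].split()[1])
        if ppid == pid:
            children.append(int(entry))
    return children

print(child_pids(os.getpid()))
```

If the worker list stays empty while the elapsed time keeps climbing, that would point at a pending or stalled process rather than a busy one.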
50) Message boards : Number crunching : VM Applications Errors (Message 43164)
Posted 1 Aug 2020 by Greger
Post:
As a personal view, I am against this "Sprint", specifically this "3 days event": it causes more trouble than a project would gain from it. I made a post at Cosmology about it, and other projects that I contribute to, such as TN-Grid, were hit hard by this short "flow" of hosts. There are several negative aspects that I see, and in my view it becomes brute force, with the same effect as a DOS attack on the web server and the BOINC servers. It is no longer used as a way for project admins to test their network and BOINC servers. It has become a place for volunteers who are not satisfied with contributing in the long term, and who use the power of a large crowd to hit a project to prove something that has already been proven.

We have seen it several times over several years, and it continues over and over again for the same projects without any benefit: website and BOINC servers unresponsive for users, and a task flow that disappears and is dropped a few days later. This is a pain for the project admins to deal with: they put a lot of time into generating new work units, purging data from the db, and changing parameters in the feeder/scheduler to limit the flow to hosts; and they increase the minimum quorum or the size of the work units, like TN-Grid does after each event, and it becomes a habit each year.

Users posting on the site, forum or chat are not aware of this event, and the default answer has been "It's Formula Sprint"; most of the time they reply with things like "oh ok, I'll try another project", or post something suggesting they will drop it and do something else with the computer.
If you would like to be a part of the "Sprint", it would probably be best to go for the sixtrack application only, but I would like to encourage users NOT to participate in this manner. I know most users would like to take control, moving from one project to another and playing with these computers, as they have put a lot of time and money into them. You don't need to make it a race, hunting the highest credit to get to the top of the scoreboard. Be wise and stay at a project, to be able to make your computers work more efficiently at LHC. You can do it; you just need to take the time doing it.
You could do better, and the profit would be the experience of new things. My credit score on stats sites or team ranks does not give me anything in the end, but the knowledge around LHC has great value for me. I learned several distributions of Linux, and applications like VirtualBox, CernVM-FS, runc, Singularity and Squid. Basic bash skills help me today with simple stuff at home. Such applications, and networking for a proxy, were completely unknown to me. There is no way I would have taken the time, and gotten help using them, without LHC and the community here pushing me.

There is a huge need for support on every project forum; they need your experience with BOINC and info about the project. Help them and build up a healthy community around it. I would for sure try to help others if I can.

There are two people in this thread who do great support, and the knowledge they build up they share frequently on the forum, almost daily. They are doing great work, and that is how you move a project forward: by informing users, making their contribution better, and keeping users who, in their turn, help the project in another way.

Example 1: 'Jim1348' is the guy who made me try the native application, as he made a great post with the commands to install it. It was a great push to take that step.
Example 2: 'computezrmle' posted a config for a Squid proxy. That was not requested, but he put great time and effort into making it, with his experience in networking; setting up a connection to an additional cache proxy could not have been done this easily without his work. And the support he gave by PM, pinning down network issues, with suggestions that saved me much time and improved my contribution to the project. He is for sure the guy to listen to when there are issues with tasks. His position is well earned.

Without them, I would not be here doing native tasks for the project, or running Squid to improve the network. Contributions like this have a huge effect on users.

Or contribute to testing applications, or improve them if you can. Or go to GitHub and point out issues, and pin down annoying things on the BOINC site or bad settings on BOINC servers.
Or just be passive, enjoy the experience you have and your addiction to BOINC, chill out and share pictures of your rigs. Anything done actively is good for the project, or be passive, enjoy your time as a volunteer and build up experience about it. If a project reaches out to users, listen and be a part of helping them, and if they have some requirement, try to be a part of improving it.

I am in the same situation as you other volunteers. I have computers in every corner of the house; they take up space and generate excessive heat in every room, 24/7. I hit 40 Celsius inside some rooms while it is 30 outside, like today's heat, and I pay high bills to keep them alive. There are good reads on the forums and on Discord from people who really are addicted to BOINC, have spent many years on it since BOINC started, and have made a great journey. Many of them take the time to post daily in the forum to help others, or post info to the project admins to help them. I look up to people like MAGIC Quantum Mechanic, who has been here since 2004 and never complained, even though he sits with limited data and low speed, but still damn well contributes to the project when he can. Not only that, but a journey through Atlas-dev and vLHC as well.
If there are always opportunities for improvement, and perseverance and commitment, it will get better one day. It is important to convey wishes for this in a good way. If you can, it is good to participate, to test and share info to solve the problems in a good way.

My suggestion is to pick a project that fits you and stick to it until you find anything better. Like when Covid-19 hit us: it was great to see dedicated volunteers move to those projects when it was needed.

Which way you take is up to you, and both kinds of volunteers will be here. I would encourage users towards a healthy journey: dig into projects and find a way to contribute, or test, or debug issues if you can. Set up your computers so they can handle task after task, and monitor them. As soon as you find an issue, get on it right away, instead of putting 100 computers straight onto a project and hoping for the best. It is near impossible to help users with computers run that way.

For LHC there are several errors that are network related and may not be known to users or project admins from stderr or the other logs in the VM. These issues could be in the network on the host side or the server side.

Yes, you need an additional application, VirtualBox, to handle all the subprojects here except sixtrack. But for those who use Linux, which several here do, native is a dream when you get it running. This is built for production purposes and made to run for long sessions. BOINC is tiny compared to the network around CERN, and that could explain why they do not have the time to put in support for every single user who would like to "try" a few tasks. Users experience errors and leave. Even a simple step like enabling virtualization in the BIOS could make some users drop it, as it is simpler to move to another project than to do these tasks. That is probably what happens for some, but not all.

There are some issues I experience that I can not deal with without help from the project admins, or the CERN devs handling it, so in the meantime I will wait it out. Most people are on vacation and no sixtrack will be made, so VirtualBox is easy to set up now, almost a one-click thing. It is what it is, and I will do what I can for the project.

Enjoy your vacation.
51) Message boards : ATLAS application : Processor Time Locks Up Elapsed Time Continues to Climb (Message 43123)
Posted 29 Jul 2020 by Greger
Post:
There are a few download errors on tasks:
WU download error: couldn't get input files


And a valid task shows:
2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!

2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!

2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!

.................

2020-07-28 08:57:01 (12684): Guest Log: No HITS file was produced


Could be a network issue.
52) Questions and Answers : Windows : Virtualization capability read incorrectly (Message 43108)
Posted 26 Jul 2020 by Greger
Post:
Virtualization Virtualbox (6.1.12) installed, CPU does not have hardware virtualization support


This for both LHC and Cosmology. Looking on latest task at LHC:

VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) (VERR_NEM_NOT_AVAILABLE).
VBoxManage.exe: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole


It looks to be enabled in the BIOS, but the system blocks it, and VirtualBox tried Hyper-V, which should be off, so it did not pass.
You have an Enterprise x64 Edition licence, and I would think it is locked down for the user. If this is a company computer, I would avoid using VirtualBox or any virtual session on it.

You could open Task Manager and go to the CPU tab; it should say whether virtualization is enabled or not.
Then try to start a VM session in VirtualBox and see whether it fails or not.
53) Message boards : CMS Application : CMS Tasks Failing (Message 43089)
Posted 19 Jul 2020 by Greger
Post:
All my CMS tasks for this afternoon have started to fail with error 206 (0x000000CE) EXIT_INIT_FAILURE. Here's one example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279343405


Same for me, and most likely others too. Proxy issue; it states that it could not read the pem file.

2020-07-19 19:27:47 (16908): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0

2020-07-19 19:27:47 (16908): Guest Log:        globus_credential: Error reading proxy credential

2020-07-19 19:27:47 (16908): Guest Log: globus_credential: Error reading proxy credential: Couldn't read PEM from bio

2020-07-19 19:27:47 (16908): Guest Log: OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
54) Message boards : Number crunching : Misconfigured Cloudflare Router (Message 43088)
Posted 19 Jul 2020 by Greger
Post:
http://cernvm.cern.ch/ is also down today
55) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 43080)
Posted 16 Jul 2020 by Greger
Post:
Theory does not need Singularity; it uses runc instead, which is added in the same way. Maybe ATLAS could be coded to use runc instead of Singularity?

It is still the same error for both of us.... This issue with Singularity not getting the correct permissions needs a check.

I have followed the guide regarding the mount issue on CentOS 8.2 2004 from https://cernvm.cern.ch/portal/filesystem/debugmount with no success. Have shut down SELinux, and same there.
I run in VirtualBox for now and will make a few tests with CernVM, but for now I have no clue what I can do. My options are to go to an older distro, or get on 20.04 and build for the latest versions.
56) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 43070)
Posted 16 Jul 2020 by Greger
Post:
Update: I did a fresh install on a new 20.04 system and got the same issue as before: no permission to /var. I am sure that the Singularity image is part of this, as installing Singularity separately corrected the permissions on my main system.

For further info, I experienced the mount issue in both Ubuntu and CentOS with CVMFS 2.7.3.0 and 2.7.3.0-1. These appear to start fine with a basic repo config, but after adding an additional repo and a proxy it becomes non-responsive and CVMFS stalls on any command.
The application responds fine to the help command, but on any execution command, or a restart of the service, it stalls or even locks up until it gives up.

In the coming days I will proceed with an older system to check. In my view it is a combination of issues: the main Singularity problem appears only on Ubuntu, not on CentOS, possibly because CentOS uses the same permissions as the system the image is built for. Testing with a mix of systems and versions made it harder to isolate, but in general I have found CVMFS 2.7.3 up to 2.8.0 unstable on later systems. On top of this, CVMFS tends to stall when run as root after a restart/probe, which makes it harder still. A forced reboot to unmount was my temporary solution during testing.

Ubuntu lacks some libs, and CentOS needs squashfs-tools and python. CVMFS usually reports its requirements and is good at reporting issues, but this time a fresh OS got permission issues, and the stalls make it near impossible to debug.
I need to look more at the network, the repo and proxy config, and the kernel, as any of these could break CVMFS.
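For the record, the client config I am testing against is minimal; a sketch of /etc/cvmfs/default.local matching the DIRECT-proxy setup in my task logs (the proxy value and quota are the parts to adjust for your own network):

```shell
# /etc/cvmfs/default.local - minimal client config matching my setup
# (task logs show HOST = the RAL stratum 1 and PROXY = DIRECT).
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch
CVMFS_HTTP_PROXY=DIRECT      # replace with a local squid, e.g. "http://squid:3128"
CVMFS_QUOTA_LIMIT=4096       # cache size in MB; logs show CACHEMAX around 4 GB
```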

Fresh 20.04 OS with only the client installed, using the Singularity image. Strangely it reports 2.7.3.0 instead of 2.7.3.0-1, or the default 2.7.3.1 that is out now.
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
03:33:11 (24630): wrapper (7.7.26015): starting
03:33:11 (24630): wrapper: running run_atlas (--nthreads 9)
[2020-07-15 03:33:11] Arguments: --nthreads 9
[2020-07-15 03:33:11] Threads: 9
[2020-07-15 03:33:11] Checking for CVMFS
[2020-07-15 03:33:14] Probing /cvmfs/atlas.cern.ch... OK
[2020-07-15 03:33:16] Probing /cvmfs/atlas-condb.cern.ch... OK
[2020-07-15 03:33:22] Probing /cvmfs/grid.cern.ch... OK
[2020-07-15 03:33:26] VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
[2020-07-15 03:33:26] 2.7.3.0 24790 0 23984 66905 3 1 101657 4096001 0 65024 0 0 n/a 17158 6248 http://cernvmfs.gridpp.rl.ac.uk/cvmfs/atlas.cern.ch DIRECT 1
[2020-07-15 03:33:26] CVMFS is ok
[2020-07-15 03:33:26] Using singularity image /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
[2020-07-15 03:33:26] Checking for singularity binary...
[2020-07-15 03:33:26] Singularity is not installed, using version from CVMFS
[2020-07-15 03:33:26] Checking singularity works with /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname
[2020-07-15 03:36:33] INFO:  Convert SIF file to sandbox... ripper3-System INFO:  Cleaning up image...
[2020-07-15 03:36:33] Singularity works
[2020-07-15 03:36:33] Set ATHENA_PROC_NUMBER=9
[2020-07-15 03:36:33] Starting ATLAS job with PandaID=4786077036
[2020-07-15 03:36:33] Running command: /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/1 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img sh start_atlas.sh
[2020-07-15 03:36:37] Job failed
[2020-07-15 03:36:37] INFO:  Convert SIF file to sandbox...
[2020-07-15 03:36:37] INFO:  Cleaning up image...
[2020-07-15 03:36:37] FATAL:  container creation failed: mount ->/var error: can't remount /var: operation not permitted
[2020-07-15 03:36:37] ./runtime_log.err
[2020-07-15 03:36:37] ./runtime_log
03:46:38 (24630): run_atlas exited; CPU time 10.650186
03:46:38 (24630): app exit status: 0x1
03:46:38 (24630): called boinc_finish(195)

</stderr_txt>
]]>
CentOS 8.2.2004 system, not able to mount:
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
22:29:13 (13069): wrapper (7.7.26015): starting
22:29:13 (13069): wrapper: running run_atlas (--nthreads 12)
[2020-07-15 22:29:13] Arguments: --nthreads 12
[2020-07-15 22:29:13] Threads: 12
[2020-07-15 22:29:13] Checking for CVMFS

</stderr_txt>
]]>


Yesterday, before adding python-pip and additional tools:

[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/common/exception.py", line 434, in run
[2020-07-15 11:38:50]     self._Thread__target(**self._Thread__kwargs)
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/control/job.py", line 1785, in queue_monitor
[2020-07-15 11:38:50]     update_server(job, args)
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/control/job.py", line 1835, in update_server
[2020-07-15 11:38:50]     send_state(job, args, job.state, xml=dumps(job.fileinfo), metadata=metadata)
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/control/job.py", line 244, in send_state
[2020-07-15 11:38:50]     data = get_data_structure(job, state, args, xml=xml, metadata=metadata)
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/control/job.py", line 543, in get_data_structure
[2020-07-15 11:38:50]     data['cpuConsumptionUnit'] = job.cpuconsumptionunit + "+" + get_cpu_model()
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/util/workernode.py", line 185, in get_cpu_model
[2020-07-15 11:38:50]     with open("/proc/cpuinfo", "r") as f:
[2020-07-15 11:38:50] exception caught by thread run() function: (<type 'exceptions.IOError'>, IOError(2, 'No such file or directory'), <traceback object at 0x7f9ca065ac20>)
[2020-07-15 11:38:50] Traceback (most recent call last):
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/common/exception.py", line 434, in run
[2020-07-15 11:38:50]     self._Thread__target(**self._Thread__kwargs)
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/control/job.py", line 1785, in queue_monitor
[2020-07-15 11:38:50]     update_server(job, args)
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/control/job.py", line 1835, in update_server
[2020-07-15 11:38:50]     send_state(job, args, job.state, xml=dumps(job.fileinfo), metadata=metadata)
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/control/job.py", line 244, in send_state
[2020-07-15 11:38:50]     data = get_data_structure(job, state, args, xml=xml, metadata=metadata)
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/control/job.py", line 543, in get_data_structure
[2020-07-15 11:38:50]     data['cpuConsumptionUnit'] = job.cpuconsumptionunit + "+" + get_cpu_model()
[2020-07-15 11:38:50]   File "/home/ripper3/boinc/slots/1/pilot2/pilot/util/workernode.py", line 185, in get_cpu_model
[2020-07-15 11:38:50]     with open("/proc/cpuinfo", "r") as f:
[2020-07-15 11:38:50] IOError: [Errno 2] No such file or directory: '/proc/cpuinfo'


Today I am dropping any further tests, as the system is not able to mount, the issues increase as I add to the config, and packages from EPEL do not help. I might need to move to an Intel system for debugging, as the system also puts up a warning that AMD Ryzen is not tested on CentOS 8. It could be that the kernel does not work properly there.

Conclusion: I cannot tell you whether it is a dependency or lib issue here. During testing I faced more issues the deeper I dug, and it is probably a mix of issues that would require building from source and additional tools to debug. I would probably need to load a stable kernel and check each required package going in. I am not much help, but I will do a few more tests before giving up.
57) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 43052)
Posted 13 Jul 2020 by Greger
Post:
I have experienced a permission issue, as you can see in the post above. I have never run into this type of issue before on Ubuntu, CentOS or Arch. My conclusion for now is that the container, in this case Singularity (Theory with runc not tested), has permission issues that I could not get past without the application installed on the host. I did not find any indication in the error log in the slots folder that libs were missing, and the same lines are posted to the task's stderr.
I have not followed the full track of the process yet, but Singularity claims to be fine through the long startup process and then fails later, at around 10 min 4 sec, probably when it tries to start the job.

So it could be that the binaries do not end up in the correct group, or something else.

Before a focal package existed, I first tried the 18.04 package of CVMFS, and it failed. I had been waiting for the focal package, and when it was released I used the .deb and it failed, so I wiped it and got the package from the repo; it was indeed the same file, but I tested it anyway and it failed too.
Then I focused on making a build, so I cloned from git and added zlib and libssl-dev to the build as required. In the end I started native again: same issue, but on a later version. So I tried installing the Singularity version used before on both Ubuntu and CentOS, which is Singularity 3.5.2. This had a different effect: it worked perfectly. So in my view, either Singularity inside the CVMFS image did not get proper permissions on my host, as posted in the log, or the new host install of 3.5.2 changed something in the install process, or the Singularity in the image has something missing or broken.

I didn't need to change any groups for CVMFS or BOINC when I installed Singularity. It confused me that going from a /var permission error to simply installing the container runtime changed the outcome.

What I am trying to understand is how it reports that the container works but then fails. The check apparently does not include permissions, and I don't know what changes when I choose to install on the host instead. I had already installed Singularity on my other Ubuntu hosts, since it was required before, and left them running, so I would need to try on another host.


A sanity check is needed, and I will try on a fresh focal system tomorrow before doing anything more. I am not good at debugging, and if I could find the cause it would be pure luck. My conclusion for now is that it works to clone from git and add the container runtime to the host; a bonus is that you also get to run the latest version.
58) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 43036)
Posted 12 Jul 2020 by Greger
Post:
The thing is that we can't use the Singularity that comes with CVMFS on focal. So we need to install Singularity separately to avoid the copy included in CVMFS. Most versions work solidly; just try the latest stable.

This is a good reason to use singularity from CVMFS since it is always validated to work with ATLAS tasks.


This is incorrect for hosts that use Ubuntu 20.04: permission issues on every task. It posts that Singularity is OK, but it is not.

[2020-07-11 23:56:12] Using singularity image /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
[2020-07-11 23:56:12] Checking for singularity binary...
[2020-07-11 23:56:12] Singularity is not installed, using version from CVMFS
...........
[2020-07-11 23:56:23] Job failed
[2020-07-11 23:56:23] INFO:    Convert SIF file to sandbox...
[2020-07-11 23:56:23] INFO:    Cleaning up image...
[2020-07-11 23:56:23] FATAL:   container creation failed: mount ->/var error: can't remount /var: operation not permitted
[2020-07-11 23:56:23] ./runtime_log.err
[2020-07-11 23:56:23] ./runtime_log
00:06:24 (3073789): run_atlas exited; CPU time 12.010002
00:06:24 (3073789): app exit status: 0x1
00:06:24 (3073789): called boinc_finish(195)


This holds for the old CVMFS build for 18.04/19.04 (2.7.x.x), for the latest from the CernVM-FS Package Repositories for 20.04 (2.7.3.0), and also for a clone from git built as 2.8.0.0. All error out with the same line: container creation failed: mount ->/var error: can't remount /var: operation not permitted.

So for me, I can only say that the Singularity that is included does not work on all systems.
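To see whether a given host hits this, the wrapper's check can be reproduced by hand; a rough sketch using the exact paths from the task logs, guarded so it only runs where /cvmfs is mounted:

```shell
# Reproduce the two singularity calls run_atlas makes, using the CVMFS copy.
SING=/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity
IMG=/cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img

if [ -x "$SING" ]; then
    # 1. The check the wrapper logs as "Checking singularity works" - this passes.
    "$SING" exec -B /cvmfs "$IMG" hostname

    # 2. The call that actually starts the job, with the extra /var bind -
    #    on Ubuntu 20.04 this is the one that dies with "can't remount /var".
    "$SING" exec -B /cvmfs,/var "$IMG" true
else
    echo "skipped - /cvmfs not mounted on this host"
fi
```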
59) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 43034)
Posted 12 Jul 2020 by Greger
Post:
OK, I have managed to get it running, and it looks to have been Singularity all along. The Singularity that is included does not work on focal, or with CVMFS 2.7.3.0. I installed Singularity 3.5.2, which has worked well on Ubuntu and CentOS, and it started.

Before I tested that, I made a build from git and got CVMFS 2.8.0.0 instead of the 2.7.3.0 that is pushed out for focal. So try installing Singularity first, and if that fails, you could clone from git and get the latest version.
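A sketch of the separate install, wrapped in a function so nothing runs until you call it. The tarball URL pattern and build steps are my assumptions from the standard Sylabs source-install instructions for that era (a Go toolchain and the usual build deps are required):

```shell
# Hypothetical helper: build and install Singularity 3.5.2 from source so that
# run_atlas finds a host binary instead of using the CVMFS copy.
install_singularity() {
    VERSION=3.5.2
    wget "https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz" &&
    tar xzf "singularity-${VERSION}.tar.gz" &&
    cd singularity &&
    ./mconfig && make -C builddir && sudo make -C builddir install &&
    singularity --version   # once a host binary exists, the wrapper stops using the CVMFS copy
}

# Review first, then run explicitly:
# install_singularity
```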
60) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 43026)
Posted 11 Jul 2020 by Greger
Post:
OK, I will let you know when it gets fixed, in case you would like to change to focal later.




©2024 CERN