Message boards : Theory Application : New version 263.95

Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 39276 - Posted: 4 Jul 2019, 15:35:04 UTC - in response to Message 39275.  

As mentioned a couple of times this is caused by the fact that ATLAS doesn't correctly respect the #cores parameter (as it was originally introduced).


This is an ongoing problem that can be compensated for but it's confusing for a lot of volunteers. It really needs to be fixed.

It is because their accountants insist that they be counted wrong. I am not making this up.
Don't ask again, or they might make it worse.
ID: 39276 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1115
Credit: 49,710,908
RAC: 14,193
Message 39282 - Posted: 4 Jul 2019, 20:03:58 UTC

ID: 39282 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1115
Credit: 49,710,908
RAC: 14,193
Message 39285 - Posted: 5 Jul 2019, 9:59:42 UTC - in response to Message 39282.  

Well I got 23 Valids and many more running BUT I got 3 of these https://lhcathome.cern.ch/lhcathome/result.php?resultid=236609765

197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED

And just one of these for some reason https://lhcathome.cern.ch/lhcathome/result.php?resultid=236599310

194 (0x000000C2) EXIT_ABORTED_BY_CLIENT

And then two of those typical
[ERROR] Condor ended after 1032 seconds.
But the next 2 on that host are running now.
3am
ID: 39285 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1115
Credit: 49,710,908
RAC: 14,193
Message 39291 - Posted: 6 Jul 2019, 5:40:11 UTC
Last modified: 6 Jul 2019, 6:31:42 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=236608596

12 of these now and I have seen hundreds for other members with these.

(I just took a quick look: the single-core and 2-core tasks are working for me but the 4-core tasks are not, and it isn't because of RAM. I saw quite a few tasks from other members that failed with a single core, but I didn't look at what VB version they used. Right now only a few of us are running this VB Theory version, and for some reason most of them have their PCs hidden, so we can't see whether they have the same problems or other types.)

I can think of another name for this error right now
ID: 39291 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1272
Credit: 8,479,164
RAC: 2,361
Message 39293 - Posted: 6 Jul 2019, 7:45:04 UTC - in response to Message 39291.  

12 of these now and I have seen hundreds for other members with these.

The best solution is to increase the value of <rsc_fpops_bound>2000000000000000.000000</rsc_fpops_bound> tenfold, which has to be done server-side by the admins.

Your temporary solution is to decrease the number of cores, or possibly to reduce the value of <job_duration>64800</job_duration> in the file Theory_2017_05_29.xml (project directory) to 43200 or even lower if needed.
ID: 39293 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,352,862
RAC: 123,063
Message 39294 - Posted: 6 Jul 2019, 7:57:49 UTC - in response to Message 39293.  

... reducing the value of <job_duration>64800</job_duration> in the file Theory_2017_05_29.xml (project directory) to 43200 or even lower if needed.

Not good as it will kill the running job when it hits the limit.
Reducing the number of cores on the web preferences page would be better until the admins have raised rsc_fpops_bound.
ID: 39294 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1272
Credit: 8,479,164
RAC: 2,361
Message 39295 - Posted: 6 Jul 2019, 8:07:08 UTC - in response to Message 39294.  

Not good as it will kill the running job when it hits the limit.
That's true, but that is what happens now too, only a bit later, leaving the user with an errored task and no credit although he/she has done ~13 hours of useful work.
ID: 39295 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1115
Credit: 49,710,908
RAC: 14,193
Message 39310 - Posted: 7 Jul 2019, 9:21:57 UTC - in response to Message 39308.  
Last modified: 7 Jul 2019, 9:23:52 UTC


Edit: link on URL added for you computezmle and unhidden host and for MAGIC Quantum Mechanic as you unlike hidden host for theory.


Thank you PurpleHat

It helps those of us who run hundreds of these tasks to be able to look at other members' hosts, to see if we are all having the same errors and often to compare similar computers when we have problems here or where we test the tasks before they get here.
Hidden computers don't help us with these projects. Any time I have a problem I prefer to check whether it is just me or whether the same problem happens for many others running the same tasks on the same OS, and to see how many cores per task they use, for the same reasons.
ID: 39310 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 675
Credit: 43,523,451
RAC: 15,579
Message 39323 - Posted: 9 Jul 2019, 18:03:40 UTC

I just started to do Theory tasks with the new app version on one of my hosts after a one-week break. All tasks are failing after about 13.5 hours with error 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED. Why are these tasks not allowed to continue to the 18-hour mark like they did on the previous app version?

Here's one https://lhcathome.cern.ch/lhcathome/result.php?resultid=236942503
ID: 39323 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 102,936,740
RAC: 125,148
Message 39324 - Posted: 9 Jul 2019, 19:41:16 UTC

Harri, what is the #CPUs in your Web Settings?
ID: 39324 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1272
Credit: 8,479,164
RAC: 2,361
Message 39325 - Posted: 9 Jul 2019, 19:43:17 UTC - in response to Message 39324.  

Harri, what is the #CPUs in your Web Settings?
A part of the answer: Setting Memory Size for VM. (1730MB) 11 cores!
ID: 39325 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 675
Credit: 43,523,451
RAC: 15,579
Message 39329 - Posted: 9 Jul 2019, 20:48:52 UTC - in response to Message 39325.  

Harri, what is the #CPUs in your Web Settings?
A part of the answer: Setting Memory Size for VM. (1730MB) 11 cores!

The number of CPUs in web settings is 4 but I run all tasks just with 1 CPU, I use 4 CPUs when I run Atlas tasks. These settings come from app_config.xml. The computer has 64GB memory so that is not a problem (20 GB now in use when 10 Theory tasks are running concurrently, also 1 CPDN task and 2 Einstein GPU tasks and 2 Seti GPU tasks are running).

So should I change the number of CPUs on web settings to 1?
ID: 39329 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 675
Credit: 43,523,451
RAC: 15,579
Message 39332 - Posted: 10 Jul 2019, 9:12:29 UTC

During the night a few tasks were valid. Their runtime was below 51309 seconds, which seems to be the limit for this 197 error (2000000.00G/38.72G). I don't know where the 38.72G comes from.
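
A rough way to check the figures in the parentheses, assuming the client gets the time limit by dividing rsc_fpops_bound by its speed estimate for the task (the second figure). The function name is made up for illustration and the formula is an assumption read off the message, not taken from the BOINC source:

```python
def elapsed_time_limit(fpops_bound: float, flops_estimate: float) -> float:
    """Assumed formula: seconds allowed = fpops bound / estimated speed."""
    if flops_estimate <= 0:
        raise ValueError("flops_estimate must be positive")
    return fpops_bound / flops_estimate


RSC_FPOPS_BOUND = 2_000_000e9  # "2000000.00G" from the abort message
FLOPS_ESTIMATE = 38.72e9       # "38.72G" from the abort message
JOB_DURATION = 64_800          # <job_duration> in Theory_2017_05_29.xml (18 h)

current = elapsed_time_limit(RSC_FPOPS_BOUND, FLOPS_ESTIMATE)
tenfold = elapsed_time_limit(10 * RSC_FPOPS_BOUND, FLOPS_ESTIMATE)

print(f"current bound: {current:.0f} s = {current / 3600:.1f} h")
print(f"tenfold bound: {tenfold:.0f} s = {tenfold / 3600:.1f} h")
print("limit hits before job_duration:", current < JOB_DURATION)
```

With these figures the limit comes out at roughly 51,650 s (about 14.3 h), in the same range as the runtimes seen here and well short of the 18-hour job_duration. Under the same assumption, a tenfold bound would push the limit far beyond 18 hours.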

The problem with these failed tasks is that they are not cleared from VM Manager; they stay there until manually removed (or maybe a BOINC restart would clear them, I haven't tried). This has also created a problem with some tasks that ran for 43 seconds and are now postponed. I will just abort them and set the web preferences to 1 CPU before downloading any new Theory tasks.

It is Windows 10 Patch Tuesday, so I'll just wait for the updates and then restart my computer.
ID: 39332 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1272
Credit: 8,479,164
RAC: 2,361
Message 39345 - Posted: 13 Jul 2019, 5:27:25 UTC

Even VMs with a single-core setting can suffer from an <rsc_fpops_bound> that is set too low. Admins, please increase that value (tenfold).

https://lhcathome.cern.ch/lhcathome/result.php?resultid=237057862

LHC@home 13 Jul 03:07:03 Aborting task Theory_3495168_1562869652.901284_0: exceeded elapsed time limit 103033.71 (2000000.00G/19.34G)

This time it was a long running Sherpa needing more time: ===> [runRivet] Fri Jul 12 07:17:07 CEST 2019 [boinc pp ue 200 4 - sherpa 2.2.4 default 7000 78]
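
For anyone collecting these abort messages from their own event log, a small sketch that pulls the task name and the three figures out of a line like the one above. The pattern is written against this single quoted message only, and the function and field names are invented for the example:

```python
import re

# Pattern based on the one abort message quoted in this post.
ABORT_RE = re.compile(
    r"Aborting task (?P<task>\S+): exceeded elapsed time limit "
    r"(?P<limit>[\d.]+) \((?P<bound_g>[\d.]+)G/(?P<flops_g>[\d.]+)G\)"
)


def parse_abort_line(line):
    """Return the figures from an abort message, or None if it does not match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    return {
        "task": m["task"],
        "limit_s": float(m["limit"]),          # elapsed time limit in seconds
        "bound_gflop": float(m["bound_g"]),    # first figure in parentheses
        "speed_gflops": float(m["flops_g"]),   # second figure in parentheses
    }


line = ("LHC@home 13 Jul 03:07:03 Aborting task "
        "Theory_3495168_1562869652.901284_0: exceeded elapsed time limit "
        "103033.71 (2000000.00G/19.34G)")
info = parse_abort_line(line)
print(info["task"], f"{info['limit_s'] / 3600:.1f} h")
```

For this task the limit works out to about 28.6 hours, so the Sherpa job was aborted on the fpops bound even with a single core.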
ID: 39345 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,733,549
RAC: 128,321
Message 39346 - Posted: 13 Jul 2019, 5:48:21 UTC

Crystal there is a new Version 263.96 active since yesterday?
Do you know the changes?
ID: 39346 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1115
Credit: 49,710,908
RAC: 14,193
Message 39347 - Posted: 13 Jul 2019, 6:50:59 UTC - in response to Message 39346.  

Crystal there is a new Version 263.96 active since yesterday?
Do you know the changes?



I was about to try several Theory tasks again, since my new month of high-speed satellite starts in 15 minutes. But since it is a new vdi, I will wait and see if anyone else has these working on Windows 10 before I do that, so I don't end up having to do it again if they don't work.

Instead I am going to run several CMS-dev and see if they decide to work just by having them start faster since they were not working last week on the 10 OS

(and run sixtracks here for now since they don't depend on internet speed or even connection)
ID: 39347 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,352,862
RAC: 123,063
Message 39348 - Posted: 13 Jul 2019, 7:21:05 UTC

The server still sends out v263.95.
I guess it has not been restarted.

A check of the manually downloaded vdi shows that the CVMFS typo has been corrected that I mentioned here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=479&postid=6435
It might solve the X509 error when it is available.
ID: 39348 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 102,936,740
RAC: 125,148
Message 39363 - Posted: 16 Jul 2019, 5:15:49 UTC - in response to Message 39348.  

The server still sends out v263.95.
I guess it has not been restarted. ...
It might solve the X509 error when it is available.
In fact, I haven't had this error with v263.95 (only with v263.90).
ID: 39363 · Report as offensive     Reply Quote


©2024 CERN