New version 263.95
Author | Message |
---|---|
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
As mentioned a couple of times, this is caused by the fact that ATLAS doesn't correctly respect the #cores parameter (as it was originally introduced). It is because their accountants insist that they be counted wrong. I am not making this up. Don't ask again, or they might make it worse. |
Joined: 24 Oct 04 Posts: 1115 Credit: 49,710,908 RAC: 14,193 |
14 valids and just one of these so far: https://lhcathome.cern.ch/lhcathome/result.php?resultid=236599310 |
Joined: 24 Oct 04 Posts: 1115 Credit: 49,710,908 RAC: 14,193 |
Well, I got 23 valids and many more running, BUT I got 3 of these: https://lhcathome.cern.ch/lhcathome/result.php?resultid=236609765 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED. And just one of these for some reason: https://lhcathome.cern.ch/lhcathome/result.php?resultid=236599310 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT. And then two of those typical "[ERROR] Condor ended after 1032 seconds." failures, but the next 2 on that host are running now. 3am |
Joined: 24 Oct 04 Posts: 1115 Credit: 49,710,908 RAC: 14,193 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=236608596 12 of these now and I have seen hundreds for other members with these. (Just took a quick look: the single-core and 2-core tasks are working for me but not the 4-core tasks, and it isn't because of RAM. I saw quite a few on other members' hosts that failed with a single core, but I didn't look at what VB version they used. Now we only have a few members running this VB Theory version, and of course for some reason most of them have their PCs hidden, so we can't see if they have the same problems or other types.) I can think of another name for this error right now. |
Joined: 14 Jan 10 Posts: 1272 Credit: 8,479,164 RAC: 2,361 |
"12 of these now and I have seen hundreds for other members with these." The best solution is to increase the value of <rsc_fpops_bound>2000000000000000.000000</rsc_fpops_bound> tenfold, which should be done server-side by the admins. Your temporary solution is to decrease the # of cores, or possibly reducing the value of <job_duration>64800</job_duration> in the file Theory_2017_05_29.xml (project directory) to 43200 or even lower if needed. |
Joined: 15 Jun 08 Posts: 2401 Credit: 225,352,862 RAC: 123,063 |
"... reducing the value of <job_duration>64800</job_duration> in the file Theory_2017_05_29.xml (project directory) to 43200 or even lower if needed." Not good as it will kill the running job when it hits the limit. Reducing the #cores on the web preferences page would be better until the admins have raised rsc_fpops_bound. |
Joined: 14 Jan 10 Posts: 1272 Credit: 8,479,164 RAC: 2,361 |
"Not good as it will kill the running job when it hits the limit." That's true, but it is being killed now too, just a bit later, which leaves the user with an errored task and no credit although he/she has done ~13 hours of useful work. |
Joined: 24 Oct 04 Posts: 1115 Credit: 49,710,908 RAC: 14,193 |
Thank you, PurpleHat. It helps those of us who run hundreds of these tasks to be able to look at other members' hosts to see if we are all having the same errors, and many times to compare similar computers when we have any problems here or where we test them before they get here. Hidden computers don't help us with these projects. Any time I have any problem at all, I prefer to look and see if it is just me or if the same problem happens for many others running the same tasks on the same OS, and to see how many cores per task they use, for the same reasons. |
Joined: 28 Sep 04 Posts: 675 Credit: 43,523,451 RAC: 15,579 |
I just started to do Theory tasks with the new app version on one of my hosts after a one-week break. All tasks are failing after about 13.5 hours with error 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED. Why are these tasks not allowed to continue to the 18-hour mark like they used to on the previous app version? Here's one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=236942503 |
Joined: 18 Dec 15 Posts: 1687 Credit: 102,936,740 RAC: 125,148 |
Harri, what is the #CPUs in your Web Settings? |
Joined: 14 Jan 10 Posts: 1272 Credit: 8,479,164 RAC: 2,361 |
"Harri, what is the #CPUs in your Web Settings?" A part of the answer: Setting Memory Size for VM. (1730MB) 11 cores! |
Joined: 28 Sep 04 Posts: 675 Credit: 43,523,451 RAC: 15,579 |
"Harri, what is the #CPUs in your Web Settings?" "A part of the answer: Setting Memory Size for VM. (1730MB) 11 cores!" The number of CPUs in web settings is 4, but I run all Theory tasks with just 1 CPU; I use 4 CPUs when I run ATLAS tasks. These settings come from app_config.xml. The computer has 64 GB of memory, so that is not a problem (20 GB is in use now with 10 Theory tasks running concurrently, plus 1 CPDN task, 2 Einstein GPU tasks and 2 Seti GPU tasks). So should I change the number of CPUs in web settings to 1? |
Joined: 28 Sep 04 Posts: 675 Credit: 43,523,451 RAC: 15,579 |
During the night a few tasks were valid. Their runtime was < 51309 seconds, which seems to be the limit for this 197 error (2000000.00G/38.72G). I don't know where the 38.72G comes from. The problem with these failed tasks is that they are not cleared from the VM Manager; they stay there until manually removed (or maybe a BOINC restart might clear them, I haven't tried). This has also created a problem with some tasks that ran for 43 seconds and are now postponed. I will just abort them and set the web preferences to 1 CPU before downloading any new Theory tasks. It is Win 10 Patch Tuesday, so I'll just wait for the updates and then restart my computer. |
Joined: 14 Jan 10 Posts: 1272 Credit: 8,479,164 RAC: 2,361 |
Even VMs with a single-core setting can suffer from a <rsc_fpops_bound> that is defined too low. Admins, please increase that value (tenfold). https://lhcathome.cern.ch/lhcathome/result.php?resultid=237057862 LHC@home 13 Jul 03:07:03 Aborting task Theory_3495168_1562869652.901284_0: exceeded elapsed time limit 103033.71 (2000000.00G/19.34G) This time it was a long-running Sherpa job needing more time: ===> [runRivet] Fri Jul 12 07:17:07 CEST 2019 [boinc pp ue 200 4 - sherpa 2.2.4 default 7000 78] |
Joined: 2 May 07 Posts: 2090 Credit: 158,733,549 RAC: 128,321 |
Crystal there is a new Version 263.96 active since yesterday? Do you know the changes? |
Joined: 24 Oct 04 Posts: 1115 Credit: 49,710,908 RAC: 14,193 |
"Crystal there is a new Version 263.96 active since yesterday?" I was about to try several Theory tasks again since my new month of high-speed satellite starts in 15 minutes, but since it is a new vdi I will wait and see if anyone else has these working on the Win 10 OS before I do that, so I don't end up having to download it again if they don't work. Instead I am going to run several CMS-dev tasks and see if they decide to work just by having them start faster, since they were not working last week on the 10 OS (and run sixtracks here for now since they don't depend on internet speed or even a connection). |
Joined: 15 Jun 08 Posts: 2401 Credit: 225,352,862 RAC: 123,063 |
The server still sends out v263.95. I guess it has not been restarted. A check of the manually downloaded vdi shows that the CVMFS typo I mentioned here has been corrected: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=479&postid=6435 It might solve the X509 error once it is available. |
Joined: 18 Dec 15 Posts: 1687 Credit: 102,936,740 RAC: 125,148 |
"The server still sends out v263.95." In fact, I haven't had this error with v263.95 (only with v263.90). |
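
For anyone wondering where the ~51,000 second cutoff in this thread comes from, here is a minimal sketch of the arithmetic. It assumes, based only on the log format quoted above ("exceeded elapsed time limit ... (2000000.00G/38.72G)"), that the limit is the task's rsc_fpops_bound divided by the client's speed estimate for the app. The function names are hypothetical and not part of BOINC; the 38.72G figure is the one reported for one host in the thread and will differ per host.

```python
# Sketch of the elapsed-time limit arithmetic discussed in this thread.
# Assumption (inferred from the quoted log line, not from BOINC source):
#   limit_seconds = rsc_fpops_bound / estimated_flops
# All function names below are hypothetical helpers for illustration.

GIGA = 1e9


def elapsed_time_limit(rsc_fpops_bound: float, est_flops: float) -> float:
    """Return the elapsed-time limit in seconds for a given bound and speed estimate."""
    if est_flops <= 0:
        raise ValueError("est_flops must be positive")
    return rsc_fpops_bound / est_flops


def killed_before_job_end(rsc_fpops_bound: float, est_flops: float,
                          job_duration: float) -> bool:
    """True if the limit would be reached before job_duration seconds have passed."""
    return elapsed_time_limit(rsc_fpops_bound, est_flops) < job_duration


if __name__ == "__main__":
    bound = 2000000000000000.0   # <rsc_fpops_bound> value quoted in the thread
    flops = 38.72 * GIGA         # per-host estimate from the quoted log line
    job_duration = 64800         # <job_duration> from Theory_2017_05_29.xml (18 h)

    limit = elapsed_time_limit(bound, flops)
    print(f"limit: {limit:.0f} s ({limit / 3600:.1f} h)")
    print("killed before 18 h:", killed_before_job_end(bound, flops, job_duration))
    # With the tenfold bound requested from the admins:
    print("killed with 10x bound:",
          killed_before_job_end(10 * bound, flops, job_duration))
```

Under that assumption the limit on this host is roughly 14.3 hours, which is below the 18-hour job_duration and consistent with valid tasks finishing just under 51309 seconds. A tenfold bound would move the limit well past 18 hours on the same host.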
©2024 CERN