Thread 'Tasks run 4 days and finish with error'

Mr P Hucker Joined: 12 Aug 06 Posts: 469 Credit: 15,295,644 RAC: 9,181
Message 42578 - Posted: 22 May 2020, 20:00:27 UTC - in response to Message 42577.

> The following comment is a personal one, not a moderator's comment.
> Peter Hucker wrote: The deadline needs to match the time limit.
> What should it be based on? Computers running 1 h/d, 3.2 h/d, 7.6 h/d, ... Computers running Mon-Fri, Mon+Tue+Fri, Sat+Sun ... And at what buffer size? 0.32 days, 3.44 days, 7 days ...

More than what it is now. I get Theories with a 10 day deadline, which can end up with a 10 day limit. That requires 24/7 running with the task starting the instant I download it. Not going to happen. It was fine before: 4 day limit, 10 day deadline.

> And Theory only, or Theory + ATLAS,

Atlas doesn't have long runtimes as far as I've seen. They start out saying they need 8 hours, then usually finish in 4 hours.

> or LHC beside other projects (at what relative priority per project)?

Doesn't matter. Once Boinc has a timer it knows will be close to the deadline, it engages panic mode so the Theory task will run all the time.

> Beside that individual runtimes per task can't be estimated beforehand.

That's why I think they should be given a fairly short runtime limit of, say, 2 days to get 99.9% of them done. Those that don't complete in time can then be handed out separately with a much larger runtime limit and deadline. And, as I suggested, with a tickbox so enthusiasts with computers running all the time can do those ones.

> NOGOOD wrote: Yes, tickbox on web preferences - superb idea.
> Perfect idea. Feel yourself encouraged to change the BOINC code accordingly: https://github.com/BOINC/boinc/tree/server_release/1/1.2

Why would the code need to be changed? In my LHC web preferences I can choose to do Sixtrack, Sixtrack test, CMS, Theory, Atlas. Why not have another one like the Sixtrack test, but for Theory? Call it Theory large or something. It would be treated as a separate app, opt-in for those who wanted to do it, and you could have different runtimes and deadlines for it.

> In addition: What would you suggest to do with the downstream and upstream processes? They may expect the results to be delivered within a given limit. Really, all of that to get a bit more than 99.9%?

Well, I don't know how these results are used. Do the scientists need every single result back or just most of them? What are they currently doing with ones that need a month to run? Are they run on LHC's own computers?

> Might be a good idea to also check the mcplots failure rate:
> http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=557509
> http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=55945
> All computers regularly running tasks are below 6 %. That's quite reliable. Thank you guys.

Not sure what I'm looking at there. Are you saying that those of us who run a lot of Theories are failing 6% of them? I'd call that not very good. I thought you said 99.9% were ok?

> Sorry, don't want to disappoint you. Comments are always important to make the admins aware of server/project errors.

We shall argue with you until you see the light ;-)
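Editor's sketch: the arithmetic behind the "4 day limit, 10 day deadline" remark can be written out as below. This is a simplified illustration, not BOINC's scheduling logic. It assumes a worst-case task that needs its full runtime limit, and that runtime only accumulates while the host is actually running the task. The function name and its parameters are hypothetical.

```python
def required_hours_per_day(runtime_limit_days, deadline_days, start_delay_days=0.0):
    """Hours per day a host must run a worst-case task to meet the deadline.

    Assumption (simplification): the task needs the full runtime limit, and
    it cannot start until start_delay_days after download (e.g. work buffer).
    """
    window_days = deadline_days - start_delay_days
    if window_days <= 0:
        raise ValueError("no time left before the deadline")
    return 24.0 * runtime_limit_days / window_days


# Old settings from the thread: 4 day limit, 10 day deadline.
print(required_hours_per_day(4, 10))   # 9.6 hours per day is enough

# New settings from the thread: 10 day limit, 10 day deadline.
print(required_hours_per_day(10, 10))  # 24.0 hours per day, i.e. 24/7
```

Under these assumptions, any start delay at all pushes the 10-day-limit case above 24 hours per day, which is the "not going to happen" scenario described in the post.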

NOGOOD Joined: 18 Nov 17 Posts: 135 Credit: 59,156,305 RAC: 280
Message 42579 - Posted: 22 May 2020, 20:12:52 UTC - in response to Message 42577.

> Really, all of that to get a bit more than 99.9%?

Oh... I'll repeat: if successful tasks that run for many, many days are not especially valuable to the project, I give up. There is no reason to continue this discussion.

Mr P Hucker Joined: 12 Aug 06 Posts: 469 Credit: 15,295,644 RAC: 9,181
Message 42580 - Posted: 22 May 2020, 20:19:26 UTC - in response to Message 42579.

> Really, all of that to get a bit more than 99.9%?
> Oh... I'll repeat: if successful tasks running many-many days is not especially valuable for the Project, I give up. No reasons to continue this discussion.

I don't see why the admins find this so difficult. Presumably they're not interested in results over x days. So they should set the limit at that, and the deadline suitably longer so we can actually get it done, then if it fails it fails. If they need every single one completed, then those longer ones need to be done in another way, perhaps on their own fast computers, or handed out again with different deadlines and time limits.

NOGOOD Joined: 18 Nov 17 Posts: 135 Credit: 59,156,305 RAC: 280
Message 42581 - Posted: 22 May 2020, 20:23:30 UTC - in response to Message 42578.

> Well I don't know how these results are used. Do the scientists need every single result back or just most of them? What are they currently doing with ones that need a month to run? Are they run on LHC's own computers?

I support these questions. In fact, my activity here is a way of finding out the same thing.

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert) Joined: 15 Jun 08 Posts: 2755 Credit: 304,271,457 RAC: 116,232
Message 42582 - Posted: 22 May 2020, 20:25:46 UTC - in response to Message 42578.

> More than what it is now.

Unspecific? Where can "more" be set?

> ... Boinc has a timer it knows will be close to the deadline ...

No. That's exactly what BOINC does not know in case of Theory.

> Are you saying that those of us who run a lot of Theories are failing 6% of them?

It's the mcplots perspective (per computer), not the BOINC perspective. 99.9 % return rate (mcplots total) within 2.32 d runtime, but not all of them succeed.

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert) Joined: 15 Jun 08 Posts: 2755 Credit: 304,271,457 RAC: 116,232
Message 42583 - Posted: 22 May 2020, 20:38:46 UTC - in response to Message 42581.

Just change your view. Scientists define thousands of parameter sets that they want to be simulated. At that time they don't know how the tasks generated from those parameter sets will behave. Most of the tasks deliver valuable results. A few don't, and scientists check whether the successful tasks answer their questions. If not, they will investigate why, modify the parameters or applications, and generate new parameter sets.

Mr P Hucker Joined: 12 Aug 06 Posts: 469 Credit: 15,295,644 RAC: 9,181
Message 42584 - Posted: 22 May 2020, 20:45:10 UTC - in response to Message 42582.

> More than what it is now.
> Unspecific? Where can "more" be set?

The deadline is set at LHC. Yours is 10 days, Universe is 14, Rosetta is 3.

> ... Boinc has a timer it knows will be close to the deadline ...
> No. That's exactly what BOINC does not know in case of Theory.

Of course it does. There's a limiter at which point the task gives up. This was 4 days and now is sometimes 10 days. After the task has taken more than the average time (or something else tells Boinc it's now a long task, I'm not sure what makes it change over), the "time left" Boinc reports jumps to 4 or 10 days. Boinc knows the deadline too. If the time left will go past or close to the deadline, Boinc runs that task at high priority.

> Are you saying that those of us who run a lot of Theories are failing 6% of them?
> It's the mcplots perspective (per computer), not the BOINC perspective. 99.9 % return rate (mcplots total) within 2.32 d runtime, but not all of them succeed.

Sorry, I have no idea what that means.

Mr P Hucker Joined: 12 Aug 06 Posts: 469 Credit: 15,295,644 RAC: 9,181
Message 42585 - Posted: 22 May 2020, 20:47:51 UTC - in response to Message 42583.

> Just change your view. Scientists define thousands of parameter sets that they want to be simulated. At that time they don't know how the tasks generated from that parameter sets behave. Most of the tasks deliver valuable results. A few don't and scientists check if the successful tasks answer their questions. If not they will investigate why and modify the parameters or applications and generate new parameter sets.

It's not a view, it's a question we were both asking. It seems to be very difficult to find out what scientists actually do with these task results. Do they need them all back, do they only need most of them, does it matter if some fail? Do those that failed get run elsewhere on your own computers? Do they get broken down into smaller tasks?

NOGOOD Joined: 18 Nov 17 Posts: 135 Credit: 59,156,305 RAC: 280
Message 42586 - Posted: 22 May 2020, 20:57:30 UTC - in response to Message 42585. Last modified: 22 May 2020, 21:35:04 UTC

> Just change your view. Scientists define thousands of parameter sets that they want to be simulated. At that time they don't know how the tasks generated from that parameter sets behave. Most of the tasks deliver valuable results. A few don't and scientists check if the successful tasks answer their questions. If not they will investigate why and modify the parameters or applications and generate new parameter sets.
> It's not a view, it's a question we were both asking. It seems to be very difficult to find out what scientists actually do with these task results. Do they need them all back, do they only need most of them, does it matter if some fail? Do those that failed get run elsewhere on your own computers? Do they get broken down into smaller tasks?

Exactly. If they do not need them all back, we should set a runtime limit of the 2.32 days mentioned, feel useful, and do nothing more.

Chris Jenks Joined: 16 Jun 06 Posts: 10 Credit: 3,245,057 RAC: 0
Message 42587 - Posted: 22 May 2020, 22:02:31 UTC - in response to Message 42568.

I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs. I now have, for the five jobs:

Elapsed: 5d 22:33:00
Remaining: 4d 01:37:00
Deadline: 5/26/20 10:43:19 AM
It is currently 5/22/20 11:06:00 AM

Is that the total of all 5? What is each one at?
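Editor's sketch: the figures in this post can be checked with a few lines of date arithmetic. It assumes the dates are in month/day/year order, that the "Remaining" estimate is accurate, and that the task runs without interruption. The helper name `deadline_margin` is hypothetical and is not part of BOINC.

```python
from datetime import datetime, timedelta


def deadline_margin(now, remaining, deadline):
    """Return deadline minus projected finish; a negative value means a miss."""
    return deadline - (now + remaining)


# Figures quoted in the post (assumed month/day/year, local time).
now = datetime(2020, 5, 22, 11, 6, 0)
remaining = timedelta(days=4, hours=1, minutes=37)
deadline = datetime(2020, 5, 26, 10, 43, 19)

margin = deadline_margin(now, remaining, deadline)
print(margin < timedelta(0))  # prints True: projected finish is past the deadline
print(-margin)                # prints 1:59:41, the size of the overrun
```

So, on the reported numbers, the jobs would finish roughly two hours after the deadline even with uninterrupted running.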

Mr P Hucker Joined: 12 Aug 06 Posts: 469 Credit: 15,295,644 RAC: 9,181
Message 42588 - Posted: 22 May 2020, 22:16:05 UTC - in response to Message 42587. Last modified: 22 May 2020, 22:16:54 UTC

> I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs. I now have, for the five jobs:
> Elapsed: 5d 22:33:00
> Remaining: 4d 01:37:00
> Deadline: 5/26/20 10:43:19 AM
> It is currently 5/22/20 11:06:00 AM
> Is that the total of all 5? What is each one at?

Strange. That computer has returned 25 good theories recently in just over an hour on average. It's odd that you now have 5 that are all taking that long. I usually just get the odd one that's a long runner. Mind you, you have 11 errors returned in about 13 minutes each. Do you know what happened to them? Was the computer being rebooted at the time, or computation paused for a game, or tasks swapped to run another project?

Chris Jenks Joined: 16 Jun 06 Posts: 10 Credit: 3,245,057 RAC: 0
Message 42589 - Posted: 22 May 2020, 22:49:24 UTC - in response to Message 42588. Last modified: 22 May 2020, 22:54:21 UTC

> Strange. That computer has returned 25 good theories recently in just over an hour on average. It's odd that you now have 5 that are all taking that long. I usually just get the odd one that's a long runner. Mind you, you have 11 errors returned in about 13 minutes each. Do you know what happened to them? Was the computer being rebooted at the time, or computation paused for a game, or tasks swapped to run another project?

I must admit I'm rather new to running LHC@Home on virtual machines, starting only a month ago. I have had errors since then, as my logs show. At first it was due to this being done on a new PC needing reboots, and whenever I rebooted I would find a mess of aborted machines on the VirtualBox which wouldn't clean themselves up, and I seem to remember problems with the following jobs until I went in and manually logged the machines out and removed them all.

Even now, LHC@Home thinks I am running three ATLAS jobs I don't have, and when I look at the VM console for the first Theory job I get this:

[screenshot not preserved]

I assume the rest are the same. Forgive my ignorance, but is there any point letting this apparently crashed process sit on my system, pretending to be using up a hyperthread? Or is the error non-fatal? The error looks like it is due to a network problem, in which case the job could complete successfully if re-run.

Mr P Hucker Joined: 12 Aug 06 Posts: 469 Credit: 15,295,644 RAC: 9,181
Message 42590 - Posted: 22 May 2020, 22:58:05 UTC - in response to Message 42589.

> I must admit I'm rather new to running LHC@Home on virtual machines, starting only a month ago. I have had errors since then, as my logs show. At first it was due to this being done on a new PC needing reboots, and whenever I rebooted I would find a mess of aborted machines on the VirtualBox which wouldn't clean themselves up, and I seem to remember problems with the following jobs until I went in and manually logged the machines out and removed them all.

All I can think of is to make sure you're running the latest Virtualbox and Extension Pack. Boinc doesn't always supply these, so I get them straight from the Oracle site: launch Virtualbox from the start menu and you should get prompted to update, or it's in a menu somewhere. I've always used the latest version and never had a problem. Very occasionally a Theory won't continue after a reboot etc, but mostly they just continue like any other task. Those that won't just say "computation error", and Boinc starts another. I don't get them jammed and I don't have to intervene.

> Even now, LHC@Home thinks I am running three ATLAS jobs I don't have, and when I look at the VM console for the first Theory job I get this: I assume the rest are the same. Forgive my ignorance, but is there any point letting this apparently crashed process sit on my system, pretending to be using up a hyperthread? Or is the error non-fatal? If I abort the job, will LHC@Home still find out that the job crashed so it won't get farmed out to someone else and waste 10 days of their time too? And even if I let it "run", if it is returned past the deadline as it promises to do, will it be in time to prevent it being reissued?

Not sure what that error means; it looks like it can't download something from the CERN server. Hopefully an admin in here can explain.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1312 Credit: 97,694,594 RAC: 106,766
Message 42591 - Posted: 23 May 2020, 9:46:46 UTC - in response to Message 42565.

NG, you might want to update your VB version, since I think you are still running a 2019 version (VirtualBox 5.2.34, released October 15 2019) and Oracle tends to do lots of updates to fix the usual problems.

The one I use here, VirtualBox 6.1.6, isn't the newest on the list, but it works. I have also tested lots of them with VirtualBox 6.1.8 and had no problems running Theory tasks.

https://www.virtualbox.org/wiki/Download_Old_Builds
https://www.virtualbox.org/wiki/Downloads

I have even had good luck with Sherpa and the many other event generators.

> Peter Hucker: I have always used the latest version and never had any problems.

(damn, this thread is faster than the server)

Yes Peter, I knew you had been up to date. That was for NOGOOD, and I figured he would see that I said that to NG and go from there.

Chris Jenks Joined: 16 Jun 06 Posts: 10 Credit: 3,245,057 RAC: 0
Message 42596 - Posted: 23 May 2020, 18:15:54 UTC - in response to Message 42591.

> NG You might want to update your VB version since I think you are still running a 2019 version and Oracle tends to do lots of updates to fix the usual problems. (VirtualBox 5.2.34 (released October 15 2019) Mine isn't on the newest list ( that I use here) VirtualBox 6.1.6 but it works And I also have tested lots of them with VirtualBox 6.1.8 and no problems running Theory tasks
> https://www.virtualbox.org/wiki/Download_Old_Builds
> https://www.virtualbox.org/wiki/Downloads
> I even have had good luck with the Sherpa and the many other event generators.
> Peter Hucker I have always used the latest version and never had any problems.
> (damn this thread is faster than the server) Yes Peter I knew you had been up to date and that was for NOGOOD and I figured he would see that I said that to NG and go from there.

The latest version of VirtualBox available is 6.1.8. Mine is 6.1.6, but to upgrade it I would have to end my LHC jobs. I've even wondered if the recentness of my VirtualBox is a problem, since BOINC recommends the older version they distribute with the BOINC package.

It is possible that my network was down at the moment my latest five jobs were issued, but not very likely. Plus it would be nice if the software would try again. So I take it I can abort these five jobs and save three days of imaginary crunching?

Mr P Hucker Joined: 12 Aug 06 Posts: 469 Credit: 15,295,644 RAC: 9,181
Message 42597 - Posted: 23 May 2020, 18:53:46 UTC - in response to Message 42596.

> The latest version of VirtualBox available is 6.1.8. Mine is 6.1.6, but to upgrade it I would have to end my LHC jobs.

I didn't have to end mine. I exited Boinc manager and ticked "stop tasks". Then I upgraded Virtualbox. Then I restarted Boinc. The tasks continued where they left off.

> I've even wondered if the recentness of my VirtualBox is a problem, since BOINC recommends the older version they distribute with the BOINC package.

No idea why they say that. With LHC anyway, the latest one always works fine.

> It is possible that my network was down at the moment my latest five jobs were issued, but not very likely. Plus it would be nice if the software would try again. So I take it I can abort these five jobs and save three days of imaginary crunching?

Are they using your CPU time in the task manager? If not, abort them. If they are, you could wait and see if a couple are long runners that finish within the time frame, but it's not likely.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1312 Credit: 97,694,594 RAC: 106,766
Message 42598 - Posted: 23 May 2020, 20:44:54 UTC - in response to Message 42597. Last modified: 23 May 2020, 20:50:48 UTC

TRUE. I have been running them for over 9 years now, and to do the VB/Extension Pack update I just suspend my running tasks, do the VB update, reboot, and restart the tasks.

But make sure they are suspended by first checking your VB Manager (and as for what running tasks are doing, you can always check the task logs in the VB Manager).

Oh, and I never go by "BOINC recommends the older version", since BOINC doesn't update as often as Oracle VB does. Oracle tends to update more often than Windows 10 does. That recommendation is basically still there for new members.
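Editor's sketch: the "check that nothing is still running before you upgrade" step can also be done from a script instead of the VirtualBox Manager window. The example below assumes the `VBoxManage` command-line tool that ships with VirtualBox is on the PATH, and relies on `VBoxManage list runningvms` printing one `"name" {uuid}` line per running VM. The function names and the sample VM name are hypothetical.

```python
import re
import subprocess

# One line of `VBoxManage list runningvms` output looks like: "name" {uuid}
_VM_LINE = re.compile(r'^"(?P<name>.*)" \{(?P<uuid>[0-9a-fA-F-]+)\}$')


def parse_running_vms(output):
    """Extract VM names from the text printed by `VBoxManage list runningvms`."""
    names = []
    for line in output.splitlines():
        match = _VM_LINE.match(line.strip())
        if match:
            names.append(match.group("name"))
    return names


def running_vms(vboxmanage="VBoxManage"):
    """Ask VirtualBox which VMs are running (assumes VBoxManage is on the PATH)."""
    result = subprocess.run(
        [vboxmanage, "list", "runningvms"],
        capture_output=True, text=True, check=True,
    )
    return parse_running_vms(result.stdout)


# Illustrative sample only; real VM names on a given host will differ.
sample = '"boinc_example" {01234567-89ab-cdef-0123-456789abcdef}\n'
print(parse_running_vms(sample))  # prints ['boinc_example']
```

An empty list from `running_vms()` would suggest that no VM is still active, which is the condition described above before starting the VirtualBox update.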

Mr P Hucker Joined: 12 Aug 06 Posts: 469 Credit: 15,295,644 RAC: 9,181
Message 42599 - Posted: 23 May 2020, 21:06:23 UTC - in response to Message 42598.

> TRUE I have been running them for over 9 years now and to do the VB/Extension Pack update I just suspend my running tasks and do the VB update and reboot and restart the tasks.

I wasn't sure about doing that, especially since I have ticked "leave in memory while suspended", since that's meant to make them more likely to be able to continue if Boinc suspends them for playing a game, or to run another project. And anyway I wasn't sure if "suspend" actually closed Virtualbox.

> OH and I never go by "BOINC recommends the older version" since they don't update as often as Oracle VB....... they tend to update more than Windows 10 does That is basically still there for new members.

I don't see any point in recommending an older version.

Chris Jenks Joined: 16 Jun 06 Posts: 10 Credit: 3,245,057 RAC: 0
Message 42600 - Posted: 24 May 2020, 1:34:15 UTC - in response to Message 42597.

> It is possible that my network was down at the moment my latest five jobs were issued, but not very likely. Plus it would be nice if the software would try again. So I take it I can abort these five jobs and save three days of imaginary crunching?
> Are they using your CPU time in the task manager? If not, abort them. If they are, you could wait and see if a couple are long runners that finish within the time frame, but it's not likely.

Until you asked I hadn't noticed I could expand the BOINC tasks apart in Task Manager (Windows isn't my primary OS) and I see only Rosetta and WCG using my CPU:

[screenshot not preserved]

So I will abort the LHC jobs. Thanks for all the help.

Chris Jenks Joined: 16 Jun 06 Posts: 10 Credit: 3,245,057 RAC: 0
Message 43177 - Posted: 3 Aug 2020, 23:22:26 UTC - in response to Message 42600.

I'm getting very good at aborting defunct jobs in BOINC manager and deleting the corresponding machines (which don't clean up by themselves) in VirtualBox. I wonder if I am the only one afflicted by ~80% of the jobs I receive stalling in an error?

[screenshot not preserved]

In this image I am referring to the second job, which claims to be running in BOINC manager but isn't actually working. It is wasting a thread, and will continue to waste a thread until I manually delete the job. Not only is this tedious to keep doing, it is wasting a good fraction of my computer's processing capacity on an ongoing basis, because I usually have ten such defunct jobs at a time.

Both BOINC manager (version 7.26.7) and 64 bit VirtualBox (version 6.1.12) are up to date, running on Windows 10 Pro. Everything is stock. I don't know what to fix to get jobs to work, assuming everybody else's jobs are working.