Message boards : Number crunching : Big problem: work units running with negative time.

jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,807,848
RAC: 10
Message 25744 - Posted: 4 Sep 2013, 10:17:14 UTC

I am getting a completely new experience with many of the new work units. They run to about 0.030% completion in 14 seconds on my computer, and then go back to zero %. The elapsed time in all those cases, as displayed in BOINC manager, also jumps back from around 14 to 4 seconds. Something I did not know was possible. In other words, elapsed time is jumping backwards.

I'm aborting those jobs now, because they just seem to be completely stuck without progress. Because it's my bed time I will suspend LHC for now. Please let me know if and when it is safe to resume LHC again.
ID: 25744
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25745 - Posted: 4 Sep 2013, 12:58:49 UTC - in response to Message 25744.  

We have looked at the workload in question. The results coming back are good.
The only thing is that this workload is for very steep amplitudes, which may be lost very quickly.
The jobs may run for a very short time and put more pressure on the network than on the processor.

Also, now we have a new executable in production, which contains new physics.

In conclusion, please continue crunching and report back again if you discover new problems.
ID: 25745
pvh

Joined: 17 Jun 13
Posts: 8
Credit: 6,548,286
RAC: 0
Message 25749 - Posted: 4 Sep 2013, 17:31:04 UTC

I too had 4 WUs thoroughly stuck at 0% for over 10 hours. They were consuming 0% CPU despite the fact that BOINC claimed that they were running. I aborted these units. There definitely is a problem with the new client. I successfully ran sixtrack units before. This is on openSUSE 12.3 with BOINC 7.0.65.
ID: 25749
pvh

Joined: 17 Jun 13
Posts: 8
Credit: 6,548,286
RAC: 0
Message 25755 - Posted: 4 Sep 2013, 19:19:32 UTC

I got more WUs that all show the same problem. This is in the logs (repeated a zillion times):

2013-09-04T21:01:57 CEST | LHC@home 1.0 | Restarting task sd_sixt5_890_1.6_4D_err__26__s__62.31_60.32__8_10__6__15_1_sixvf_boinc2180_0 using sixtrack version 44603 (pni) in slot 30
2013-09-04T21:01:59 CEST | LHC@home 1.0 | Task sd_sixt5_890_1.6_4D_err__26__s__62.31_60.32__2_4__6__65_1_sixvf_boinc2139_0 exited with zero status but no 'finished' file
2013-09-04T21:01:59 CEST | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.


Each WU crashes as soon as it starts and then gets restarted indefinitely (I have never seen anything good come out of that policy, especially the indefinite part, but that is a different discussion...). This problem is 100% reproducible. Needless to say that resetting the project didn't help (does it ever?). I am disabling this project until I see a message here that this is solved or that we have reverted to a different client.
ID: 25755
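To put a number on "repeated a zillion times", the log excerpt above can be summarised per task. What follows is only a minimal sketch: it assumes the client log has been saved as plain text with one message per line in the format quoted above, and the function names `count_restart_loops` and `looping_tasks` are made up for this illustration and are not part of BOINC.

```python
import re
from collections import Counter

# Matches the message quoted in the post above; the task name is the
# whitespace-free token that follows the word "Task".
NO_FINISHED = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def count_restart_loops(log_lines):
    """Return a Counter mapping task name -> number of 'no finished file' exits."""
    counts = Counter()
    for line in log_lines:
        match = NO_FINISHED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looping_tasks(log_lines, threshold=3):
    """Return the sorted names of tasks that exited this way at least `threshold` times."""
    counts = count_restart_loops(log_lines)
    return sorted(name for name, seen in counts.items() if seen >= threshold)
```

Called as `looping_tasks(open("boinc.log"))`, with `boinc.log` standing in for wherever the log was saved, it lists the tasks worth aborting first. The default threshold of 3 is an arbitrary choice for the sketch.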
Tex1954

Joined: 24 Apr 11
Posts: 37
Credit: 1,295,012
RAC: 0
Message 25757 - Posted: 4 Sep 2013, 19:32:49 UTC - in response to Message 25744.  
Last modified: 4 Sep 2013, 20:00:00 UTC

I am getting a completely new experience with many of the new work units. They run to about 0.030% completion in 14 seconds on my computer, and then go back to zero %. The elapsed time in all those cases, as displayed in BOINC manager, also jumps back from around 14 to 4 seconds. Something I did not know was possible. In other words, elapsed time is jumping backwards.

I'm aborting those jobs now, because they just seem to be completely stuck without progress. Because it's my bed time I will suspend LHC for now. Please let me know if and when it is safe to resume LHC again.


I am getting the same problem. I looked in the log file and saw "No Heartbeat from client in 30 seconds, restarting..."

The major symptom is that the elapsed time keeps jumping backward...

As of an hour or so ago, all the new WUs are failing...

I aborted the bad WUs and kept the good ones... NNT (no new tasks) for now until I see the uploads proceed properly.

:)

PS: This problem only shows up on my Linux machines so far... forgot to mention that...
ID: 25757
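The symptom reported so far (elapsed time dropping from about 14 seconds back to about 4) is easy to check for if the elapsed-time readings are noted down at intervals. This is a minimal sketch only: the list of samples is assumed to be collected by hand or by some script of your own, and `backward_jumps` is a hypothetical helper, not a BOINC function.

```python
def backward_jumps(samples):
    """Return (index, previous, current) for every sample lower than the one before it.

    `samples` is a sequence of elapsed-time readings in seconds, in the
    order they were observed. Elapsed time should never decrease, so any
    entry in the result marks a restart or a time-accounting problem.
    """
    jumps = []
    for index in range(1, len(samples)):
        previous, current = samples[index - 1], samples[index]
        if current < previous:
            jumps.append((index, previous, current))
    return jumps
```

For a healthy task the result is an empty list. A task caught in the loop described above produces one entry per restart.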
jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,807,848
RAC: 10
Message 25758 - Posted: 4 Sep 2013, 21:14:03 UTC - in response to Message 25744.  

Good morning from New Zealand.
I just came into the office, where my office computer also crunches LHC overnight. All of the recent tasks had either crashed out with a computation error or were stuck in the same (almost endless) loop.

Example is Task: 19624429

You will see it crashed out with a computation error after too many exits.

I have now seen this problem on 4 different computers at 2 different locations, so I don't think it's just me.

I run Xubuntu Linux on all my machines, versions 12.04 on 3 of them and 12.10 on 1.

Please do let me know when this problem is fixed, because I'm suspending LHC for now.
ID: 25758
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 25759 - Posted: 5 Sep 2013, 2:50:58 UTC

All systems are go on my Linux box, SuSE Linux 12.3, BOINC client 6.10.58. No problem so far.
Tullio
ID: 25759
Tex1954

Joined: 24 Apr 11
Posts: 37
Credit: 1,295,012
RAC: 0
Message 25760 - Posted: 5 Sep 2013, 5:08:43 UTC
Last modified: 5 Sep 2013, 5:09:14 UTC

I tried a few new WUs again... same problem... I run 64-bit Linux Mint on my systems with BOINC client 7.0.65.

Oh well...

:D
ID: 25760
jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,807,848
RAC: 10
Message 25764 - Posted: 5 Sep 2013, 21:00:09 UTC - in response to Message 25760.  

Same here. I tried again this morning on my office computer, after Ubuntu pushed out a network manager update and I had rebooted the computer.

Results from the log file are:
Fri 06 Sep 2013 08:50:17 NZST | LHC@home 1.0 | Task W0w20cbb_w30cbb63__36__s__64.31_59.32__13_13.5__6__3_1_sixvf_boinc16640_1 exited with zero status but no 'finished' file

Fri 06 Sep 2013 08:50:17 NZST | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.

Fri 06 Sep 2013 08:50:17 NZST | LHC@home 1.0 | Restarting task W0w20cbb_w30cbb63__36__s__64.31_59.32__13_13.5__6__3_1_sixvf_boinc16640_1 using sixtrack version 44603 (pni) in slot 1

This repeats endlessly.

I'm running Xubuntu 12.04 (precise) 64-bit.
Linux Kernel 3.2.0-52-generic
BOINC version 7.0.65(x64)

As a side note, one of my home computers was still able to successfully complete an old LHC task this morning. Therefore, the problem is definitely caused by the new version.
ID: 25764
Ray Murray
Volunteer moderator

Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 25765 - Posted: 5 Sep 2013, 22:23:59 UTC

With a couple of 13hr WUs just about to finish, 446.03 is running fine on Win7.
Obviously that's no help to you guys running Linux but at least Eric & co won't need to look for a global issue as the problem with the new version seems to be isolated just to Linux. Hopefully that means they will be able to resolve it sooner or revert to 444 while they fix it.
ID: 25765
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 25766 - Posted: 5 Sep 2013, 23:37:49 UTC - in response to Message 25765.  

With a couple of 13hr WUs just about to finish, 446.03 is running fine on Win7.
Obviously that's no help to you guys running Linux but at least Eric & co won't need to look for a global issue as the problem with the new version seems to be isolated just to Linux. Hopefully that means they will be able to resolve it sooner or revert to 444 while they fix it.

Did you try suspending your task (remove from memory), and starting it again?

The problem could be in the 'restart from checkpoint' area, which means it could still be cross-platform.
ID: 25766
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 25767 - Posted: 5 Sep 2013, 23:57:46 UTC

I have 3 tasks completed and validated on my Linux box, another is running and is at 66% after 14 hours. No problems here with the new version.
Tullio
ID: 25767
jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,807,848
RAC: 10
Message 25769 - Posted: 6 Sep 2013, 4:43:11 UTC - in response to Message 25767.  

Looking a bit more at what is happening. The executable file that gets downloaded on my Linux machine is:
sixtrack_lin64_4463_pni.exe

This is identified as a DOS/Windows executable by the file manager. If that is indeed what it is, then it's not surprising that Linux cannot execute it. It might conceivably be executable if I had WINE installed, but I do not, and I have not tried that. Perhaps other Linux users do have WINE installed and therefore get to execute it anyway, but that is just speculation.

I also note that the executable attribute bit is not set for the .exe file after the download.
ID: 25769
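A file manager may be going by the .exe extension alone, so it can help to look at the first bytes of the file instead. The sketch below relies on two well-known signatures: Linux ELF binaries start with the bytes 0x7F 'E' 'L' 'F', and DOS/Windows executables start with 'M' 'Z'. The helper names `classify_binary` and `has_exec_bit` are made up for this illustration.

```python
import os
import stat

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file
DOS_MAGIC = b"MZ"       # first two bytes of DOS/Windows executables


def classify_binary(path):
    """Return 'elf', 'dos/windows' or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    if header == ELF_MAGIC:
        return "elf"
    if header[:2] == DOS_MAGIC:
        return "dos/windows"
    return "unknown"


def has_exec_bit(path):
    """Return True if any of the user/group/other execute bits is set."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
```

Running both helpers on the downloaded file in the project directory answers the two questions raised above: whether it really is a Windows binary, and whether the executable bit survived the download. The file name plays no part in either check.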
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 25770 - Posted: 6 Sep 2013, 5:08:01 UTC - in response to Message 25769.  

Yes, it looks like a DOS/Windows executable but it runs OK on my Linux box. I don't have WINE installed. The executable bit is set.
Tullio
ID: 25770
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25771 - Posted: 6 Sep 2013, 5:48:44 UTC

Thanks for all the (detailed) feedback. Looks like a Linux problem on
"some" Linux systems. We tested on basically RedHat 6. All this compounded
by a series of unusually short WUs. Top priority to resolve soonest.
Don't want to go back because of an urgent need for the new physics.
(We do not yet have a MAC executable.) Famous last words, but it should
not be too difficult to resolve. Sorry for the hassle. Eric.
ID: 25771
Jari Pyyluoma

Joined: 13 Sep 05
Posts: 4
Credit: 10,495,315
RAC: 0
Message 25772 - Posted: 6 Sep 2013, 6:02:59 UTC - in response to Message 25770.  

I saw this on win7 and ubuntu. I aborted tasks.
ID: 25772
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 25773 - Posted: 6 Sep 2013, 6:25:46 UTC

I tried to run ldd on the executable, but it reports that it is not a dynamic executable. It must be a static one.
Tullio
ID: 25773
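The ldd result can be cross-checked without ldd. A dynamically linked ELF executable normally carries a PT_INTERP program header naming the dynamic loader, and a classic static binary has none. The sketch below only handles 64-bit little-endian ELF files, which is an assumption about the sixtrack build and not something confirmed in this thread. The function name `requests_dynamic_loader` is hypothetical.

```python
import struct

PT_INTERP = 3  # program header type that names the dynamic loader


def requests_dynamic_loader(data):
    """Return True if the ELF image in `data` (bytes) has a PT_INTERP header.

    Only 64-bit little-endian ELF is handled; anything else raises ValueError.
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("sketch only handles 64-bit little-endian ELF")
    # Offsets from the ELF64 header layout: e_phoff at 0x20,
    # e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for index in range(e_phnum):
        # p_type is the first 32-bit field of each program header.
        (p_type,) = struct.unpack_from("<I", data, e_phoff + index * e_phentsize)
        if p_type == PT_INTERP:
            return True
    return False
```

Used as `requests_dynamic_loader(open(path, "rb").read())`, a result of False is consistent with a statically linked binary, which is what ldd reported here.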
jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,807,848
RAC: 10
Message 25774 - Posted: 6 Sep 2013, 7:47:15 UTC - in response to Message 25773.  
Last modified: 6 Sep 2013, 7:47:41 UTC

Out of curiosity, I installed WINE to see whether that made a difference. Unfortunately, it doesn't. Behaviour remains as before: tasks keep looping.
ID: 25774
m

Joined: 6 Sep 08
Posts: 118
Credit: 12,588,679
RAC: 737
Message 25779 - Posted: 6 Sep 2013, 12:59:33 UTC - in response to Message 25766.  

With a couple of 13hr WUs just about to finish, 446.03 is running fine on Win7.
Obviously that's no help to you guys running Linux but at least Eric & co won't need to look for a global issue as the problem with the new version seems to be isolated just to Linux. Hopefully that means they will be able to resolve it sooner or revert to 444 while they fix it.

Did you try suspending your task (remove from memory), and starting it again?

The problem could be in the 'restart from checkpoint' area, which means it could still be cross-platform.


Running OK here on Ubuntu 8.04, its Mint version, Ubuntu 10.04 and 10.10 (W2K and XP, too). All restart correctly after a host power cycle, so presumably the checkpoint mechanism is working.

John.
ID: 25779
Igor Zacharov
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 16 May 11
Posts: 79
Credit: 111,419
RAC: 0
Message 25780 - Posted: 6 Sep 2013, 15:42:43 UTC

There are problems with the Linux version 4463, and they are not related to naming.
I have verified that even though the executables have an .exe extension, they are Linux binaries.
Therefore, you don't need Wine to run them; it is all very standard.
By the way, I have corrected the scripts to name them differently next time.

There seems to be a correlation with BOINC client version 7.x under Linux.
I'm not suggesting going back to an older client version; we must have a clean executable.

We are looking at the compilation of the time accounting routines under Linux,
since there is evidence pointing to the mktime routine.

This will probably take until Monday to resolve. Sorry for the inconvenience.

Igor.
skype id: igor-zacharov
ID: 25780

©2024 CERN