1) Message boards : Theory Application : Theory Task doing nothing (Message 42930)
Posted 7 days ago by CloverField
Post:
Found the source of the issue: it looks like a Squid permissions issue. There are lots of entries in the log saying "permission denied". I just need to wait for some CMS tasks to finish and then I'll rebuild my Squid cache.
2) Message boards : ATLAS application : Squid proxies may need restart (Message 42786)
Posted 2 Jun 2020 by CloverField
Post:
This is also in regard to your post in the Theory thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5431&postid=42775


You may first check your access.log and cache.log.
Do you notice error messages that correspond to your issues?

If not, Squid is most likely running fine and the issues are caused by something else.

If so, you should clear the cache and start fresh.

You may also insert the following line in your squid.conf and do a "squid -k reconfigure".
shutdown_lifetime 3 seconds

This avoids the default 60-second delay when you shut down/restart Squid, but I'm not 100% sure whether changing this timeout requires a full restart. At least Squid will be prepared for the next restart.
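As a rough sketch, the two steps above would look like this, assuming squid.conf sits in the usual /etc/squid/ location (adjust the path for your install, or simply add the line with an editor):

# add the shorter shutdown timeout to squid.conf
echo "shutdown_lifetime 3 seconds" >> /etc/squid/squid.conf

# tell the running squid to re-read its configuration
squid -k reconfigure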


The logs look good to me, and after the Squid restart everything seems to be fine.
I just got some ATLAS tasks though, so I assume they will kill at least one Theory task.
If I get another stuck one, I'll nuke the cache and also do a project reset to see if that solves the issue.
3) Message boards : Theory Application : Theory Task doing nothing (Message 42781)
Posted 2 Jun 2020 by CloverField
Post:
OK, got another one that was just stuck there with the same message.
This time it was not due to task switching.

Could it be due to the squid cache that I set up earlier?

Hopefully this will update to something more helpful than "aborted by user".
https://lhcathome.cern.ch/lhcathome/result.php?resultid=275990643


Just restarted Squid for ATLAS; I'll see if this fixes the Theory issues as well.
4) Message boards : ATLAS application : Squid proxies may need restart (Message 42777)
Posted 2 Jun 2020 by CloverField
Post:
Hi all,

This message is only relevant if you run your own squid proxy server for ATLAS tasks.

After the CERN database outage last week, a problem was seen with the cached information on squid proxy servers all over the ATLAS Grid which can cause tasks to fail. The solution to the problem is to restart the squid service, so if you are running your own squid please restart it in order to avoid potential problems.

The ATLAS-managed squid servers which tasks use by default were restarted earlier today, so if you saw strange failures in tasks between Thursday last week and now this might have been the reason.


By restart, do you just mean squid -k restart, or deleting the cache and starting fresh?
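For context, the two options as I understand them would look roughly like this on a typical Linux install (a sketch only; the cache directory below is the common default and may differ in your squid.conf cache_dir line, and the service name can vary by distro):

# option 1: plain restart of the service
systemctl restart squid

# option 2: stop squid, wipe the on-disk cache, recreate the swap directories, start again
systemctl stop squid
rm -rf /var/spool/squid/*
squid -z
systemctl start squid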
5) Message boards : Theory Application : Theory Task doing nothing (Message 42775)
Posted 2 Jun 2020 by CloverField
Post:
OK, got another one that was just stuck there with the same message.
This time it was not due to task switching.

Could it be due to the squid cache that I set up earlier?

Hopefully this will update to something more helpful than "aborted by user".
https://lhcathome.cern.ch/lhcathome/result.php?resultid=275990643
6) Message boards : Number crunching : How does task switching actually work? (Message 42765)
Posted 2 Jun 2020 by CloverField
Post:
The main issue with that for me is that ATLAS loves to give me 8-core tasks, which then kick 8 other jobs to the side and usually break them.
What I want is for my tasks to say "hey, an 8-core ATLAS task is ready; let 8 more tasks finish and then slot the ATLAS task into the free space." I could limit ATLAS to single-core tasks,
but that kinda defeats the point of a Threadripper, no?


This is a major flaw in the workings of the BOINC scheduler. You could try asking them to sort it, but they're a strange bunch, and you'll have to visit them on GitHub as they don't listen to anyone in the forums.

Although, why are your single-core tasks breaking? The only problem I end up with is stuff not meeting deadlines. The BOINC scheduler is rubbish at that; it leaves things to the last minute, and if you happen to have the computer off or be playing a game etc., you're a bit late sending them back.


It seems to be due to network I/O. It only happens if the task switches within roughly the first ten minutes, or however long it takes the task to configure itself. When ATLAS forces the task swap, the tasks are waiting to fetch something, and when they come back online they are still in that waiting state and just sit there forever.

In the case of Theory they either do something like this or they are completely unresponsive and you can't reach them at all through the VM console.

7) Message boards : Number crunching : How does task switching actually work? (Message 42761)
Posted 2 Jun 2020 by CloverField
Post:
The main issue with that for me is that ATLAS loves to give me 8-core tasks, which then kick 8 other jobs to the side and usually break them.
What I want is for my tasks to say "hey, an 8-core ATLAS task is ready; let 8 more tasks finish and then slot the ATLAS task into the free space." I could limit ATLAS to single-core tasks,
but that kinda defeats the point of a Threadripper, no?
8) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42738)
Posted 1 Jun 2020 by CloverField
Post:
Should a news post be made for the solution to this issue so everyone gets a notice in their BOINC client?
9) Message boards : Sixtrack Application : Internet access OK - project servers may be temporarily down. (Message 42712)
Posted 30 May 2020 by CloverField
Post:
For the last 5 hours I have not been able to send any of the over 100 finished SixTrack tasks from here (PDT).
I can't get new tasks either, but since I planned ahead I still have 645 running.
I couldn't even send in one Theory to -dev, and it can't be because of too many tasks at the same time.
Of course, right now we have lots of SixTrack tasks running and there are supposed to be many more waiting.
I guess I will just watch and see if they finally let me update.


Are you on Windows?
If so, this is the actual issue:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5441
10) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42708)
Posted 30 May 2020 by CloverField
Post:
A lot of people are about to find out about this the hard way.
It turns out a lot of people were using this cert provider.

https://twitter.com/sleevi_/status/1266647545675210753
11) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42702)
Posted 30 May 2020 by CloverField
Post:
Seems to be fixed with the workaround on
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14006&postid=96882

LHC & Rosetta both seem to work. Other projects still work.


Can confirm that this works as well.

Hopefully the BOINC team will be able to get a new build out with the new certs before everything breaks.
12) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42690)
Posted 30 May 2020 by CloverField
Post:
Add NumberFields@home as another project affected.

Unfortunately, opening ca-bundle.crt in Windows only shows the details for the first of the 133 certificates in the bundle. I've been through them all, and - although a few of them have expired - none expired this morning.

Although the COMODO certificate authenticating this website, and the InCommon certificate authenticating the NumberFields and Rosetta websites, all seem to be in order, I've seen a suggestion on the web that certificates may be rejected as expired in some cases when a newer certificate is issued (even if the old one appears still to have time left to run before expiry).
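One way around the single-certificate limitation of the Windows viewer, if openssl and GNU coreutils are available (e.g. on Linux or under WSL), is to split the bundle and print each certificate's expiry date. A rough sketch, run from the folder containing ca-bundle.crt:

# split ca-bundle.crt into one file per certificate (GNU csplit)
csplit -s -z -f cert- ca-bundle.crt '/-----BEGIN CERTIFICATE-----/' '{*}'

# print subject and expiry date for each piece
# (the first piece usually holds only the bundle's header comments and can be ignored)
for f in cert-*; do openssl x509 -in "$f" -noout -subject -enddate; done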


Just noticed this in the Opera browser on Windows 10:
This discussion is fine, but this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5387
which has images, specifically http://cms-results.web.cern.ch/cms-results/public-results/publications/SMP-15-003/CMS-SMP-15-003_Figure_006-a.png
shows: https://www.dropbox.com/s/6qjbvllcsgslvrt/unsecure.jpg?dl=0


I can see the images just fine; however, I am getting a big "not secure" icon in the top left of Chrome.
13) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42677)
Posted 30 May 2020 by CloverField
Post:
I've got the same date in there as well.
14) Message boards : Number crunching : How does task switching actually work? (Message 42671)
Posted 30 May 2020 by CloverField
Post:
Yeah, I plan to build an ATLAS-only box at some point in the future when I retire this computer.
That seems like the easiest way to fix the issue, as I'm not sure if the ATLAS team could adjust the deadlines.
I don't want to mess up their science just so I don't have to check on my computer once a day, lol.
15) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42670)
Posted 30 May 2020 by CloverField
Post:
I also see the same; could it be that the BOINC certificate expired?


But my other three projects (Universe, Milkyway, Einstein) are ok. Only Rosetta and LHC failed.

How do these certificates work? Explain like I'm five (T.M. Reddit)


It's basically a file with a cryptographic key in it that says "hey, you can trust me from xx/xx/xxxx to xx/xx/xxxx".
If the current date falls outside that range, you can no longer trust that connection, and in this day and age most things reject it as insecure.

Edit:

Here is a much better, non-five-year-old explanation:
https://www.entrustdatacard.com/pages/ssl
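If you just want to see those validity dates for a particular site yourself, openssl can show them. A sketch, using lhcathome.cern.ch as the example host:

# fetch the server certificate and print its validity window (notBefore/notAfter)
openssl s_client -connect lhcathome.cern.ch:443 -servername lhcathome.cern.ch </dev/null 2>/dev/null | openssl x509 -noout -dates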
16) Message boards : Number crunching : How does task switching actually work? (Message 42666)
Posted 30 May 2020 by CloverField
Post:
I don't like this task switching; it seems unnecessary. I changed "switch between applications" in BOINC to 100000 minutes (i.e. never), which means: once you start something, finish it!

So I tried messing around with that as well, and as far as I can tell it only applies when you are running multiple projects. Since I'm only running LHC@home, it lets the tasks run to completion.
The actual "issue" seems to be running all the LHC@home applications at once. Since the ATLAS tasks have a much earlier deadline than any of the other applications, they like to jump in the instant another task finishes, and since ATLAS tasks are multi-core they will suspend the other jobs. The other VirtualBox applications really don't like this, and it was causing me to have tons of errored/stuck jobs that I would have to abort manually.
17) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 42664)
Posted 30 May 2020 by CloverField
Post:
I am also getting this.
I think LHC@home's web certs might have expired.
:C
18) Message boards : Theory Application : Theory Task doing nothing (Message 42641)
Posted 28 May 2020 by CloverField
Post:
You have successful tasks for ATLAS, CMS and Theory in the last few days.

If you let only SixTrack and ONE VM task (ATLAS, CMS or Theory) run and keep all other VM tasks suspended, does this task run normally and finish correctly?

There are many SixTrack tasks for the other 31 CPUs atm.


Yeah, this would also work. I've just kinda been more focused on trying to do as much work as fast as possible, lol.

Allowing each VM-based application to run one instance and then filling the rest with SixTrack would probably be the best way going forward.
That, or I build a new computer and dedicate it to ATLAS only, as it seems to be the problem child with its short deadlines.
19) Message boards : Theory Application : Theory Task doing nothing (Message 42625)
Posted 26 May 2020 by CloverField
Post:
All the starts and stops in that last Theory task are actually from when BOINC goes to fetch work. What usually happens is that it gets a bunch of ATLAS tasks back, and since those have an earlier due date it stops whatever is currently running and switches back to ATLAS. This happens multiple times a day and ends up killing my tasks. I think I might be able to fix this by changing the "keep an additional x days of work" setting from .25 to 1; hopefully this keeps enough of a buffer to prevent it from starting and stopping tasks all the time.
That hits the nail on the head.
Since the last Theory update the fictitious estimated runtime went from 100 hours to 10 days.
The best solution would be for Laurence to fix this, but for the time being you may change it yourself by editing Theory_2019_10_01.xml in LHC's project folder.
Change the job_duration value from 864000 to 360000.


I already made that change on your advice over in Number Crunching.
At the time I thought the issue was limited to CMS tasks.

However, it seems that getting 1 day of work and then reducing the buffer to .25 days has fixed the issue, as it effectively stops BOINC from getting new ATLAS tasks.
I could probably get the same result with the "no new work" button.
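For the record, the job_duration edit described in the quoted post can also be done with a one-liner. A sketch only, assuming the Debian/Ubuntu BOINC data directory and project folder name shown below (adjust both for your system) and that 864000 occurs in that one setting only; back the file up first:

cd /var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome
# keep a copy of the original before touching it
cp Theory_2019_10_01.xml Theory_2019_10_01.xml.bak
# change the job_duration value from 864000 to 360000
sed -i 's/864000/360000/' Theory_2019_10_01.xml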
20) Message boards : Theory Application : Theory Task doing nothing (Message 42623)
Posted 26 May 2020 by CloverField
Post:
It's usually a minor problem to run many tasks concurrently but it can become a problem if they change their status.
This happens if you start/restart your BOINC client or even at shutdown when lots of data has to be saved to disk.
Modern computers with lots of cores are more affected as they run more tasks concurrently.

Nobody can really tell what the best combination is on your computer. You'll have to try it out.


So this computer is my main server box; all it does is LHC@home and, every so often, streaming a movie to my TV.
As such, it's configured to run BOINC at 100% and not to suspend when the computer is in use. All the starts and stops in that last Theory task are actually from when BOINC goes to fetch work. What usually happens is that it gets a bunch of ATLAS tasks back, and since those have an earlier due date it stops whatever is currently running and switches back to ATLAS. This happens multiple times a day and ends up killing my tasks. I think I might be able to fix this by changing the "keep an additional x days of work" setting from .25 to 1; hopefully this keeps enough of a buffer to prevent it from starting and stopping tasks all the time.
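For reference, the work-buffer setting mentioned here can also be set in global_prefs_override.xml in the BOINC data directory. A minimal sketch using the 1-day value from this post (the tag names are as I understand BOINC's preference file):

<global_preferences>
   <!-- "store up to an additional X days of work"; the related
        work_buf_min_days tag covers "store at least X days of work" -->
   <work_buf_additional_days>1</work_buf_additional_days>
</global_preferences>

Reload it the same way as any other local-preferences change, e.g. boinccmd --read_global_prefs_override.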

