1) Message boards : Number crunching : Double counting? (Message 49661)
Posted 10 hours ago by computezrmle
Post:
LHC-dev exported this record for user Fardringle at 2024-02-26 14:00 UTC as part of BOINC's standard stats file that everybody can download:
<user>
 <id>590</id>
 <name>Fardringle</name>
 <create_time>1659232859</create_time>
 <total_credit>2700232.728606</total_credit>
 <expavg_credit>3615.616327</expavg_credit>
 <expavg_time>1708955308.710162</expavg_time>
 <cpid>925c5a878eb0215b912cefd9328b799e</cpid>
 <country>United States</country>
 <teamid>86</teamid>
</user>

This is the record all 3rd party stats sites use to create their own pages.
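For anyone who wants to verify such a record, the export can be fetched directly. A minimal sketch, assuming the usual BOINC stats layout under <project_url>/stats/ (file names can differ per project, so check the directory listing first):

# list the available export files (URL/path assumed from the usual BOINC layout)
curl https://lhcathomedev.cern.ch/lhcathome-dev/stats/
# fetch and unpack the user export, e.g.:
curl -O https://lhcathomedev.cern.ch/lhcathome-dev/stats/user.gz
gunzip user.gz
# print the record quoted above
grep -A 9 '<id>590</id>' user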
2) Message boards : Number crunching : Double counting? (Message 49654)
Posted 18 hours ago by computezrmle
Post:
Downloading the user stats file from -dev works fine and it reports the correct credits (at least for my account).


boincstats reports this:

As for LHC@home:
Last update user XML 2024-02-26 07:15:20 UTC (02:04:55 old)
Last update host XML 2024-02-25 07:14:35 UTC (1 day 02:05:40 old)
Last update team XML 2024-02-26 07:15:20 UTC (02:04:55 old)

As for LHCathome-dev:
Last update user XML 2024-01-26 08:51:04 UTC (31 days 01:54:01 old)
Last update host XML 2024-02-25 13:00:03 UTC (21:45:02 old)
Last update team XML 2024-01-26 08:51:04 UTC (31 days 01:54:01 old)
3) Message boards : ATLAS application : 2000 Events Threadripper 3995WX (Message 49637)
Posted 2 days ago by computezrmle
Post:
Read the kernel docs and the systemd docs.
4) Questions and Answers : Unix/Linux : I got a lot of errors does my system is correctly setup. (Message 49633)
Posted 2 days ago by computezrmle
Post:
I also have a VPN always active. Maybe it can also create problems?

Yes.
ATLAS/CMS/Theory download thousands of files per task - mostly small ones but sometimes huge ones.
The project does everything to distribute those files as efficiently as possible via servers/proxies as close to your home location as possible.

If you use a VPN (or, much worse, a network like Tor) those files are forced through detours and bottlenecks.
This makes all efficiency efforts on the project's side useless.
5) Message boards : ATLAS application : 2000 Events Threadripper 3995WX (Message 49632)
Posted 2 days ago by computezrmle
Post:
I personally prefer the bigger jobs.

This has been discussed back and forth, and yes, there are volunteers with fast computers and fast internet running their systems 24/7 (including mine).
Those usually do not have problems even with 2000 eventers.

On the other hand:
- There are lots of computers that are not fast enough to finish large tasks within a reasonable time
- ATLAS (native) does not support suspend/resume, hence tasks start from scratch
- ATLAS generates huge upload files

Together with other points mentioned in the past, those 500 eventers were accepted as a compromise.

As for long setup times:
They are usually shorter if
- the EVNT files are smaller (fewer events)
- a local HTTP proxy is used
- CVMFS is configured to use Cloudflare’s CDN (see the sketch below)
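A minimal sketch of the last two points, assuming a standard CVMFS client on Linux; the proxy hostname is a placeholder and the openhtc.io Stratum-1 aliases are written from memory, so verify them against the project's current CVMFS setup guide:

# /etc/cvmfs/default.local - route CVMFS traffic through a local HTTP proxy
# (placeholder hostname; DIRECT is the fallback if the proxy is unreachable)
sudo tee -a /etc/cvmfs/default.local <<'EOF'
CVMFS_HTTP_PROXY="http://my-local-squid.lan:3128;DIRECT"
EOF

# /etc/cvmfs/domain.d/cern.ch.local - prefer the Cloudflare-fronted Stratum-1s
sudo tee /etc/cvmfs/domain.d/cern.ch.local <<'EOF'
CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@"
CVMFS_USE_GEOAPI=no
EOF

sudo cvmfs_config reload
cvmfs_config probe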


Especially on Linux, a few cgroup tweaks via systemd can ensure that CPU cycles are not lost during an ATLAS setup but are instead given to other running tasks.
This slightly slows down an individual task but increases the total throughput of the computer.
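For illustration only, a sketch of one possible systemd tweak, not necessarily the exact one meant above. It assumes the client runs as boinc-client.service on a cgroup v2 system; the numbers are placeholders:

# give BOINC a lower CPU scheduling weight so other workloads soak up
# cycles an ATLAS setup would otherwise leave idle
sudo mkdir -p /etc/systemd/system/boinc-client.service.d
sudo tee /etc/systemd/system/boinc-client.service.d/cpu.conf <<'EOF'
[Service]
CPUWeight=50
EOF
sudo systemctl daemon-reload
sudo systemctl restart boinc-client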
6) Message boards : ATLAS application : 2000 Events Threadripper 3995WX (Message 49622)
Posted 3 days ago by computezrmle
Post:
According to the logs I checked there were tasks configured to process 100, 400, 500 or 1000 events.
To vary the number of events doesn't seem to be a good decision by the people who submitted the tasks, since it ultimately unbalances BOINC's work fetch calculation, its runtime estimation and its credit calculation.

A while ago some tests were done showing that 500 events per task are a good compromise between the project's needs and what most volunteers can handle without major issues.
7) Message boards : Number crunching : Server status page (Message 49556)
Posted 12 days ago by computezrmle
Post:
Might be caused by an increased Theory error rate and a recent SixTrack batch.
Both require additional BOINC results to be processed per workunit.
Assimilation happens when a workunit is complete.
8) Message boards : CMS Application : tasks now running unusual long time without CPU usage (Message 49541)
Posted 13 days ago by computezrmle
Post:
Log entries like this have been reported for more than 2 years.

There's no need to swamp the forum with posts about them now since
- they obviously never affected the scientific calculation
- they are obviously not related to the recent x509 issue, which has been solved
- they are only visible inside the VM when someone actively looks through the logs there (in this case "StartdLog")
- Ivan, as part of the CMS team, has already mentioned that he is aware of it and has forwarded it to the developers


During a reconnect by your ISP your internet router usually (at least in Germany) gets a fresh public(!) IPv4 address (and a fresh IPv6 prefix).
This does not affect the lifetime of any x509 cert created by CERN, nor does it affect your LAN IPs or CERN's IPs, which are the ones mentioned in the log message.
9) Message boards : CMS Application : no new WUs available (Message 49527)
Posted 14 days ago by computezrmle
Post:
credentials working now?

Looks like they do.
Got a bunch of CMS tasks all running fine and CERN Grafana shows an increasing number of running jobs since 08:12 UTC.
10) Message boards : CMS Application : Could not get X509 credentials (Message 49517)
Posted 15 days ago by computezrmle
Post:
Can this problem with getting a proxy credential from LHC be avoided by installing a local proxy server?

No.
The term "proxy" does not mean "HTTP proxy" in this context.
The error is caused by an essential CERN service not responding (a couple of times until the client gives up).
There's nothing that could be done on the client side.


Toby already posted the most (and only) useful thing to do:
Toby Broom wrote:
I set to NNT, no point to hammer the server for no work. Maybe it will be back on Monday
11) Message boards : Theory Application : jobs stats page (Message 49502)
Posted 16 days ago by computezrmle
Post:
Thanks.

So, the link on the project website also needs to be changed back from mcplots.cern.ch to mcplots-dev.cern.ch

Menu: Jobs -> Theory Jobs
12) Message boards : Theory Application : Website showing list with "bad" sherpa tasks (Message 49497)
Posted 16 days ago by computezrmle
Post:
Although the task in question might indeed have got stuck, the "failed" list is not helpful for deciding whether a task should be killed or not.

That's because one important fact has not been mentioned:
"Therefore no task of them was successful so far..."

Especially when a new mcplots revision starts there are always a couple of runspecs that fail or get lost before they report their first success.
To get removed from the "failed" list a runspec needs to report at least 1 successful result.
13) Message boards : CMS Application : tasks now running unusual long time without CPU usage (Message 49496)
Posted 16 days ago by computezrmle
Post:
Is this the network issue?

No.
The log reports this:
2024-02-09 23:11:31 (119660): Guest Log: [ERROR] Could not connect to eoscms-ns-ip563.cern.ch on port 1094
14) Message boards : Theory Application : Website showing list with "bad" sherpa tasks (Message 49486)
Posted 17 days ago by computezrmle
Post:
few years ago

Might not be useful any more since there were major changes recently.
Most important: the switch from cvm3 to cvm4, which provides a completely different base environment.
According to mcplots the overall failure rate is 1.27 % for revision 2687.
15) Message boards : Theory Application : New native version v300.08 (Message 49431)
Posted 20 days ago by computezrmle
Post:
BOINC's runtime estimation (like the credit calculation) is not really good when real runtimes are highly variable.

In the case of Theory tasks the runtime can be anywhere between a few seconds and 10 days.
At the moment it looks like the runtimes of many tasks are much longer than usual (a couple of days), while some weeks ago many were much shorter.
BOINC usually needs a couple of days, sometimes even weeks, to catch up and adjust the average.

The only thing that helps is to slightly modify BOINC's work buffer size (see the sketch below).
Although a myth claiming otherwise has popped up every now and then for years, app_config.xml does not support a parameter that limits the number of tasks a project server sends.
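As an example of that buffer tweak, a minimal sketch assuming the data directory is /var/lib/boinc-client (adjust path and values to your setup); the same two values can also be set via the Manager's computing preferences:

# write a small global_prefs_override.xml and tell the client to re-read it
# (note: this replaces an existing override file)
sudo tee /var/lib/boinc-client/global_prefs_override.xml <<'EOF'
<global_preferences>
   <work_buf_min_days>0.1</work_buf_min_days>
   <work_buf_additional_days>0.2</work_buf_additional_days>
</global_preferences>
EOF
boinccmd --read_global_prefs_override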
16) Message boards : ATLAS application : Download sometime between 20 and 50 kBps (Message 49399)
Posted 21 days ago by computezrmle
Post:
Take your statement and exchange sender/receiver to get:
"When CERN sends thousands of large files at full speed and only 2 recipients report occasional problems, then it's not a problem at the CERN end."

Do you see the problem?
Without deeper investigation neither statement is valid.

OK, I don't expect you to accept this point of view.
Instead I expect complaints about the "only 2".
Well, replace it with a few more, but compare that with the up to 7.32 k jobs ATLAS was recently running concurrently via non-grid-BOINC, as shown here.
17) Message boards : ATLAS application : Download sometime between 20 and 50 kBps (Message 49389)
Posted 22 days ago by computezrmle
Post:
Downloaded ATLAS*.vdi.gz (1.7 GB) and CMS*.vdi.gz (1.6 GB) from lhcathome-upload.cern.ch to measure the download speed.
Since lhcathome-upload.cern.ch runs on a couple of boxes (guess how many) I tested each of them.
The FQDN is also used for ATLAS EVNT files.

Results:
min average: 1.15 MB/s
max average: 1.69 MB/s

This is less than my internet connection allows but far more than the 20-50 kB/s mentioned in the title.
The speed distribution shown in the monitoring system suggests the server is loaded but not heavily overloaded.
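For reference, such a test can be reproduced with a single curl call; the URL is just a placeholder for one of the vdi (or EVNT) files mentioned above:

# download to /dev/null and print the average speed in bytes/s
curl -o /dev/null -w 'average speed: %{speed_download} bytes/s\n' \
  https://lhcathome-upload.cern.ch/<path-to-file>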


Now let's look at the method a well-known volunteer used a couple of times to solve the problem:
Ok, when such a speed seen, disconnecting networkcard, waiting a minute and reconnect it.
.
.
.
Don't know where it come from, the most download-files have a beginning speed from 20-50 kb/s.
After disconnecting networkcard and activate again, there are mostly 9.000 kBit/s.
.
.
.
When seeing this low speed, making a disconnect of the networkcard and activate again.

These network resets cleared locked connections and freed resources on the computer's network stack, and usually also within the same network segment, here between the computer and the local router.

Be aware, nobody at CERN cleared the server's network stack at the same time.
It depends on the type of reset whether the server becomes aware of it immediately or only after a timeout.
In the latter case the server still keeps some resources reserved for the lost connection.
Nonetheless the server obviously has enough resources to accept a new connection.


Why does it happen only for ATLAS?
Let's be more precise:
It is visible here since each ATLAS task downloads a huge file.
Few people would complain about occasional delays while downloading very small files.
Those delays are mostly not even visible.



And the conclusion?
As already said, just claiming "it's not on my side" is not valid evidence.
Even the tests above are not valid evidence, but they may give some hints that widen the view.
18) Message boards : Number crunching : Getting no work for many many months (Message 49387)
Posted 22 days ago by computezrmle
Post:
Most important:
Focus on your health.
Projects like the one here are much less important.



Run this command in an elevated command window:
bcdedit /set hypervisorlaunchtype off

Then reboot.
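To verify after the reboot (assuming a standard Windows setup), the following should show hypervisorlaunchtype as Off:
bcdedit /enum {current}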

Also recommended:
Update VirtualBox to the most recent version:
https://www.virtualbox.org/wiki/Downloads
19) Message boards : ATLAS application : Download sometime between 20 and 50 kBps (Message 49371)
Posted 23 days ago by computezrmle
Post:
The problem most certainly does not lie outside CERN.

Just a statement, and not valid evidence to blame CERN either.
The relevant point is that a speed drop can happen anywhere between the connection endpoints (a route check like the one sketched below can help narrow that down).

This may be at CERN but it also may be a local problem.
- your computer (network stack)
- your router (temporarily out of resources)
- your ISP
- another ISP somewhere in the middle of the route
- (add other reasons)
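One way to get a rough idea where along the route things start to struggle (mtr has to be installed; the hostname is just an example, here the upload server discussed in the download-speed thread):

# send 50 probes and print a per-hop report with hostnames and IPs
mtr -rwbc 50 lhcathome-upload.cern.ch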
20) Message boards : Number crunching : Computation Errors (Message 49364)
Posted 24 days ago by computezrmle
Post:
A similar VirtualBox error has been described here together with steps to solve it:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6079&postid=49046

Instead of the CMS vdi you need to clean up the Theory vdi.

