1) Message boards : ATLAS application : All tasks failing (Message 50653)
Posted 26 Sep 2024 by hadron
Post:
Not sure if those 6GB tasks are still being sent. I have not gotten a big one in the last few hours.

That's probably because the queue is empty right now.
2) Message boards : ATLAS application : All tasks failing (Message 50651)
Posted 26 Sep 2024 by hadron
Post:
Did you check in VirtualBox Manager (Tools > Media) whether you have any child media marked with exclamation marks?

None
3) Message boards : ATLAS application : All tasks failing (Message 50637)
Posted 26 Sep 2024 by hadron
Post:
Since about 23:30 25 Sept, I have had only one successful task run to completion. All the others have been failing with this in the stderr_txt:
2024-09-25 20:01:36 (15434): 
Command: VBoxManage -q storageattach "boinc_674f437b0a9c5e28" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_3.01_image.vdi" 
Exit Code: -2135228409
Output:
VBoxManage: error: Cannot attach medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_3.01_image.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later
VBoxManage: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component SessionMachine, interface IMachine, callee nsISupports
VBoxManage: error: Context: "AttachDevice(Bstr(pszCtl).raw(), port, device, DeviceType_HardDisk, pMedium2Mount)" at line 785 of file VBoxManageStorageController.cpp

2024-09-25 20:01:36 (15434): 
Command: VBoxManage -q closemedium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_3.01_image.vdi" 
Exit Code: 0
Output:
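
The exit code and the hex code in the log above are the same number written two ways: the negative exit code is the signed 32-bit reading of the error value. A minimal sketch of the conversion follows; the helper name `to_hresult` is hypothetical and is not part of VirtualBox or BOINC.

```python
def to_hresult(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    # Masking with 0xFFFFFFFF maps a negative value onto its
    # two's-complement 32-bit representation.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Exit code reported for the failed storageattach command above.
print(to_hresult(-2135228409))  # 0x80bb0007
```

The result matches the VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) code shown in the VBoxManage error details.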
4) Message boards : CMS Application : no new WUs available (Message 50531)
Posted 28 Jul 2024 by hadron
Post:
Got fresh tasks/jobs and so far all of them are running fine.

I was premature with my comment about all of them being do-nothing tasks.
There were only a few at first (likely old stuff being sent out again for whatever reason), and then the "real" tasks started up here ---
then I got busy doing other stuff, and have only got back here now to report.
5) Message boards : CMS Application : no new WUs available (Message 50528)
Posted 27 Jul 2024 by hadron
Post:
There are new tasks, yes, but so far they all seem to be more of those do-nothing tasks.
6) Message boards : CMS Application : What is this nonsense? (Message 50523)
Posted 26 Jul 2024 by hadron
Post:
hadron it's your computer. 25611 credits
In all your CMS tasks they do not go below 25,000 credits

But not only that, with the previous CMS finished with 150-second CPU calculations, you already had 1150 credits. Your topic comes from a long time ago, not from now.

Your computer receives a x10 credit bonus for the same task performed on another computer.

No, it is not my computer. Credit is handed out by the LHC@H servers, I believe. I think at most, it may be my BOINC client needing to re-calibrate the credit it claims now that a) the tasks require 4 threads and take 12 or 13 hours to complete, and b) for a rather long time, the do-nothing tasks were running on 2 threads and taking only minutes.
Once again, this is not an old issue; it is completely new. I have already said I don't care about what was going on before; I have said more than once my belief that the do-nothing tasks were just the CMS project making sure all the kinks and bugs were out of the system before they restarted the real work.
7) Message boards : CMS Application : What is this nonsense? (Message 50521)
Posted 26 Jul 2024 by hadron
Post:
Hello,
The points don't matter to me. I offer my computer to help with scientific tasks when I am not using it, and I don't expect anything in return. Computer hardware unfortunately becomes obsolete in a matter of years, and a multitude of computers are simply wasted on video games.

Thank you for your input. Points don't matter to me either. I only use them to determine if things are running smoothly. If the recent average for a project is fairly constant over time, then things are probably OK, and don't need much attention from me; if it starts to drop unexpectedly, then I am interested in knowing why.
As for my computer, it is all less than 2 years old. The CPU and system board are Ryzen 9/AM4, so this is certainly not an issue.

You as an individual can decide how you want your machine to work, but what you cannot do is demand that others who want to use your machine adapt to your preferences if those do not match what they need for their results and program.

Of course they have a right to determine minimum requirements. I would simply like to be advised well in advance if those requirements are going to change. CMS tasks used to run on fewer than 4 threads; now they will not. There was, as far as I am aware, no advance warning of this change. It would have been nice to know; then I would not have had over 60 tasks fail with no clear indication of why.

In your case and that of some others, you received tasks that were executed without any calculation work. Others get more than 5000 tasks that all end up aborted, creating enormous work for the servers day after day, 24 hours a day, 7 days a week.

I already addressed this: I assume they were doing final testing to make sure everything worked properly before sending out any real tasks that performed meaningful scientific work.

What I don't understand is why there is a difference in bonus points for the same CPU work time in CMS: if you run it on Linux they give you 30,000 points, and if you run it on Windows they give you 3,000 points... I don't care, but others might....

That is very strange. This is the first I've known about this. Credit for work done should be based on the amount of actual work done, not on which operating system one is using.
8) Message boards : CMS Application : What is this nonsense? (Message 50516)
Posted 26 Jul 2024 by hadron
Post:
The CMS patch activated last night affects the process inside the VM.
It has nothing to do with BOINC (especially the work fetch).
Hence, BOINC related issues are not caused by the CMS patch.

That is less than useless. This is not a BOINC-related issue; this is a CMS issue.

Once again: CMS tasks now will not run on fewer than 4 threads. Why? Maybe Ivan can shed some light on this? Please?
9) Message boards : CMS Application : What is this nonsense? (Message 50513)
Posted 26 Jul 2024 by hadron
Post:
@Ivan
+1


@hadron
Did you really believe that your computer delivered valid results within 30 min/2.5 min while other computers need many hours?
How naive!

In reality your computer got credits for empty envelopes without any scientific payload (for many weeks!).
This has now been stopped.
Like all volunteers you have to respect the requirements and set up your VMs accordingly.
Your choice is to either do so or to leave.

Please tell me then, just why can I not set these tasks to run on only 2 threads? Every task running on only 2 has failed. That has never happened before.
Now, the tasks will not run unless I give them 4 threads. Why?
This does not happen with Atlas; those run just fine with however many threads I allow them to have, from 1 to 8. Why can CMS tasks not be configured the same way?
And no, I was not naive when all those tasks were completing with no real work being done. I just assumed that was because they were testing things to make sure they had got it right. Back then, I could set CMS tasks to run on any number of threads between 1 and 4, and they completed just fine. Now they will not -- they must be given 4 threads, or they will fail.
If all you have to offer is "like it or leave", then I think your help desk "expert" credentials are in serious need of review.
10) Message boards : CMS Application : What is this nonsense? (Message 50510)
Posted 26 Jul 2024 by hadron
Post:
My next question is one I know you will not/can not have an answer for, maeax -- just what gives anyone at CERN the right to dictate how I am allowed to allocate the resources of my computer?

We don't dictate how you allocate your resources. We do specify what resources are required to properly run our simulations.

OK, so are you saying that yes, CMS tasks will only run if they are allocated 4 threads?
If that is true, then you most certainly are dictating how I allocate my resources.
11) Message boards : CMS Application : What is this nonsense? (Message 50508)
Posted 26 Jul 2024 by hadron
Post:
Ok,
but, what if Cern-IT had changed it in the .xml?

You're just speculating. Anyway, that file is overridden by whatever is in the app_config.xml file.
I've set my config back to 2 threads per task to see if the error returns -- and yes, it has. 5 tasks failed within 14 to 16 minutes.
Back to 4 CPUs, even if they do take 12 to 13 hours to run.

My next question is one I know you will not/can not have an answer for, maeax -- just what gives anyone at CERN the right to dictate how I am allowed to allocate the resources of my computer?
12) Message boards : CMS Application : What is this nonsense? (Message 50506)
Posted 26 Jul 2024 by hadron
Post:
2024-07-25 16:12:18 (21088): VM Completion Message: VM expects at least 4 CPUs but reports only 2.

Yes, maeax. I can read, but you haven't even tried to answer my question.
Once again, is this the reason why all those CMS tasks failed after only 14 to 16 minutes?
13) Message boards : CMS Application : What is this nonsense? (Message 50504)
Posted 25 Jul 2024 by hadron
Post:
I have now had 52 CMS tasks in a row fail and all the logs I've checked show these messages:

2024-07-25 15:59:32 (21088): Guest Log: [INFO] Requesting an idtoken from LHC@home
2024-07-25 15:59:33 (21088): Guest Log: [INFO] CMS application starting. Check log files.
2024-07-25 16:11:48 (21088): Guest Log: [ERROR] VM expects at least 4 CPUs but reports only 2.
2024-07-25 16:11:48 (21088): Guest Log: [DEBUG] Volunteer: hadron (806228)
2024-07-25 16:11:48 (21088): Guest Log: [INFO] Shutting Down.
2024-07-25 16:12:18 (21088): VM Completion File Detected.
2024-07-25 16:12:18 (21088): VM Completion Message: VM expects at least 4 CPUs but reports only 2.

So what? Is CMS now demanding that I must run tasks on 4 CPUs?

This is the <app_version> section for CMS:
<app_version>
    <app_name>CMS</app_name>
    <avg_ncpus>2</avg_ncpus>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <cmdline>--nthreads 2</cmdline>
</app_version>
It's been like this since CMS became capable of running on multiple threads, without problem until now.

So I have changed the settings to run the tasks on 4 threads, and so far, things are looking OK.
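
The mismatch described above can be checked mechanically. This is a minimal sketch, assuming the guest log wording quoted in this post and an app_config.xml laid out like the fragment shown. The function names `required_cpus` and `undersized_apps` are hypothetical helpers and are not part of BOINC.

```python
import re
import xml.etree.ElementTree as ET

# Matches the completion message quoted in the post above.
CPU_MSG = re.compile(r"VM expects at least (\d+) CPUs but reports only (\d+)")


def required_cpus(log_text):
    """Return (expected, reported) from the first matching log line, or None."""
    match = CPU_MSG.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def undersized_apps(app_config_xml, minimum):
    """List (app_name, avg_ncpus) for app_version entries below `minimum`."""
    root = ET.fromstring(app_config_xml)
    found = []
    for app_version in root.iter("app_version"):
        name = app_version.findtext("app_name", default="?").strip()
        # Assumption: a missing avg_ncpus is treated as 1 for this check.
        ncpus = float(app_version.findtext("avg_ncpus", default="1"))
        if ncpus < minimum:
            found.append((name, ncpus))
    return found
```

Passing the expected count returned by `required_cpus` as `minimum` flags any app_version entry that would trigger the same error.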
14) Message boards : Number crunching : ALT-F2 Console Interface (Message 50486)
Posted 15 Jul 2024 by hadron
Post:
This is probably something you should be asking on the BOINC website.
15) Message boards : CMS Application : no new WUs available (Message 50469)
Posted 7 Jul 2024 by hadron
Post:
If you have not noticed it, all the CMS tasks are being reported immediately. Check the client_state.xml file in the boinc directory and you will find <report_immediately/> for every one of them. This is something I would not expect to see if the tasks included a data payload.
That makes me wonder if what we are getting right now is some massive test of the software before actual work payloads are sent out. See Ivan's post of 27 June: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4209&postid=50460
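
The check described above can be done without opening the file by hand. This is a minimal sketch, assuming that each task appears in client_state.xml as a `<result>` block and that the flag is written inside that block; the file is scanned as plain text rather than parsed as strict XML. The function name `report_immediately_counts` is hypothetical.

```python
import re

# Assumption: tasks are stored as <result>...</result> blocks.
RESULT_BLOCK = re.compile(r"<result>(.*?)</result>", re.DOTALL)


def report_immediately_counts(state_text):
    """Return (flagged, total): results carrying the flag, and all results."""
    blocks = RESULT_BLOCK.findall(state_text)
    flagged = sum(1 for block in blocks if "<report_immediately/>" in block)
    return flagged, len(blocks)
```

To use it, read the client_state.xml file from the BOINC data directory into a string and pass that string to the function. The count covers every project in the file, not only CMS.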
16) Questions and Answers : Windows : you need Virtualbox installed and virtualbox extension pack or your tasks will be "error" (Message 50445)
Posted 24 Jun 2024 by hadron
Post:
This is a copy of a user's Theory task log. There are many more; all of their tasks end the same way (error). None of them has the extension pack, and they keep receiving and returning failed tasks day after day.

<snip>

2024-06-10 16:36:29 (4644): forwarding host port 57780 to guest port 80
2024-06-10 16:36:29 (4644): Enabling remote desktop for VM.
*****2024-06-10 16:36:30 (4644): Required extension pack not installed, remote desktop not enabled.*****
2024-06-10 16:36:30 (4644): Enabling shared directory for VM.
2024-06-10 16:36:30 (4644): Starting VM using VBoxManage interface. (boinc_a90b14600f290613, slot#5)
2024-06-10 16:36:34 (4644): Error in start VM for VM: -2147467259

Are you sure it's not necessary?

Remote hosts have been enabled on this computer, and apparently that requires the extension pack to be installed. Check the cc_config.xml file to see if
<allow_remote_gui_rpc>0|1</allow_remote_gui_rpc>

has been set to 1.
If you do not need to connect from any other computer, set this value to 0 and read the configuration files (Advanced view in the GUI Boinc manager, Options/Read config files). If you do need to make such connections, just install the correct extension pack. In that case, it may be necessary to restart Boinc.
17) Message boards : Theory Application : taks 411433609 10 days/24h ......error (Message 50385)
Posted 10 Jun 2024 by hadron
Post:
After 10 days, running 24 hours a day, it showed me an error. I have another task that has been running for 3 days, and its period is also estimated to be 10 days.

Should I cancel it or let it happen?

The uploaded details for that failed task show a file transfer error:
2024-06-10 00:58:21 (14460): Status Report: Job Duration: '864000.000000'
2024-06-10 00:58:21 (14460): Status Report: Elapsed Time: '864000.554790'
2024-06-10 00:58:21 (14460): Status Report: CPU Time: '711710.531250'
2024-06-10 00:58:21 (14460): Powering off VM.
2024-06-10 00:58:22 (14460): Successfully stopped VM.
2024-06-10 00:58:22 (14460): Deregistering VM. (boinc_70006df3a91772f6, slot#0)
2024-06-10 00:58:22 (14460): Removing network bandwidth throttle group from VM.
2024-06-10 00:58:22 (14460): Removing VM from VirtualBox.
2024-06-10 00:58:27 (14460): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>Theory_2743-2857705-222_0_r1232496514_result</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>

From this it appears that the task completed successfully, but there was an unrecoverable error in transmitting the results file to the LHC server.
However, that is not certain. The task continued right up until the job limit of 864000 seconds of run time (10 days) and then stopped. There is no indication in the above that the task actually finished processing all the events.
You may be able to find further details in the stdoutdae.txt file in the boinc directory.
You say the other task has been running 3 days (maybe 4 now, since you posted this yesterday), which means it would have been sent to you probably on June 8. All the in-progress Theory tasks listed for you were sent to you after 0600 UTC today, so I do not know what task you might be talking about.
From experience, I can tell you that 3 days is far too early in this kind of situation for you to be making a decision on whether or not to abort the task -- except for one situation. Check the properties for this task and compare the CPU time and the elapsed time. These should be reasonably close to each other. For a task running several days, "reasonably close" might mean one or two hours. If this is the case, let the task keep running. If not, inspect the task properties twice, exactly 5 minutes apart. Each time, write down both times, CPU and elapsed.
Now note the change in the CPU time. It should be within seconds of 5 minutes. If it is not, then the process has essentially stopped doing meaningful work towards completing the task, and you may abort it.
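
The two-sample check above comes down to one ratio: the CPU time gained divided by the wall-clock time that passed between the two readings. This is a minimal sketch with all times in seconds. The 0.9 threshold is an assumption chosen for illustration, and it presumes a task that keeps one thread busy.

```python
def cpu_progress_ratio(cpu_start, cpu_end, wall_start, wall_end):
    """Fraction of the wall-clock interval that was spent on CPU work."""
    wall = wall_end - wall_start
    if wall <= 0:
        raise ValueError("second sample must be taken after the first")
    return (cpu_end - cpu_start) / wall


def looks_stalled(ratio, threshold=0.9):
    """True when CPU time advanced much less than the elapsed time did."""
    # Assumption: 0.9 is an illustrative cutoff, not a BOINC setting.
    return ratio < threshold


# Two readings taken 300 seconds (5 minutes) apart.
healthy = cpu_progress_ratio(1000.0, 1298.0, 5000.0, 5300.0)
stuck = cpu_progress_ratio(1000.0, 1010.0, 5000.0, 5300.0)
print(looks_stalled(healthy), looks_stalled(stuck))  # False True
```

In the first pair of readings the CPU time advanced 298 seconds in 300, so the task is still working. In the second it advanced only 10 seconds, which is the situation where aborting is reasonable.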
18) Message boards : ATLAS application : Download failures (Message 50340)
Posted 5 Jun 2024 by hadron
Post:
Ah, I'm on 7.24.1. Nobody told me there was a version 8 out. You'd think they could put an auto update in there or at least a notification!

I'm running that version, and am not having any problems with file downloads.
19) Message boards : Theory Application : New version v300.20 (Message 50218)
Posted 20 May 2024 by hadron
Post:
/dev/shm is a virtual drive that exists only in memory, used for Linux programs to efficiently pass data to each other -- so a reboot will clear everything in it.
However, I am a firm believer in not rebooting a system unless it is absolutely necessary, and getting rid of a single miscreant lockfile is not IMO sufficient cause to reboot.
So no, I have not rebooted my system since the 11th. In fact, the last time I did reboot was after a system upgrade on May 4.
20) Message boards : Theory Application : New version v300.20 (Message 50216)
Posted 20 May 2024 by hadron
Post:
Problem solved. Fortunately, being an experienced Linux user, I was able to use that nearly-useless post to go digging.

There was a lockfile in /dev/shm that was created not very long before the problems started:

# ls -hal /dev/shm/
total 0
drwxrwxrwt  2 root  root    60 May 19 19:09 .
drwxr-xr-x 22 root  root  4.8K May  4 20:47 ..
-rw-------  1 boinc boinc    0 May 11 09:48 boinc_vboxwrapper_lock_e086e43dd21d28b7

I am at UTC-0600, so that is May 11, 15:48 UTC. The very first Theory task to fail after only 3 minutes was reported at 16:27 UTC.
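
The time conversion can be reproduced as follows. This is a small sketch, assuming the year 2024 (taken from the date of this post) and a fixed UTC-0600 offset.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=-6))  # UTC-0600, as stated in the post

# Lockfile timestamp from the ls output, in local time.
lock_created = datetime(2024, 5, 11, 9, 48, tzinfo=LOCAL).astimezone(timezone.utc)
# Time the first 3-minute failure was reported, already in UTC.
first_failure = datetime(2024, 5, 11, 16, 27, tzinfo=timezone.utc)
gap = first_failure - lock_created

print(lock_created.strftime("%b %d %H:%M UTC"))  # May 11 15:48 UTC
print(gap)  # 0:39:00
```

The lockfile therefore predates the first reported failure by 39 minutes.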

Now, that very same lockfile is mentioned in the stderr.txt of every single failed Theory task, twice in fact:
...
2024-05-19 18:50:33 (23676): Could not set race mitigation lock.
2024-05-19 18:50:33 (23676): Lockname: '/boinc_vboxwrapper_lock_e086e43dd21d28b7'
2024-05-19 18:50:33 (23676): Error: ERR_TIMEOUT
2024-05-19 18:50:33 (23676): Attempts: 48
2024-05-19 18:50:33 (23676): Could not set race mitigation lock in 'create_vm'.
2024-05-19 18:50:33 (23676): Could not create VM
2024-05-19 18:50:33 (23676): ERROR: VM failed to start
2024-05-19 18:50:33 (23676): Powering off VM.
2024-05-19 18:50:33 (23676): Deregistering VM. (boinc_372f2cc2be2d23c7, slot#9)
2024-05-19 18:50:33 (23676): Removing network bandwidth throttle group from VM.
2024-05-19 18:52:04 (23676): Could not set race mitigation lock.
2024-05-19 18:52:04 (23676): Lockname: '/boinc_vboxwrapper_lock_e086e43dd21d28b7'
2024-05-19 18:52:04 (23676): Error: ERR_TIMEOUT
2024-05-19 18:52:04 (23676): Attempts: 48
2024-05-19 18:52:04 (23676): Could not set race mitigation lock in 'deregister_vm'.
2024-05-19 18:52:04 (23676): Warning: Will continue without a lock.
2024-05-19 18:52:04 (23676): Removing VM from VirtualBox.

I believe that, when the task is first being set up, Boinc sets that lockfile for whatever reason the programmers find desirable/necessary, and then is supposed to delete it when it is no longer necessary. It looks like this one didn't get deleted for some reason -- a bug perhaps? One that is only rarely encountered?
This all screams, "Delete the file and try again," so that is what I did. No lockfile, no failed tasks -- and definitely no reboot needed; I now have one task waiting to be reported, and 4 more running quite happily.
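
Finding a leftover lock like this one can be scripted. This is a minimal sketch, assuming the lock files live in /dev/shm and follow the `boinc_vboxwrapper_lock_` naming seen above. The 24-hour cutoff and the function name `stale_vbox_locks` are assumptions for illustration. The sketch only lists candidates and deletes nothing, since a recent lock may be legitimately held by a task that is starting up.

```python
import time
from pathlib import Path


def stale_vbox_locks(directory="/dev/shm", max_age_hours=24.0, now=None):
    """Return (path, age_in_hours) for vboxwrapper lock files older than the cutoff."""
    now = time.time() if now is None else now
    stale = []
    for path in Path(directory).glob("boinc_vboxwrapper_lock_*"):
        age_hours = (now - path.stat().st_mtime) / 3600
        if age_hours > max_age_hours:
            stale.append((path, age_hours))
    return sorted(stale)


if __name__ == "__main__":
    for path, age in stale_vbox_locks():
        print(f"{path}  ({age:.1f} hours old)")
```

Listing /dev/shm may require running as root or as the boinc user, as in the `ls` session above.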


©2024 CERN